The post makes it a bit unclear, so for the record: there's no theoretical reason the direction flag should be a problem. Just like any other callee-saved register, the kernel needs to save it and restore it when returning from the signal handler; there's nothing about the direction flag that makes this harder, even if a signal handler is interrupting another signal handler. And before branching to the signal handler function, the kernel (or a libc stub) needs to set the direction flag to 0, just as it sets the registers used for function parameters to the correct values for the function signature and aligns the stack pointer. This is all defined by the ABI specification.
It's just that some kernels fail to do this (or did in the past), probably because the direction-flag requirement is less well known and most programs won't crash if it's neglected.
I'm confused too. On x86, EFLAGS (of which the direction flag is bit 10) is absolutely preserved across signal delivery. It's the spot where the comparison result bits go, so if the issue in the linked blog post was real, it would be impossible to deliver a signal across a test/jump pair, making basically all compiled code signal-unsafe.
I don't know what historical architectures may have had a bug with this, but it's not true of x86. If memset isn't signal-safe on modern x86 Linux, it's surely not because of EFLAGS state management.
There are a whole bunch of almost-correct comments here along with a surprising number of "it's been so long it must be fixed".
The alleged bug is that Linux didn't clear DF on signal entry. (This has nothing to do with what is saved or restored. Flags have to be saved and restored and, AFAIK, always were.) The x86 ABI is crystal clear: C functions are called with DF clear. Neither glibc nor Linux cleared it before calling a signal handler, so there was a bug.
But the bug was fixed in March 2008 for Linux 2.6.25. [1]
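For the curious, here's a minimal probe sketch (my own illustration, assuming x86-64 Linux and GCC-style inline asm, not something from the bug report): deliberately set DF, raise a signal, and check what the handler sees. On a pre-2.6.25 kernel the handler would have observed DF still set; on a fixed kernel it observes DF clear, as the ABI requires.

#include <signal.h>
#include <stdio.h>

static volatile sig_atomic_t df_seen_in_handler;

static void handler(int sig)
{
    unsigned long flags;
    (void)sig;
    __asm__ volatile ("pushf\n\tpop %0" : "=r" (flags));
    df_seen_in_handler = (flags >> 10) & 1;   /* DF is bit 10 of EFLAGS */
}

int main(void)
{
    signal(SIGUSR1, handler);
    __asm__ volatile ("std");   /* deliberately violate the ABI for the test */
    raise(SIGUSR1);
    __asm__ volatile ("cld");   /* restore the state compiled code expects */
    printf("DF seen on handler entry: %d\n", (int)df_seen_in_handler);
    return 0;
}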
POSIX doesn't guarantee that memset is async-signal-safe, that much is clear, and flags like this are even a reasonable rationale. But it sounds like post-2008 memset is, and should be, async-signal-safe on Linux. Are there relevant systems on which it isn't? IME it's impractical to write code that will work on all POSIX-compliant systems, as the standard leaves too much undefined and there is no test suite that will let you determine when your code depends on something beyond the minimum requirements of POSIX.
> As for GNU/Linux, can you assure memset works that way in all hardware platforms supported by Linux?
I don't know, that's why I'm asking.
I'm not going to try to write code portable to all theoretically possible unices - I doubt anyone has managed that for nontrivial programs, and it's too hard to tell. Hell, I'm content to exclude a fair few systems I know exist (CHAR_BIT != 8, non-IEEE FP). If it's supported on all the platforms I've heard of, I'll do it - it's just not practical to hope to be POSIX-compliant in a language-lawyer sense.
> I doubt anyone has managed that for nontrivial programs, and it's too hard to tell.
In the early 2000s, the company I worked for was deploying UNIX-based software for GNU/Linux, FreeBSD, HP-UX, AIX, Solaris and Windows (yes, Windows, not a typo).
Sure, writing a program that works on 6 actually existing unices with test instances (or even just documentation) is relatively easy. Writing one for all possible unices including some that don't exist at the time of writing is a lot harder.
Note that (a) the issue was with clearing the direction flag on signal handler entry, not saving it; and (b) it's since been fixed in the kernel to conform to the ABI (which GCC blindly trusted) [1].
And after reading that thread, I'm not as convinced as I was a few minutes ago that this was an obvious kernel issue. Yes, the kernel mismatched the published ABI, but callee-vs-caller save and setup is not always consistent, and for good reason.
E.g. registers are usually callee-save, since there are many registers and small functions use few of them. (This is what the kernel assumed.) But rarely-set/often-used flags (such as the flag in question) may make more sense as caller-save and even caller-setup, since this reduces save/setup overhead to only the cases where the flag is actually set. (This is what the ABI dictated and GCC assumed.)
First, I don't think that's true nowadays. The discussion points out that this (not restoring DF on signal return) is a kernel bug and should be fixed: https://lkml.org/lkml/2008/3/5/531
Second, the kernel tries to avoid doing too much work in the signal return code. It tries hard to _avoid_ a heavy XSAVE and to preserve only the needed registers. This is the job of the sigreturn(2) syscall, btw: http://man7.org/linux/man-pages/man2/sigreturn.2.html
The gist: if you modify the SS register in the signal handler you are screwed. The only way around it is to install a trampoline using "the famous dosemu iret hack" described here:
> Last one was merged into mainline. Then reversed. Sadly.
> Then applied again: https://github.com/torvalds/linux/commit/6c25da5ad55d48c41b8....
>
> The gist: if you modify the SS register in the signal handler you are screwed. The only way around it is to install a trampoline using "the famous dosemu iret hack" described here:
I'm not sure what you mean. This issue is fixed, so SS works exactly the way you would expect it to, unless you do very strange things indeed in which case you might need to fiddle with uc_flags.
Strange or not, Linux did not restore SS on sigreturn in 64-bit mode. The point of explaining that is to emphasize that the issue described in the blog post is not the only one in this code (the code handling entry to and exit from signal handlers).
Okay, I read your post as meaning that it was still broken.
FWIW, I don't consider it sad that my first attempt was reverted. I inadvertently broke some assumptions that a real program (DOSEMU) was making, and Linux takes backwards compatibility quite seriously. The second version was better.
The site is down again? What's wrong with that site? A while back when I wanted to look something up in the x86_64 ABI the site was down too, for at least a week (I eventually gave up checking). At some point later it came back. I wonder how long it's been down this time.
Interestingly, Xcode's OS X docs used to link to x86-64.org for the AMD64 ABI, but looking right now, the link is removed (but the text remains), I guess because the site was down for so long.
I mean, kernel bug or no, if that's how it actually works, then it isn't actually safe to handle signals while executing these functions. If people are still working to change the code and the standards we hold the code to, then you can't push changes that rely on the new behavior without introducing some truly bizarre race conditions.
I mean, that's pretty optimistic, but perhaps: At some point you can assume that, if people get your updated code in 2017, they'll have an update from many years ago.
Depends on how releases are managed, but that might be much much more practical an assumption.
Seems like a good argument against an architecture having stateful flags that affect the execution of other instructions. Or, at least, against having such flags and not including them in the state saved and restored when switching contexts (including to signals).
No disagreement there. In a new OS that doesn't need 100% POSIX compatibility, I'd eliminate the standard signal model, and only have signalfd (plus a SIGKILL equivalent and a separate stop/continue mechanism similar to SIGSTOP/SIGCONT).
How would you handle being able to trap and fix bad memory access (i.e. hooking SIGSEGV/SIGBUS) with this scheme? Lots of programs use this for various reasons; off the top of my head at least a few generational GCs implement their write barrier by setting pages read-only and noticing the trap and resetting RW on SEGV...
(Yes, I believe something like card marking is more performant than this in several ways on modern CPUs, but then you need to insert card marking instructions into your code generator... trapping SEGV is easy and only happens in the GC.)
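A bare-bones sketch of that write-barrier trick (my own, with made-up names; Linux, a single page, error handling omitted): mark the page read-only, catch the fault in a SIGSEGV handler, record the page as dirty, flip it back to read-write, and let the faulting store restart.

#define _DEFAULT_SOURCE
#include <signal.h>
#include <string.h>
#include <sys/mman.h>
#include <stdint.h>
#include <stdio.h>

#define PAGE_SIZE 4096                 /* assumed, not queried */

static char *heap;                     /* hypothetical GC-managed page */
static volatile sig_atomic_t dirty;    /* a one-page "card" */

static void on_segv(int sig, siginfo_t *si, void *ctx)
{
    (void)sig; (void)ctx;
    uintptr_t page = (uintptr_t)si->si_addr & ~(uintptr_t)(PAGE_SIZE - 1);
    dirty = 1;                                        /* remember the write */
    mprotect((void *)page, PAGE_SIZE, PROT_READ | PROT_WRITE);
    /* returning restarts the faulting store, which now succeeds */
}

int main(void)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_sigaction = on_segv;
    sa.sa_flags = SA_SIGINFO;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGSEGV, &sa, NULL);

    heap = mmap(NULL, PAGE_SIZE, PROT_READ,           /* start read-only */
                MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    heap[0] = 42;                                     /* triggers the barrier */
    printf("dirty = %d, heap[0] = %d\n", (int)dirty, heap[0]);
    return 0;
}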
Also there's all the debugging and process inspection stuff that works via signals, all of which depends on being able to really interrupt code rather than just stuff a message in a queue.
For trapping memory accesses, either something like userfaultfd, or otherwise handling segfaults via a signalfd from another thread with its own independent stack. (I'd also eliminate limitations about using a signalfd only within the same process that would have received the signals.)
Or, if you just need a "dirty page" bit, add a dedicated mechanism for that, which can use hardware features to run much faster without having to trap.
For debugging, we can do much better than ptrace. Linux already has dedicated syscalls to read and write another process's memory. Add some mechanisms to read and write registers, and extend the process stop/continue mechanism to allow single-stepping and stop-on-event (such as stop-on-syscall, or BPF-based filtering). I don't see any reason why debugging a process needs to incorporate signals.
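For concreteness, the "read another process's memory" syscalls mentioned above are process_vm_readv/process_vm_writev; here's a rough sketch (mine, Linux-specific) of reading a few bytes from a target PID without ptrace or signals:

#define _GNU_SOURCE
#include <sys/types.h>
#include <sys/uio.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    if (argc < 3) {
        fprintf(stderr, "usage: %s <pid> <hex-address>\n", argv[0]);
        return 1;
    }
    pid_t pid = (pid_t)atoi(argv[1]);
    void *remote_addr = (void *)(uintptr_t)strtoull(argv[2], NULL, 16);

    char buf[64];
    struct iovec local  = { .iov_base = buf,         .iov_len = sizeof buf };
    struct iovec remote = { .iov_base = remote_addr, .iov_len = sizeof buf };

    ssize_t n = process_vm_readv(pid, &local, 1, &remote, 1, 0);
    if (n < 0) { perror("process_vm_readv"); return 1; }
    printf("read %zd bytes from pid %d\n", n, (int)pid);
    return 0;
}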
How so? The other thread would handle the signal from a well-defined point (reading the signalfd), and the thread that faulted would stop at the point of the fault until the thread processing the signal let it continue.
What if the stopped thread was holding a mutex? The thread handling the signalfd can only safely call async signal safe functions, exactly like a signal handler.
edit: technically of course full reentrancy (implied by async signal safety) is not strictly required, "only" a fully non-blocking implementation of every function called by the signalfd thread.
Ah, I see your concern. Right, if you called a function that took locks and then faulted, the thread handling the fault can't attempt to take the same locks. That should result in far fewer restrictions than "async-signal-safe", though.
You do not know which lock it was holding though; when handling a segfault, for example, you need to be pessimistic and assume that you can't touch any lock (think of the allocator lock).
Async signal safety implies both reentrancy and non-blocking algorithms [1]. You might not need reentrancy, but you do need non-blocking. That's really a significant restriction, as libraries with non-blocking guarantees are rare.
Depending on the nature of your segfault handler, you could either make sure you have any data structures you need already allocated, or allocate out of a separate arena. Or, alternatively, handle the signal in another process entirely.
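A tiny sketch of the "already allocated" approach (names are mine): everything the fault-handling thread might need is carved out of a buffer reserved up front, so it never touches malloc() or any lock the faulting thread could be holding.

#include <stddef.h>

static unsigned char fault_arena[64 * 1024];   /* reserved before any fault */
static size_t fault_arena_used;

/* Only ever called from the single fault-handling thread, so no locking. */
static void *fault_alloc(size_t n)
{
    n = (n + 15) & ~(size_t)15;                /* keep 16-byte alignment */
    if (fault_arena_used + n > sizeof fault_arena)
        return NULL;                           /* degrade gracefully, never block */
    void *p = fault_arena + fault_arena_used;
    fault_arena_used += n;
    return p;
}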
That's an extreme perspective. I've made extensive use of signals in programs that have shipped to hundreds of millions of users. If you're careful, they work fine. Yes, POSIX needs something like NT's vectored exception handlers that could allow multiple users of the same signal to cooperate: but that's an API problem, not something inherently "unfixable" about a program being interrupted and temporarily doing something else for a bit.
Unix signals are an OS attempt at giving the user a portable abstraction over software interrupts (HW interrupts & traps). They confusingly employ much of the same jargon (masking, priorities, bottom halves...).
My intuition, knowing the HW/ASM and the code but not really how POSIX works (I've read Stevens, but I don't have all my answers), is that unices reimplement in SW what the HW does with wires.
Maybe this could be fixed at the HW level (though I guess that would require kernel privileges)?
Something that's always baffled me: why don't CPUs have a "save ALL state" and "restore ALL state" instructions? Why does every new set of CPU registers seem to require an OS update to save them on context switches?
They do, now. On current Intel CPUs, you can use xsave and xrstor to save and load the complete state, including all new state information. Ring 0 code can ask the CPU for the size of that state (via CPUID leaf 0xd), and allocate the appropriate amount of space per task.
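For example, user-space code (or a kernel, using the same leaf) can ask how big the XSAVE area needs to be; a quick sketch assuming GCC/Clang's cpuid.h:

#include <cpuid.h>
#include <stdio.h>

int main(void)
{
    unsigned int eax, ebx, ecx, edx;
    if (!__get_cpuid_count(0x0D, 0, &eax, &ebx, &ecx, &edx)) {
        puts("CPUID leaf 0xD not supported");
        return 1;
    }
    /* Sub-leaf 0: EBX = bytes needed for the features enabled in XCR0,
       ECX = bytes needed if every supported feature were enabled. */
    printf("XSAVE area, currently enabled features: %u bytes\n", ebx);
    printf("XSAVE area, all supported features:     %u bytes\n", ecx);
    return 0;
}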
Oh wow. When did this change? I vaguely seem to remember that as recently as Windows 7 (or was it 8.0?) there was trouble with AVX2 or something, but I can't find the info anywhere at the moment.
An OS can still use xsave incorrectly, such as by hardcoding the expected size rather than detecting it at runtime, or by getting some aspect of the CPUID leaf 0xd enumeration wrong. I wouldn't find it surprising if an initial implementation got one of the details wrong, resulting in a bug that wouldn't manifest until the next time the xsave layout changed.
After some careful reading of the linked bug report, apparently saving the direction flag wasn't the issue. Rather, clearing it upon entering a signal handler was the issue.
When the number of state variables gets bigger, that buffer needs to get bigger. Need some cooperation from the operating system to increase the size of the buffers, that's all.
I don't remember how many there were for AVX, but let's say there are 8 512-bit registers (512 bytes)? and then the equivalent in other kinds of registers, so roughly 1 KiB. Just an order-of-magnitude estimate, nothing I expect to be too accurate.
I'm also not an expert, but I think the bigger cost here is memory latency. Size correlates with, but scales differently from, the switching cost of registers, because it's zero-sum: a register saved is a register waited on, twice. There's also the cost of decoding and executing the instructions to store/restore the register values on both ends.
I don't get it, you have to save and restore all the state on context switches either way. The only question is whether you're doing it with a generic instruction or through some other more specialized instructions. It's not a question of whether they should be saved and restored at all.
That said, I'm confused why you replied to my comment above, since it's totally off-topic. This thread chain was asking how much state exists; maybe you were trying to reply to another thread?
> I don't get it, you have to save and restore all the state on context switches either way.
Actually, no, you don't, at least not on _all_ context switches. What you have to save is what the switched-to routine will overwrite. For something like an interrupt handler (where latency often matters very much), if you know it will only modify, for example, eflags and EAX, then you only need to save eflags and EAX on entry and restore them on exit from the handler. The registers that are not modified remain identical from entry to exit, and time is saved by not pushing/popping them needlessly.
> I don't get it, you have to save and restore all the state on context switches either way.
Actually, you don't. I'm not up to date on what's done in this direction by current kernels, but a while back, a patch was merged that disabled floating point and MMX operations by default for a process, and only enabled them once some FP or MMX instruction led to an exception, which allowed the kernel to avoid saving and restoring FP state for processes that weren't actually doing any FP/MMX operations.
This article seems flat out wrong. Since the direction flag is correctly saved and restored, everything is cool.
When some code is interrupted and the signal or interrupt handler calls memmove, that memmove will set up the direction flag for itself correctly. Its entire execution is nested within the handler. If it is interrupted by a nested interrupt, that nested one will restore the flag.
Now if an implementation of the memset function happens not to care about that flag, so that it changes direction from call to call, that's not a signal or interrupt problem! The flag can have an arbitrary value on entry to memcpy in an ordinary situation not involving threads or signals.
The moral is: never use the looping primitives on Intel without setting up the direction flag, if you care about reproducibility.
It goes without saying that the flag is part of the machine state; any machine context saving mechanism (for async situations) is broken if it neglects that flag.
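To make that rule concrete, a small sketch of mine (x86-64, GCC-style inline asm, not how any particular libc does it): a hand-rolled use of the string primitives establishes DF itself rather than trusting whatever value it inherited.

#include <stddef.h>

static void zero_forward(void *dst, size_t len)
{
    __asm__ volatile (
        "cld\n\t"            /* explicitly clear DF: count upward */
        "rep stosb"          /* store AL into [rdi], rcx times */
        : "+D" (dst), "+c" (len)
        : "a" (0)
        : "memory", "cc");
}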
> The flag can have an arbitrary value on entry to memcpy in an ordinary situation not involving threads or signals.
It cannot have an arbitrary value on entry. The x86-64 ABI mandates that the flag be clear before any function is called (3.2.1 Registers and the Stack Frame: "The direction flag in the %eflags register must be clear on function entry, and on function return.")
OK, there we go, then. A compiler that conforms to the ABI generates code which clears the flag, if necessary. Interrupts preserve the flag. Thus, the flag should not be surprisingly set on entry into memcpy. If it is, there is a bug.
> A compiler that conforms to the ABI generates code which clears the flag, if necessary.
I'm sorry, I do not follow. A conforming compiler expects functions to be called with the flag clear. The non-conforming kernel, on the other hand, does not clear the flag before calling the signal handler, which breaks that assumption.
The terrifying thing is that while the programmer might know that memset() isn't async-signal-safe and not use it in a signal handler, the compiler may blissfully and silently optimize the code to use memset() anyway. Odd crashes or, worse, security leaks may result.
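A quick illustration of the hazard (my own example, not from the article): the handler below never names memset(), but at higher optimization levels GCC and Clang are allowed to recognize the zeroing loop and lower it to a memset() call, so the handler ends up depending on memset()'s async-signal-safety anyway.

#include <signal.h>
#include <stddef.h>
#include <stdio.h>

static char scratch[4096];
static volatile sig_atomic_t fired;

static void handler(int sig)
{
    (void)sig;
    for (size_t i = 0; i < sizeof scratch; i++)   /* may be lowered to memset() */
        scratch[i] = 0;
    fired = 1;
}

int main(void)
{
    signal(SIGUSR1, handler);
    raise(SIGUSR1);
    printf("handler ran: %d\n", (int)fired);
    return 0;
}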
Reminds me a bit of old floating point implemented in software. Some implementations had global scratchpad registers. Very weird things happened if you did any floating point operations in multiple threads.
If memset is actually not async-signal-safe on some target, and a compiler targeting the same does that transformation, the end results are very unsound...
It isn't POSIX, but Linux does have the very useful signalfd, which creates a file descriptor that accepts signals. This is a good solution for some types of programs.
It is an easy solution to the problem discussed above though. Yeah, all the mechanics of setting up your signals still are a PITA, but at least you don't have to worry about async-safe functions anymore.
Really all signalfd does is provide an easy way to handle signals in an epoll driven application. If you have a different kind of main loop, then you are probably back to stuffing messages into queues, or dealing with all of the async-safe BS.
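For anyone who hasn't used it, the pattern looks roughly like this (Linux-only sketch, error handling trimmed): block the signal for normal delivery, then read it as ordinary data from the descriptor.

#include <sys/signalfd.h>
#include <signal.h>
#include <unistd.h>
#include <stdio.h>

int main(void)
{
    sigset_t mask;
    sigemptyset(&mask);
    sigaddset(&mask, SIGTERM);

    /* Block normal delivery so the signal stays queued for the fd. */
    sigprocmask(SIG_BLOCK, &mask, NULL);

    int sfd = signalfd(-1, &mask, 0);
    if (sfd < 0) { perror("signalfd"); return 1; }

    struct signalfd_siginfo si;
    ssize_t n = read(sfd, &si, sizeof si);   /* an ordinary read(); no handler runs */
    if (n == (ssize_t)sizeof si)
        printf("got signal %u from pid %u\n", si.ssi_signo, si.ssi_pid);
    return 0;
}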
Making sure the direction flag is clear on entry and return is a somewhat hidden requirement of stdcall (the Windows calling convention), but setting it in a WindowProc and forgetting to clear it didn't crash anything until Windows XP.
More recently, Vista had (accidentally?) 16-byte-aligned stacks when using OpenMP with MinGW, but the exact same binary crashes on Windows 10 when it tries to use unaligned SSE instructions.
I was under the impression that signals are delivered to a process only when the process is executing a syscall. Given that memset and memmove don't execute any syscalls, they will not be interrupted by a signal handler. Am I wrong?
Async signal safety refers to the functions being called in the handler, not the interrupted code. Also signals can interrupt the code just as normal thread preemption would, not only while the thread is inside a system call.
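A tiny demonstration of that point (my own sketch): SIGALRM lands in the middle of a pure busy loop that never enters the kernel.

#include <signal.h>
#include <unistd.h>
#include <stdio.h>

static volatile sig_atomic_t interrupted;

static void on_alarm(int sig) { (void)sig; interrupted = 1; }

int main(void)
{
    signal(SIGALRM, on_alarm);
    alarm(1);                        /* deliver SIGALRM in about a second */

    unsigned long spins = 0;
    while (!interrupted)             /* no syscalls inside this loop */
        spins++;

    printf("loop was interrupted after %lu spins\n", spins);
    return 0;
}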
Returning from the signal handler occurs in user space. It's the C library's handling of signal save and return that matters here, not what the kernel does.
You're right; "sigreturn" wasn't in early UNIX systems but appeared in 4.3BSD. That job can also be done in user space, but there's a timing window against other signals if done that way.
This (a signal) is the de-facto mechanism for notifying a program that an unexpected event just occurred (typically a SEGV). This mechanism needs to interrupt the normal program flow (because you usually cannot restart an invalid access), possibly interrupting a C-library call (such as malloc, which is why malloc is generally not async-signal-safe). What would you do instead? On Windows there is a built-in "__try / __except" mechanism ("structured exception handling", SEH), which provides exception-like catching in plain C via a proprietary extension. It's probably not that much better a method.
It actually really, really is. At least from a technical perspective on x64. Having language-level lexical scoping of system-level exceptions is incredibly useful and once you've grokked how NT's trap handler, PE .xdata sections, prologues and epilogues all work in concert, signals just seem barbaric.
Here's an example trapping an access violation:
//
// Prefault the page.
//

TRY_MAPPED_MEMORY_OP {
    TraceStore->Rtl->PrefaultPages(PrefaultMemoryMap->NextAddress, 1);
} CATCH_EXCEPTION_ACCESS_VIOLATION {
    //
    // This will happen if servicing the prefault off-core has taken longer
    // for the originating core (the one that submitted the prefault work)
    // to consume the entire memory map, then *another* memory map, which
    // will retire the memory map backing this prefault address, which
    // results in the address being invalidated, which results in an access
    // violation when we try and read/prefault it from the thread pool.
    //
    TraceStore->Stats->AccessViolationsEncounteredDuringAsyncPrefault++;
}
Or a page fault that has occurred against a memory map backed file (because, say, the underlying network drive has been disconnected):
TRY_MAPPED_MEMORY_OP {
    //
    // Copy the caller's address range structure over.
    //
    __movsq((PDWORD64)NewAddressRange,
            (PDWORD64)AddressRange,
            sizeof(*NewAddressRange) >> 3);
    //
    // If there's an existing address range set, update its ValidTo
    // timestamp.
    //
    if (TraceStore->AddressRange) {
        TraceStore->AddressRange->Timestamp.ValidTo.QuadPart = (
            Timestamp.QuadPart
        );
    }
    //
    // Update the trace store's address range pointer.
    //
    TraceStore->AddressRange = NewAddressRange;
} CATCH_STATUS_IN_PAGE_ERROR {
    //
    // We'll leak the address range we just allocated here, but a copy
    // failure is indicative of much bigger issues (drive full, network
    // map disappearing) than leaking ~32 bytes, so we don't attempt to
    // roll back the allocation.
    //
    return FALSE;
}
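For readers who haven't seen SEH before: the TRY_/CATCH_ macros above are presumably thin wrappers around MSVC's __try/__except keywords, roughly along these lines (my sketch, hypothetical names):

#include <windows.h>
#include <stdio.h>

static int touch(volatile int *p)
{
    __try {
        *p = 1;                                   /* may fault */
        return 1;
    }
    __except (GetExceptionCode() == EXCEPTION_ACCESS_VIOLATION
                  ? EXCEPTION_EXECUTE_HANDLER
                  : EXCEPTION_CONTINUE_SEARCH) {
        return 0;                                 /* lexically scoped recovery */
    }
}

int main(void)
{
    printf("touch(NULL) returned %d (0 = access violation caught)\n", touch(NULL));
    return 0;
}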
POSIX cares not one jot about Intel's CPU implementation; the assumption that memset is not async-signal-safe because of something specific to x86 is ludicrous. As of IEEE Std 1003.1-2008, 2016 Edition, memset() and lots of others besides are now async-signal-safe, see http://pubs.opengroup.org/onlinepubs/9699919799/functions/V2.... That's in contrast to IEEE Std 1003.1, 2004 Edition, which has a much shorter list that excludes memset(), see http://pubs.opengroup.org/onlinepubs/007904975/functions/xsh...
Any time anybody tells you that POSIX isn't terrible, the subject of this article is the sort of thing you need to bear in mind.
(For a long time, I thought that literally the only thing that POSIX didn't fuck up is the way they don't have any analogue to MAXIMUM_WAIT_OBJECTS. But then, after I gained more experience with POSIX, I realised that even if I don't understand how, this too was probably also something they got wrong.)
Posix is terrible. But I'm not sure we have anything better.
MAXIMUM_WAIT_OBJECTS is also terrible of course, but even apart from that, Win32 is not great at all. The docs contain so little detail (or, worse, are sometimes even false) that when you start asking serious questions the real documentation is ReactOS, Wine, or even the NT/2000 source leaks, and then IDA. Compared to that, POSIX is actually documented, and implemented mostly correctly by tons of OSes. It's easy to ship an API with basically no spec (so nobody can tell you there's a bug when they find one - actually, even if there were a real spec for Win32, I don't think there's any public way to report bugs in it to MS!) and that doesn't have to be compatible with any other implementation...