Idea for improvement: Instead of directly jumping to the supplied function, jump to something that first calls the user-supplied function, and then munmaps stack and syscalls exit.
The reason I've heard for this ordering (or why BX is out of alphabetic order) is that in 16-bit x86, it's the only "splittable" (into BH and BL) register that can be used to access memory, so it was grouped together with the last 4 (which can also be used to access memory) in order to simplify the logic.
All of the first four registers are splittable and can be used in all addressing modes[1], apart from CISCy instructions such as loop and jcxz, which implicitly use the CX (count) register, and string operations, xlat, imul, etc., which also have implicit register operands.
My personal suspicion is that it's something to do with bit ordering. If you reverse the order of bits from LSB to MSB, then BX and CX switch places.
[1] Indexed addressing modes do need to use BX or BP as a base on 8086. Indexed modes are a bit CISCy though, considering they are often used with lea to do non-memory calculations.
The bits used to represent each register in 8086 machine code, when sorted as binary numbers, come out in this order.
Where the bits go depends on the instruction and addressing mode. They generally follow this pattern; some addressing modes don't support some registers though.
According to Wikipedia, the 8080 already had a B register (its registers were A, B, C, D, E, H and L; the (BC), (DE) and (HL) pairs could be used as 16-bit registers).
So you could use this to implement fork without sharing signal handlers and files? Given that the latter two seem like the biggest issue with fork why is - to my knowledge - nobody using this?
Numerous people are, though not via raw assembly. You can call clone() via glibc and pass all those same flags; you can also call unshare() from the child process. Take a look at the manpage for clone(), which discusses the difference between the raw syscall and the glibc wrapper. The former works like fork(), so if you use it to spawn a thread rather than a process, you need an assembly wrapper to handle running on a new stack and similar after it returns; the glibc wrapper handles that for you, calling a function pointer you provide:
/* Prototype for the glibc wrapper function */
#include <sched.h>
int clone(int (*fn)(void *), void *child_stack,
          int flags, void *arg, ...
          /* pid_t *ptid, struct user_desc *tls, pid_t *ctid */ );

/* Prototype for the raw system call */
long clone(unsigned long flags, void *child_stack,
           void *ptid, void *ctid,
           struct pt_regs *regs);
clone() of a separate process with the various namespace flags (such as creating new network or filesystem namespaces) forms the basis of all Linux container solutions (including Docker and Rocket). clone() of a thread with unusual namespace flags is less common, but not unheard-of.
At least for files, it's also common to just open your files with O_CLOEXEC these days, which (if you're consistent about doing it) protects you in case one of your app's libraries forks and execs another program, and allows you to avoid dropping down to clone().
> So you could use this to implement fork without sharing signal handlers and files?
I think the terminology is a bit confusing here. You can share signal handlers and files, that is, changes in the child are reflected in the parent and vice versa, or you can inherit signal handlers and files: the child starts with a copy of the parent's signal handlers and file descriptor table. Inheritance is the default behavior from fork(). There is no option to reset signal handlers or start with a clean file descriptor table; you have to reset everything yourself.
The Linux approach is very fast, and convenient for most cases. My only regret is that working this way brings some problems, such as signal() behaving differently from most POSIX implementations (e.g. for signal handlers on a thread other than the main one). Not really a problem, but a bit annoying when porting UNIX code to Linux.
The signal behaviour required for POSIX threads is available by specifying the CLONE_SIGHAND flag to clone(2). This is what the glibc pthreads implementation does.
Linux did once have a pthreads implementation that wasn't POSIX-conforming in these sorts of areas ("LinuxThreads"), but those days are long behind us. NPTL has been the standard for at least 10 years now.
In general yes, but the code in this case appears to supply its own version of exit that just executes a SYSCALL instruction. The SYSCALL instruction doesn't need a stack.
Or you just don't have signal-handling functions. Signals only touch the stack if you handle them via functions. The default signal dispositions all either kill the process or do nothing, so they don't touch the stack.