
Raw Linux Threads via System Calls - signa11
http://nullprogram.com/blog/2015/05/15/
======
im3w1l
Very nice writeup!

Idea for improvement: Instead of jumping directly to the supplied function,
jump to something that first _calls_ the user-supplied function, then munmaps
the stack and makes the exit syscall.
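
A rough sketch of that idea in C, assuming x86-64 Linux and the glibc clone()
wrapper rather than the article's raw assembly. The names `thread_start`,
`trampoline`, `unmap_stack_and_exit` and `run_on_new_stack` are made up for
illustration, and the syscall numbers 11 (munmap) and 60 (exit) are the x86-64
ones:

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sched.h>
#include <signal.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>

#define STACK_SIZE (64 * 1024)

/* Hypothetical start record; it lives on the parent's stack in this sketch. */
struct thread_start {
    void (*fn)(void *);  /* user-supplied function */
    void *arg;           /* its argument */
    void *stack;         /* base of the mmap'd stack */
    size_t size;         /* length of the mapping */
};

/* munmap(stack, size), then exit(0), with no stack access in between.
 * x86-64 Linux only: 11 is SYS_munmap, 60 is SYS_exit. The asm overwrites
 * its input registers, which is tolerable only because it never returns. */
static void unmap_stack_and_exit(void *stack, size_t size)
{
    __asm__ volatile(
        "syscall\n\t"            /* munmap(rdi, rsi) */
        "movl $60, %%eax\n\t"    /* SYS_exit */
        "xorl %%edi, %%edi\n\t"  /* exit status 0 */
        "syscall"
        :
        : "a"(11L), "D"(stack), "S"(size)
        : "rcx", "r11", "memory");
    __builtin_unreachable();
}

/* Entry point handed to clone(): call the user function, then clean up. */
static int trampoline(void *p)
{
    struct thread_start *start = p;
    void *stack = start->stack;
    size_t size = start->size;

    start->fn(start->arg);
    unmap_stack_and_exit(stack, size);
    return 0; /* not reached */
}

/* Run fn(arg) on a fresh stack in a clone()d child that shares our memory.
 * Returns the child's exit status, or -1 on failure. */
int run_on_new_stack(void (*fn)(void *), void *arg)
{
    void *stack = mmap(NULL, STACK_SIZE, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (stack == MAP_FAILED)
        return -1;

    struct thread_start start = { fn, arg, stack, STACK_SIZE };
    pid_t pid = clone(trampoline, (char *)stack + STACK_SIZE,
                      CLONE_VM | SIGCHLD, &start);
    if (pid == -1) {
        munmap(stack, STACK_SIZE);
        return -1;
    }

    int status = 0;
    if (waitpid(pid, &status, 0) != pid)
        return -1;
    /* No munmap here: the child already unmapped its own stack. */
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}

/* Example user function for the sketch: sets an int flag to 1. */
void set_flag(void *arg)
{
    *(volatile int *)arg = 1;
}
```

Calling `run_on_new_stack(set_flag, &flag)` should leave `flag` set and the
stack mapping gone, without the parent ever having to free it.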

------
CrystalGamma
> (notice: a, b, c, d)

Either I am wrong, or the actual order in x86 is a, c, b, d (I remember at
least the 32-bit GPRs being encoded in the order EAX, ECX, EBX, EDX)? :D

~~~
barrkel
The order is AX CX DX BX SP BP SI DI
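
To make that concrete: the order is the registers' 3-bit encoding number,
which shows up directly in one-byte opcodes such as `push r16` (0x50 plus the
register number). A small C sketch; the helper names `reg16_name` and
`push_r16_opcode` are invented for this example:

```c
#include <assert.h>

/* 16-bit general-purpose registers, indexed by their 3-bit encoding. */
static const char *const reg16_names[8] = {
    "AX", "CX", "DX", "BX", "SP", "BP", "SI", "DI"
};

/* Hypothetical helper: name of the register with the given encoding (0..7). */
const char *reg16_name(unsigned encoding)
{
    assert(encoding < 8);
    return reg16_names[encoding];
}

/* Hypothetical helper: opcode byte of the one-byte "push r16" form,
 * which is 0x50 plus the register encoding. */
unsigned char push_r16_opcode(unsigned encoding)
{
    assert(encoding < 8);
    return (unsigned char)(0x50 + encoding);
}
```

So `push bx` is 0x53 rather than 0x51, which is where the a, c, d, b order
becomes visible.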

~~~
userbinator
The reason I've heard for this ordering (or why BX is out of alphabetic order)
is that in 16-bit x86, it's the only "splittable" (into BH and BL) register
that can be used to access memory, so it was grouped together with the last 4
(which can also be used to access memory) in order to simplify the logic.

~~~
barrkel
All of the first four registers are splittable and can be used in all
addressing modes[1], apart from CISCy instructions like loop and jcxz that
implicitly use the CX (count) register, string operations, xlat, imul etc.

My personal suspicion is that it's something to do with bit ordering. If you
reverse the order of bits from LSB to MSB, then BX and CX switch places.

[1] Indexed addressing modes do need to use BX or BP as a base on 8086.
Indexed modes are a bit CISCy though, considering they are often used with lea
to do non-memory calculations.

------
DasIch
So you could use this to implement fork without sharing signal handlers and
files? Given that the latter two seem like the biggest issues with fork, why
is nobody, to my knowledge, using this?

~~~
JoshTriplett
Numerous people are, though not via raw assembly. You can call clone() via
glibc and pass all those same flags; you can also call unshare() from the
child process. Take a look at the manpage for clone(), which discusses the
difference between the raw syscall and the glibc wrapper. The former works
like fork(), so if you use it to spawn a thread rather than a process, you
need an assembly wrapper to handle running on a new stack and similar after it
returns; the glibc wrapper handles that for you, calling a function pointer
you provide:

    
    
    /* Prototype for the glibc wrapper function */

    #include <sched.h>

    int clone(int (*fn)(void *), void *child_stack,
              int flags, void *arg, ...
              /* pid_t *ptid, struct user_desc *tls, pid_t *ctid */ );

    /* Prototype for the raw system call */

    long clone(unsigned long flags, void *child_stack,
               void *ptid, void *ctid,
               struct pt_regs *regs);
    
    

clone() of a separate process with the various namespace flags (such as
creating new network or filesystem namespaces) forms the basis of all Linux
container solutions (including Docker and Rocket). clone() of a thread with
unusual namespace flags is less common, but not unheard-of.
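
For anyone who wants to try the glibc wrapper, here is a minimal sketch. It
assumes Linux with glibc and a downward-growing stack; `child_fn`,
`clone_and_wait` and the 64 KiB stack size are arbitrary choices for the
example. It passes no CLONE_* sharing flags, so the child gets its own copies
of memory, files and signal handlers, much like fork(), but starts in the
function you hand over:

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sched.h>
#include <signal.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>

#define CHILD_STACK_SIZE (64 * 1024)

/* Hypothetical child entry point: its return value becomes the exit status. */
static int child_fn(void *arg)
{
    return *(int *)arg + 1;
}

/* Start child_fn through the glibc clone() wrapper and wait for it.
 * Returns the child's exit status, or -1 on failure. */
int clone_and_wait(int value)
{
    void *stack = mmap(NULL, CHILD_STACK_SIZE, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (stack == MAP_FAILED)
        return -1;

    /* Pass the top of the mapping, since the stack grows down.
     * SIGCHLD alone means nothing is shared with the parent. */
    pid_t pid = clone(child_fn, (char *)stack + CHILD_STACK_SIZE,
                      SIGCHLD, &value);

    int status = 0;
    int ok = pid != -1 && waitpid(pid, &status, 0) == pid;

    int rc = munmap(stack, CHILD_STACK_SIZE);
    assert(rc == 0);  /* unmapping our own mapping should not fail */
    (void)rc;

    return (ok && WIFEXITED(status)) ? WEXITSTATUS(status) : -1;
}
```

Adding flags such as CLONE_VM or CLONE_FILES to the third argument is how you
opt in to sharing, one resource at a time.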

~~~
pcwalton
At least for files, it's also common to just open your files with O_CLOEXEC
these days, which (if you're consistent about doing it) protects you in case
one of your app's libraries forks and execs, and allows you to avoid dropping
down to clone().
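
A short sketch of what that looks like, assuming a POSIX.1-2008 system; the
helper names `open_cloexec` and `has_cloexec` are invented for the example:

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <unistd.h>

/* Open path read-only with close-on-exec set atomically at open time,
 * so there is no window in which another thread could fork and exec
 * with the descriptor still inheritable. */
int open_cloexec(const char *path)
{
    assert(path != NULL);
    return open(path, O_RDONLY | O_CLOEXEC);
}

/* Returns 1 if fd has FD_CLOEXEC set, 0 if it does not, -1 on error. */
int has_cloexec(int fd)
{
    int flags = fcntl(fd, F_GETFD);
    if (flags == -1)
        return -1;
    return (flags & FD_CLOEXEC) != 0;
}
```

Setting the flag afterwards with fcntl(fd, F_SETFD, FD_CLOEXEC) also works,
but leaves a gap between open() and fcntl() that O_CLOEXEC avoids.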

------
faragon
The Linux approach is very fast and convenient for most cases. My only regret
is (was) that it brings some problems, such as signal() behaving differently
than on most POSIX implementations (e.g. signal handlers on a thread other
than the main one). Not really a problem, but a bit annoying when porting UNIX
code to Linux.

~~~
caf
The signal behaviour required for POSIX threads is available by specifying the
CLONE_SIGHAND flag to clone(2). This is what the glibc pthreads implementation
does.

Linux _did_ once have a pthreads implementation that wasn't POSIX-conforming
in these sorts of areas ("LinuxThreads"), but those days are long behind us.
NPTL has been the standard for at least 10 years now.
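
As an illustration of how CLONE_SIGHAND fits in: the clone(2) manpage says
CLONE_THREAD requires CLONE_SIGHAND, which in turn requires CLONE_VM. The
sketch below encodes just those two rules. `THREAD_FLAGS` and
`clone_flags_consistent` are names made up here, and the flag set is only a
subset of what a real pthreads implementation passes:

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sched.h>

/* Illustrative thread-like flag set: shared memory, filesystem info,
 * file descriptors and signal handlers, in the same thread group. */
#define THREAD_FLAGS \
    (CLONE_VM | CLONE_FS | CLONE_FILES | CLONE_SIGHAND | CLONE_THREAD)

/* Returns 1 if flags satisfy the two documented dependencies
 * (CLONE_THREAD needs CLONE_SIGHAND, CLONE_SIGHAND needs CLONE_VM),
 * 0 otherwise. It does not check any other kernel rule. */
int clone_flags_consistent(unsigned long flags)
{
    if ((flags & CLONE_THREAD) && !(flags & CLONE_SIGHAND))
        return 0;
    if ((flags & CLONE_SIGHAND) && !(flags & CLONE_VM))
        return 0;
    return 1;
}
```

The kernel rejects inconsistent combinations with EINVAL, so a check like
this only saves you a failed syscall.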

~~~
faragon
Thank you. I'm glad that got fixed! (I dealt with that stuff more than 10
years ago) :-)

------
0x0
How can the thread free its own stack before calling exit? Doesn't it need a
stack to call exit safely?

~~~
to3m
In general yes, but the code in this case appears to be supplying its own
version of exit, that just does a SYSCALL instruction. The SYSCALL instruction
doesn't need a stack.

~~~
0x0
Is there no risk of an interrupt or a signal happening in between un-mmap'ing
and calling exit, that would cause stack pushes?

~~~
Amanieu
That is a risk, which is why you need to block all signals using the
sigprocmask syscall before unmapping your stack.
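
A minimal sketch of that step through the libc wrapper, assuming the mask is
changed on the thread that is about to unmap its stack (on Linux the call
affects only the calling thread). `block_all_signals` and `signal_is_blocked`
are hypothetical helper names:

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <stddef.h>

/* Block every signal that can be blocked for the calling thread.
 * Returns 0 on success, -1 on error. SIGKILL and SIGSTOP cannot be
 * blocked, but neither of them runs a handler on the stack. */
int block_all_signals(void)
{
    sigset_t all;
    if (sigfillset(&all) == -1)
        return -1;
    return sigprocmask(SIG_BLOCK, &all, NULL);
}

/* Returns 1 if signo is currently blocked, 0 if not, -1 on error. */
int signal_is_blocked(int signo)
{
    sigset_t cur;
    assert(signo > 0);
    /* With a NULL set, the call only reports the current mask. */
    if (sigprocmask(SIG_BLOCK, NULL, &cur) == -1)
        return -1;
    return sigismember(&cur, signo);
}
```

A raw thread with no libc would issue the rt_sigprocmask system call itself,
but the order is the same: block first, then munmap, then exit.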

~~~
geofft
Or you just don't have signal-handling functions. Signals only touch the stack
if you handle them via functions. The default signal dispositions all either
kill the process or do nothing, so they don't touch the stack.
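
If you want to check which case you are in, you can query the disposition
without changing it. A small sketch, assuming Linux/glibc where sa_handler
and sa_sigaction share storage; `has_handler_function` and `noop_handler`
are made-up names:

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <stddef.h>

/* Returns 1 if signo is handled by a function (so delivery would build a
 * frame on the thread's stack), 0 if its disposition is SIG_DFL or
 * SIG_IGN, and -1 on error. */
int has_handler_function(int signo)
{
    struct sigaction cur;
    assert(signo > 0);
    /* A NULL new action means "just report the current one". */
    if (sigaction(signo, NULL, &cur) == -1)
        return -1;
    return cur.sa_handler != SIG_DFL && cur.sa_handler != SIG_IGN;
}

/* Example handler for the sketch; it does nothing. */
void noop_handler(int signo)
{
    (void)signo;
}
```

Since signal handlers are shared between threads created with CLONE_SIGHAND,
this only tells you the state at the moment you ask.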

