Raw Linux Threads via System Calls

im3w1l · on June 7, 2015

Very nice writeup!

Idea for improvement: Instead of jumping directly to the user-supplied function, jump to a small trampoline that first calls that function, then munmaps the stack and invokes the exit syscall.
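A rough sketch of that trampoline idea, assuming x86-64 Linux and GCC/Clang inline assembly. It uses the glibc clone() wrapper rather than the article's raw assembly, and every name here (thread_start, trampoline, unmap_stack_and_exit, run_self_cleaning_thread) is made up for illustration. The key point is that nothing touches the stack between the munmap and exit syscalls.

```c
#define _GNU_SOURCE
#include <sched.h>
#include <signal.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <sys/wait.h>

#define STACK_SIZE (64 * 1024)

/* Hypothetical descriptor handed to the trampoline. */
struct thread_start {
    void (*fn)(void *);  /* user-supplied function */
    void *arg;           /* its argument */
    void *stack;         /* base of the mmap'd stack */
    size_t stack_size;
};

/* munmap(stack, size) followed by exit(0), using registers only.
 * Assumption: x86-64 Linux syscall ABI (rax = number, rdi/rsi = args). */
__attribute__((noreturn))
static void unmap_stack_and_exit(void *stack, size_t size)
{
    __asm__ volatile(
        "syscall\n\t"                 /* munmap(stack, size) */
        "movl %[nr_exit], %%eax\n\t"  /* the stack is gone from here on */
        "xorl %%edi, %%edi\n\t"       /* exit status 0 */
        "syscall"                     /* exit(0), never returns */
        :
        : "a"((long)SYS_munmap), "D"(stack), "S"(size),
          [nr_exit] "i"(SYS_exit)
        : "rcx", "r11", "memory");
    __builtin_unreachable();
}

/* Entry point of the new thread: run the user function, then clean up. */
static int trampoline(void *p)
{
    struct thread_start ts = *(struct thread_start *)p; /* copy to own stack */
    ts.fn(ts.arg);
    unmap_stack_and_exit(ts.stack, ts.stack_size);      /* does not return */
}

static void set_flag(void *arg) { *(int *)arg = 1; }

/* Returns 1 if the child ran set_flag and its stack is unmapped afterwards. */
int run_self_cleaning_thread(void)
{
    int flag = 0;
    void *stack = mmap(NULL, STACK_SIZE, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS | MAP_STACK, -1, 0);
    if (stack == MAP_FAILED)
        return -1;
    struct thread_start ts = { set_flag, &flag, stack, STACK_SIZE };
    /* CLONE_VM shares the address space; SIGCHLD lets us waitpid(). */
    pid_t pid = clone(trampoline, (char *)stack + STACK_SIZE,
                      CLONE_VM | SIGCHLD, &ts);
    int status;
    if (pid == -1 || waitpid(pid, &status, 0) != pid)
        return -1;
    /* msync() fails on a range that is no longer mapped. */
    int unmapped = (msync(stack, STACK_SIZE, MS_ASYNC) == -1);
    return flag == 1 && unmapped && WIFEXITED(status);
}
```

The parent never munmaps the stack here; the child does it on its way out, so a real implementation would also want signals blocked around that final sequence.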

CrystalGamma · on June 7, 2015

> (notice: a, b, c, d)

Either I am wrong, or the actual order in x86 is a, c, b, d (I remember at least the 32-bit GPRs being encoded in the order EAX, ECX, EBX, EDX)? :D

barrkel · on June 7, 2015

The order is AX CX DX BX SP BP SI DI

userbinator · on June 8, 2015

The reason I've heard for this ordering (or why BX is out of alphabetic order) is that in 16-bit x86, it's the only "splittable" (into BH and BL) register that can be used to access memory, so it was grouped together with the last 4 (which can also be used to access memory) in order to simplify the logic.

barrkel · on June 8, 2015

All the first four registers are splittable and can be used in all addressing modes[1], apart from CISCy instructions like loop and jcxz that implicitly use the CX (count) register, string operations, xlat, imul etc.

My personal suspicion is that it's something to do with bit ordering. If you reverse the order of bits from LSB to MSB, then BX and CX switch places.

[1] Indexed addressing modes do need to use BX or BP as a base on 8086. Indexed modes are a bit CISCy though, considering they are often used with lea to do non-memory calculations.

caipre · on June 7, 2015

What does "order" mean here? Is a register different from any other, except by conventional use?

barrkel · on June 8, 2015

The bits used to represent each register in 8086 machine code, when sorted as binary numbers, come out in this order.

Where the bits go depends on the instruction and addressing mode. They generally follow this pattern; some addressing modes don't support some registers though.
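To make the "sorted as binary numbers" point concrete, here is a small sketch. The table lists the 16-bit registers by their 3-bit encoding, and the decoder handles one instruction form where those bits are easy to see: the one-byte MOV r16, imm16 opcodes 0xB8 through 0xBF, whose low three bits select the destination. The function names are invented for the example.

```c
#include <stddef.h>
#include <stdint.h>

/* 16-bit register names indexed by their 3-bit machine encoding (0..7). */
static const char *const REG16[8] = {
    "AX", "CX", "DX", "BX", "SP", "BP", "SI", "DI"
};

/* Name of the register with the given 3-bit code. */
const char *reg16_name(unsigned code)
{
    return REG16[code & 7];
}

/* Destination register of a "MOV r16, imm16" opcode byte (0xB8..0xBF),
 * or NULL if the byte is not one of those opcodes. */
const char *mov_imm_dest(uint8_t opcode)
{
    if ((opcode & 0xF8) != 0xB8)
        return NULL;
    return reg16_name(opcode & 7);
}
```

For example, mov_imm_dest(0xB9) names CX and mov_imm_dest(0xBB) names BX, which is why BX sorts fourth rather than second.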

hfern · on June 7, 2015

Yes, you're right. Originally there was no B register in generations prior to x86.

masklinn · on June 7, 2015

According to Wikipedia, the 8080 already had a B register: its registers were A, B, C, D, E, H and L, and the (BC), (DE) and (HL) pairs could be used as 16-bit registers.

DasIch · on June 7, 2015

So you could use this to implement fork without sharing signal handlers and files? Given that those two seem like the biggest issues with fork, why is nobody (to my knowledge) using this?

JoshTriplett · on June 7, 2015

Numerous people are, though not via raw assembly. You can call clone() via glibc and pass all those same flags; you can also call unshare() from the child process. Take a look at the manpage for clone(), which discusses the difference between the raw syscall and the glibc wrapper. The former works like fork(), so if you use it to spawn a thread rather than a process, you need an assembly wrapper to handle running on a new stack and similar after it returns; the glibc wrapper handles that for you, calling a function pointer you provide:

    /* Prototype for the glibc wrapper function */

    #include <sched.h>

    int clone(int (*fn)(void *), void *child_stack,
              int flags, void *arg, ...
              /* pid_t *ptid, struct user_desc *tls, pid_t *ctid */ );

    /* Prototype for the raw system call */

    long clone(unsigned long flags, void *child_stack,
              void *ptid, void *ctid,
              struct pt_regs *regs);

clone() of a separate process with the various namespace flags (such as creating new network or filesystem namespaces) forms the basis of all Linux container solutions (including Docker and Rocket). clone() of a thread with unusual namespace flags is less common, but not unheard-of.
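As a hedged illustration of how the flags change behaviour through the glibc wrapper, the sketch below spawns a child that closes a file descriptor. Without CLONE_FILES the child gets its own copy of the descriptor table, so the parent's descriptor stays open; with CLONE_FILES the table is shared and the close is visible in the parent. The helper names, the use of /dev/null, and the 64 KiB stack size are assumptions made for the example.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <sched.h>
#include <signal.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

#define STACK_SIZE (64 * 1024)

/* Runs in the child: close the descriptor whose number arg points at. */
static int close_fd(void *arg)
{
    return close(*(int *)arg) == 0 ? 0 : 1;
}

/* Returns 1 if the parent's descriptor is still open after the child
 * closed it, 0 if it is gone, -1 on error.
 * extra_flags is OR-ed into the clone() flags, e.g. 0 or CLONE_FILES. */
int fd_survives_child_close(int extra_flags)
{
    int fd = open("/dev/null", O_RDONLY);
    if (fd == -1)
        return -1;

    void *stack = mmap(NULL, STACK_SIZE, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS | MAP_STACK, -1, 0);
    if (stack == MAP_FAILED) {
        close(fd);
        return -1;
    }

    /* The stack grows down on x86, so pass the top of the mapping. */
    pid_t pid = clone(close_fd, (char *)stack + STACK_SIZE,
                      extra_flags | SIGCHLD, &fd);

    int result = -1;
    int status;
    if (pid != -1 && waitpid(pid, &status, 0) == pid)
        result = (fcntl(fd, F_GETFD) != -1);  /* is fd still valid here? */

    munmap(stack, STACK_SIZE);
    if (result != 0)
        close(fd);  /* still open (or never handed to a child) */
    return result;
}
```

Calling fd_survives_child_close(0) should report that the descriptor survived, while fd_survives_child_close(CLONE_FILES) should report that it did not.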

pcwalton · on June 7, 2015

At least for files, it's also common to just open your files with O_CLOEXEC these days, which (if you're consistent about doing it) protects you in case one of your app's libraries forks, and allows you to avoid dropping down to clone().
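A minimal sketch of that habit: open with O_CLOEXEC and confirm via fcntl() that the close-on-exec flag is set, so the descriptor will not leak into a program started with exec. The function name and the /dev/null path in the usage note are only for illustration.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>

/* Opens path read-only with O_CLOEXEC and reports whether the
 * close-on-exec flag ended up set: 1 = yes, 0 = no, -1 = error. */
int opens_with_cloexec(const char *path)
{
    int fd = open(path, O_RDONLY | O_CLOEXEC);
    if (fd == -1)
        return -1;

    int flags = fcntl(fd, F_GETFD);  /* read the descriptor flags */
    close(fd);
    if (flags == -1)
        return -1;
    return (flags & FD_CLOEXEC) != 0;
}
```

For instance, opens_with_cloexec("/dev/null") should return 1. Setting the flag atomically at open() time avoids the race you get when another thread forks between open() and a later fcntl(F_SETFD).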

geofft · on June 7, 2015

> So you could use this to implement fork without sharing signal handlers and files?

I think the terminology is a bit confusing here. You can share signal handlers and files, that is, changes in the child are reflected in the parent and vice versa, or you can inherit signal handlers and files: the child starts with a copy of the parent's signal handlers and file descriptor table. Inheritance is the default behavior from fork(). There is no option to reset signal handlers or start with a clean file descriptor table; you have to reset everything yourself.
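Here is a sketch of what "reset everything yourself" can look like for signal handlers after fork(). The child first sees the handler it inherited, then walks all signal numbers and restores the default disposition. The names are invented for the example, and a real program would also deal with file descriptors and the signal mask.

```c
#define _GNU_SOURCE
#include <signal.h>
#include <string.h>
#include <sys/wait.h>
#include <unistd.h>

static void on_usr1(int sig) { (void)sig; }

/* Restore the default disposition for every signal that allows it. */
static void reset_signal_handlers(void)
{
    struct sigaction dfl;
    memset(&dfl, 0, sizeof dfl);
    dfl.sa_handler = SIG_DFL;
    sigemptyset(&dfl.sa_mask);
    for (int sig = 1; sig < NSIG; sig++)
        sigaction(sig, &dfl, NULL);  /* errors (e.g. SIGKILL) are ignored */
}

/* Returns 1 if the forked child inherited the parent's SIGUSR1 handler
 * and then successfully reset it to SIG_DFL; 0 or -1 otherwise. */
int child_resets_inherited_handler(void)
{
    struct sigaction sa, old, cur;
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_usr1;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGUSR1, &sa, &old) == -1)
        return -1;

    pid_t pid = fork();
    if (pid == 0) {
        sigaction(SIGUSR1, NULL, &cur);
        int inherited = (cur.sa_handler == on_usr1);  /* copy of parent's */
        reset_signal_handlers();
        sigaction(SIGUSR1, NULL, &cur);
        int reset = (cur.sa_handler == SIG_DFL);
        _exit(inherited && reset ? 0 : 1);
    }

    sigaction(SIGUSR1, &old, NULL);  /* put the parent back as it was */
    int status;
    if (pid == -1 || waitpid(pid, &status, 0) != pid)
        return -1;
    return WIFEXITED(status) && WEXITSTATUS(status) == 0;
}
```

Because the child only has a copy, resetting its handlers leaves the parent's untouched, which is the inherit-versus-share distinction in practice.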

DasIch · on June 8, 2015

I was indeed confused by the terminology. Thank you for the explanation.

saurik · on June 8, 2015

If the goal is to exec another process, you might want to consider moving to posix_spawn.
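For reference, a small posix_spawn sketch that runs a shell command and returns its exit status. It assumes /bin/sh exists, passes NULL for the file actions and attributes (so defaults apply), and the spawn_shell name is made up.

```c
#include <spawn.h>
#include <sys/types.h>
#include <sys/wait.h>

extern char **environ;

/* Runs `/bin/sh -c cmd` and returns its exit status, or -1 on failure. */
int spawn_shell(const char *cmd)
{
    char *const argv[] = { "sh", "-c", (char *)cmd, NULL };
    pid_t pid;

    /* posix_spawn returns 0 on success and an error number otherwise. */
    if (posix_spawn(&pid, "/bin/sh", NULL, NULL, argv, environ) != 0)
        return -1;

    int status;
    if (waitpid(pid, &status, 0) != pid)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

For example, spawn_shell("exit 7") returns 7. The file-actions and attributes arguments are where you would close or redirect descriptors and reset signal dispositions in the child without writing the fork/exec dance by hand.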

faragon · on June 7, 2015

The Linux approach is very fast and convenient for most cases. My only regret is (or was) that it brings some problems, such as signal() behaving differently from most POSIX implementations (e.g. signal handlers on a thread other than the main one). Not really a problem, but a bit annoying when porting UNIX code to Linux.

caf · on June 8, 2015

The signal behaviour required for POSIX threads is available by specifying the CLONE_SIGHAND flag to clone(2). This is what the glibc pthreads implementation does.

Linux did once have a pthreads implementation that wasn't POSIX-conforming in these sorts of areas ("LinuxThreads"), but those days are long behind us. NPTL has been the standard for at least 10 years now.

faragon · on June 12, 2015

Thank you. I'm glad that got fixed! (I dealt with that stuff more than 10 years ago) :-)

0x0 · on June 7, 2015

How can the thread free its own stack before calling exit? Doesn't it need a stack to call exit safely?

to3m · on June 7, 2015

In general yes, but the code in this case appears to be supplying its own version of exit, that just does a SYSCALL instruction. The SYSCALL instruction doesn't need a stack.
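A minimal sketch of such a stack-free exit, assuming x86-64 Linux and GCC/Clang inline assembly. It is not the article's code: the demo uses fork() instead of a raw thread so the parent can read the status, and the function names are hypothetical.

```c
#define _GNU_SOURCE
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Terminate the calling thread with the exit syscall. The arguments
 * travel in registers (rax = syscall number, rdi = status), so the
 * SYSCALL instruction itself reads and writes no stack memory. */
__attribute__((noreturn))
static void raw_exit(int status)
{
    __asm__ volatile("syscall"
                     :
                     : "a"((long)SYS_exit), "D"((long)status)
                     : "rcx", "r11", "memory");
    __builtin_unreachable();
}

/* Fork a child that leaves via raw_exit(status); return what the
 * parent observes as the child's exit status, or -1 on error. */
int child_status_via_raw_exit(int status)
{
    pid_t pid = fork();
    if (pid == -1)
        return -1;
    if (pid == 0)
        raw_exit(status);

    int st;
    if (waitpid(pid, &st, 0) != pid)
        return -1;
    return WIFEXITED(st) ? WEXITSTATUS(st) : -1;
}
```

Note that SYS_exit ends only the calling thread; glibc's exit() and _exit() use exit_group to take down the whole process.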

0x0 · on June 7, 2015

Is there no risk of an interrupt or a signal happening in between un-mmap'ing and calling exit, that would cause stack pushes?

Amanieu · on June 7, 2015

That is a risk, which is why you need to block all signals using the sigprocmask syscall before unmapping your stack.
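Something along these lines, sketched with the libc sigprocmask() wrapper. A raw thread would issue the underlying syscall directly, and a pthreads program would use pthread_sigmask() instead. The helper names are made up; SIGKILL and SIGSTOP cannot be blocked, but they never run a handler on the stack anyway.

```c
#define _GNU_SOURCE
#include <signal.h>
#include <stddef.h>

/* Block every signal that can be blocked. The previous mask is stored
 * in *old if old is non-NULL. Returns 0 on success, -1 on error. */
int block_all_signals(sigset_t *old)
{
    sigset_t all;
    sigfillset(&all);
    return sigprocmask(SIG_SETMASK, &all, old);
}

/* Returns 1 if sig is currently blocked, 0 if not, -1 on error. */
int is_blocked(int sig)
{
    sigset_t cur;
    /* A NULL new set means "just report the current mask". */
    if (sigprocmask(SIG_SETMASK, NULL, &cur) == -1)
        return -1;
    return sigismember(&cur, sig);
}
```

The intended order at thread exit would be: block_all_signals(NULL), then munmap the stack, then the exit syscall, with no function calls after the munmap.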

geofft · on June 7, 2015

Or you just don't have signal-handling functions. Signals only touch the stack if you handle them via functions. The default signal dispositions all either kill the process or do nothing, so they don't touch the stack.

MarkSweep · on June 7, 2015

It is calling the exit syscall directly using the syscall instruction, which does not push anything onto the stack.