
Fork() without exec() is dangerous in large programs - zdw
http://www.evanjones.ca/fork-is-dangerous.html
======
caf
In my opinion, it's libraries that secretly use threads under the covers (but
still access state shared with the rest of the program, like the malloc heap)
that are the real danger.

~~~
chubot
Yes. Libraries should provide algorithms and data structures, and parameterize
a threading interface if necessary (though I think it's usually not
necessary).

Applications can then instantiate their own threads and pass them to the
library (i.e. dependency injection.)

So if you have 10 different concurrent libraries in an application, then you
don't end up with 10 different threading paradigms or 10 different thread
pools. This is a matter of application architecture -- libraries shouldn't
bake in a specific policy.

~~~
TickleSteve
true...

...this applies not just to threading, but other things that are not in the
domain of the library such as memory allocation strategy or communication
mechanism.

Libraries should be small, focused and decoupled. This is the way to proper
re-usability.

~~~
amag
Like left-pad?

~~~
TickleSteve
jeez, no.... extremes of everything are terrible.

I suppose another rule should be that there libraries should encapsulate a
sufficient amount of complexity to make their existence worthwhile.... unlike
left-pad.

------
rdtsc
Yap. Erlang 19 (latest version) switched to using a smaller spawner executable
and spawns all OS processes from there by forking. So it forks something
restricted and small, not the whole VM.

Here are the details of how it works:

[https://github.com/erlang/otp/blob/a5256e5221aff30f6d2cc7fab...](https://github.com/erlang/otp/blob/a5256e5221aff30f6d2cc7fab4875914ae734217/erts/emulator/sys/unix/erl_child_setup.c#L21-L49)

They also claim a 3-5x speedup for launching external commands because of it.
So there is a nice performance boost as well.

So basically you can add a 4th strategy -- fork a small helper program once at
the start, then fork from there from then on.

~~~
evanj
Thanks for pointing this out! Interesting that the code states that part of
the concern is increasing memory usage. I don't quite understand why it would
"duplicate the memory usage" since fork uses copy-on-write?

Anyway, I didn't explain it well, but this was what I was trying to get at
with "fork a worker at the beginning of your program, before there can be
other threads."

~~~
the_why_of_y
Copy-on-write doesn't work the same on every OS... Linux uses over-commit so
it's not a big deal, but Solaris (and NT) don't over-commit so you actually
need enough VM at the time when you call fork or fork will fail, so you may
need to provide a lot of swap space on those systems to successfully call fork
in large processes.

~~~
evanj
Ah thanks, that makes sense. Fun trade-offs: Either run out of memory when you
call fork, or fight with the out-of-memory killer at a later time :)

------
qwertyuiop924
In some programs, fork(2) is the right way to do concurrency. It's simple
(API-wise), and it makes IPC explicit, as opposed to the implicit IPC of
threads. clone isn't posix, so you can forget it, if you don't feel like being
behind systemd in the "screw any system that isn't Linux" line.

In places where you need the speed, threads are useful, but they're harder to
use than forks.

Annoyingly, threads and forks don't work well together. zzzcpan already talked
about how to fork threaded code. As for the problem of libraries using
threads, if a library you're using is using threads, and it's documented (it
probably is), and you didn't know about it, I'll pencil that in as your fault.

Other comments in this thread have dismissed fork(2) as a bad job entirely,
but I don't think I agree. It's an effective way to do simple multiprocessing,
and it's a lot simpler than threads in many contexts.

------
kostyash
It sounds like if you use threads and fork together it might be a disaster.
It's true, but the author blames only fork and skips the second half of the
problem, threads. This is not fair, because fork is much older than POSIX
threads. In fact, POSIX threads were poorly designed to work with fork.

~~~
evanj
Fair point. My opinion is that in today's age of multi-core CPUs, shared
memory concurrency is extremely useful for making efficient use of computing
resources. As a result, I find threads to be unavoidable in most large systems
I've dealt with recently.

~~~
kostyash
Agreed, multi-threading is probably the best choice for CPU-intensive
applications. But for databases or HTTP servers, multi-process +
coroutines/fibers can be a better solution. For example, Redis is a few-thread
application. It creates additional threads only for disk I/O, because
unfortunately file descriptors in Linux/UNIX do not work in async mode.

Btw, Redis reminds me of one really awesome use of fork. To make a snapshot of
itself, Redis forks the main process. After that the child simply goes through
the tuples and writes them to a file without any worry that somebody will
modify a record; the copy-on-write mechanism simply prevents it.

Redis save snapshot code:
[https://github.com/antirez/redis/blob/unstable/src/rdb.c#L99...](https://github.com/antirez/redis/blob/unstable/src/rdb.c#L996)

------
zzzcpan
> How to use fork safely: 1. Only use fork to immediately call exec. 2. Fork a
> worker at the beginning of your program, before there can be other threads.
> 3. Only use fork in toy programs.

4\. Stop writing broken multithreaded code altogether or if you must at least
run an event loop per core/thread and use a wrapper for fork() to put the
system into a fork()able state before forking. It's nice and reliable.

Multithreading by itself is just not a high-level concept to be used reliably
by programs.

~~~
valarauca1
Fork/Exec is only good if you are going to run a _different_ program than the
one you are currently executing.

If you want a series of worker threads that are _effectively_ the same as the
master/initial thread, the proper way to do this is clone(2), not fork or
fork/exec.

When used properly clone allows your group of threads to share a single PID,
and TGID (Thread Group IDentifier). This cuts down on kernel resources your
process(es) are using. They can natively share file descriptors and memory
between each other (improving cache coherence). Also their signals are handled
globally for all threads at once, rather than each thread managing its own
signals as fork/exec would result in.

~~~
qwertyuiop924
...And why would you want that? The advantage of fork(2) over clone is that
you CAN'T share those data structures, making it harder to write code that
results in a deadlock: you have to share explicitly.

As for the kernel resources for the process, you'd be surprised how large RAM
has gotten these days...

~~~
valarauca1

      ...And why would you want that?

All your applications share the same virtual memory space. So you have a much
higher hit rate in your TLB than without. It greatly lowers the chance that
shared memory will leave L3 cache.

Cache hit rates are very important...

      The advantage for fork(2) over clone is that you CAN'T share those data structures, making it harder to write code resulting in a deadlock

Deadlocking has nothing to do with how RAM is accessed or how virtual memory
is partitioned; it has to do with how you are managing your locking. Modern
memory/instruction re-ordering is stupidly fast. If you are locking
memory/data structure locations you are likely doing something wrong.
Concurrent memory fences are around 3 orders of magnitude faster than locks.

      As for the kernel resources for the process, you'd be surprised how large RAM has gotten these days...

TLB/L2/L3 cache is still premium real estate. RAM size is awesome! I love
waiting 100,000 cycles to get the page I requested.

~~~
qwertyuiop924
Cache hit rates aren't as important as you seem to think. If you're not
google, and your app doesn't have hard/soft realtime constraints (videogames,
flight control systems, etc.) your app will probably be fast enough. You'd be
surprised how fast computers are nowadays...

      If you are locking memory/data structure locations you are likely doing something wrong.

Oh, concurrent memory fences! Well, that solves all my problems. Let's see.
All I have to do is give up all hope of my code ever being portable to
multiple architectures, and then dig around in asm to add support for them in
my environment, which can range from mildly annoying (C) to near-impossible
(JVM). Or I can use fork(2), and do what programmers have been doing for
decades: trading performance for simplicity and productivity.

That'll be a _really_ hard decision.

~~~
valarauca1

        All I have to do is give up all hope of my code ever being portable to multiple architectures

You know nothing about modern compiler atomics.

Post-2011 compilers (LLVM, GCC, MSVC, ICC) standardized generic memory fences
for C++ and C. These are fully portable, as the compiler itself determines how
the layout of acquires/releases needs to change on the platform you are
compiling to.

Atomics are supported on ARM, x64, MIPS64, SPARC64, POWER8, POWER9...

So they're very portable. The compiler even manages removing unneeded fences:
for example, an acquire fence you add after a CAS load, since the CAS load is
already an acquire fence on x64 (but not on POWER8).

~~~
qwertyuiop924
Okay, what about HLLs? Memory fences don't work if you aren't writing C. And
as I've said, forks are an easier way to get these guarantees, albeit at a
moderate perf cost.

~~~
valarauca1
Seeing as this is a conversation thread about raw system calls, I assumed we
were talking about C, not a HLL.

Are you just a professional contrarian?

Why are you concerned about system calls in a HLL?

Why are you concerned with concurrency guarantees in a HLL?

You care about none of these, you care about your run time. This is why you
are working in a HLL.

~~~
qwertyuiop924

      Are you just a professional contrarian?

No.

      Why are you concerned about system calls in a HLL?

Because in many HLLs, fork(2) is the best (or sometimes the _only_) option
for concurrency. As in C, it is certainly the simplest.

Many things written in HLLs use fork. Unicorn is perhaps the most famous
example.

      Why are you concerned with concurrency guarantees in a HLL?

Because if I have multiple threads of execution, I want them to run in
parallel if possible, and I want to minimize the risk of deadlocks. fork(2)
does both.

      You care about none of these, you care about your run time. This is why you are working in a HLL.

No. I care about them, as demonstrated above. And I don't know what you mean
by caring about my run time. I would go so far as to say that the above is an
insult to me and everybody else who uses an HLL.

I care about getting stuff done. If C is the right tool for that, I'll use it.
If an HLL is the right tool, I'll use it. And if syscalls are the right tool
for the job, I'll use them, too.

------
antirez
If you are in control of all the threads that an application is running, you
can call fork() safely by making sure the threads are put in a safe state (no
critical locks held) when the fork() call happens. Also note that many things
that should normally be unsafe, like having running threads calling malloc()
while another thread is forking, are actually safe in the real world with
certain implementations of fork, since there are pre/post-fork "hooks" in the
malloc implementation to fix up the state of the child.

So if you control very closely the libs you link with and what they do, as
well as the threads you use yourself, it is possible to use fork() in a
reliable way.

~~~
tankenmate
glibc attempts this using the pthread_atfork() hook. it takes the global
malloc (and possibly other) locks in the parent before forking, then both the
parent and child release the locks before returning to the caller's code.
obviously if you have locks in your own code then you may or may not need to
take these before calling fork() so that both parent and child can release
them after the fork().

the apple way of just throwing its hands in the air and abort()ing sounds like
either giving up because the apple devs think it is too hard, or feeling the
need to impose their way by crippling code that uses fork() without calling
exec().

in my opinion if you are going to write a library that uses threads and is
supposed to be thread safe then remember you need to handle the issue of
fork(); don't be lazy, do the job properly, even if doing it right is hard
and ugly.

------
ambrop7
The fork-exec code in my program currently does some things that are
supposedly unsafe, but I can't figure out what to use instead of
initgroups(), which is not async-signal-safe. I want to run the child as a
different user, for which I do initgroups+setgid+setuid between fork and
exec. The only solution I see is to run getgrouplist() before the fork, then
in the child use setgroups() instead of initgroups(), but both of these
functions are non-standard.

EDIT: Never mind, seems like initgroups() is also nonstandard but generally
available on Unix-like systems.

------
datenwolf
Hmm, suppose we wanted to keep fork(). Then the only safe way to deal with
this would be for critical sections to actually be transactions implemented
at the OS level, and after fork() in the child all transactions in flight
would get rolled back before returning from fork().

I see two implementation challenges with that suggestion:

1.) implementing that transaction mechanism as a kernel feature: when entering
a CS mark all pages CoW, upon leaving the CS merge modified pages (problem:
whole pages are then mutually exclusive; dealing with this is the challenge)

2.) battling with user space implemented locks that use atomics.

\-----

An immediate mitigation I see is that fork() itself is a CS on _all_ the
locks of a process. If we assume that only the standard locking mechanisms
are used, then whenever a CS is entered (which includes the creation of a
locking primitive) it raises/posts a global fork-lock semaphore, and upon
leaving, that semaphore is lowered.

This still leaves the DIY-locking primitives problem open. But it should be
more or less straightforward to add this to the system libc/pthread libraries'
locking primitive implementation and fork() syscall wrappers.

Or did I miss something essential here? Talk is cheap, so if nobody has any
obvious objections I'd actually go ahead and implement it.

EDIT: Okay, one immediate problem I see is that this would pose a challenge
for calling fork inside a CS. Technically this is a situation where
thread-recursive locks would help, but as we all know, recursive locks are
highly problematic.

~~~
the_why_of_y
Sounds like you are trying to re-invent Software Transactional Memory - was a
nice idea 10 years ago, but after lots of research it is still not available
in common runtimes...

Some more problems to consider:

1\. how do you roll back IO that happens in a critical section?

2\. your global semaphore will be highly contended and basically destroy
scalability

~~~
datenwolf
re 1) good point, didn't think of that

re 2) the global fork lock satisfies every requirement of a
multiple-readers/single-writer lock, with fork() being the only writer.
Unless the program does a lot of fork()-ing it should hardly ever run into
contention. The moment fork() waits on the lock, further read-lock attempts
are delayed until after fork() completes, and the existing readers are waited
for before fork()-ing.

------
ahh
Endorsed.

Even fork _with_ exec can be real trouble. This is one of my bugaboos at work:
due to poor life choices and high pain tolerance, I own the infrastructure we
use to spawn subprocesses (carefully.) For various reasons (security most
notably) we have to do some very tricky things in and to the forked child
before exec(), and pretty much all of this code is a disaster waiting to
happen. Every so often I get feature requests for more stupid pet tricks
people would like out of subprocesses, and they're always surprised by what
their "simple" change would entail.

I'd like it if Linux had native support for posix_spawn, but even that would
require a lot of extensions to be useful.

Don't get me started on the teams that want to break forking rules and thus
ask me how to guarantee a process has no non-main threads. There are few ways
you can make me more upset than by building software that breaks if some one
else happens to call pthread_create and doesn't tell you.

------
saynsedit
Face the fact, fork() is fundamentally flawed!

~~~
ahoka
I would like to see the reasons this comment was down-voted. It is true that
fork(2) is a simple and elegant design, but it involves huge accidental
complexity and hidden implications, as in the original article. There is a
reason that Linux has clone(2) underneath.

~~~
the_mitsuhiko
> It is true that fork(2) is simple and elegant design

What exactly is simple and elegant about it? Have you looked into how much
tooling is necessary everywhere in unix to make it work? It's insane. It has a
huge footprint and it does not provide standardized APIs to make it work for
the non-covered cases (the best we have is pthread_atfork, which is not
portable).

~~~
lmm
> What exactly is simple and elegant about it?

Spawn-style APIs have to take a billion parameters that are mostly set to
defaults, for things like working directory, environment variables, user ID,
controlling terminal. Whereas fork+exec means you can express these things in
a more compositional, buildery style: fork, change the two things you actually
need to change, then exec.

~~~
MichaelGG
I don't think this makes the case for elegance at all.

Global vars set wherever in a program are preferable to explicit params when
creating a new process? And there's no way to have default params or anything?

~~~
lmm
> Global vars set wherever in a program are preferable to explicit params when
> creating a new process?

The outside world is always going to be implicit global mutable state;
spawning an arbitrary process to do who-knows-what is inherently that kind of
problem. If you could express it as an actual function you wouldn't need to
run a separate process.

> And there's no way to have default params or anything?

Well not in C. You can sort-of do it by having a struct with fields that you
fill in where all-bits-zero is the default but that's generally more trouble
than it's worth.

~~~
MichaelGG
Well by that reasoning there's never any need for function purity since you
can't eliminate global state on current computer designs.

C easily allows you to have multiple functions. exec_with_current_state or
something. And a struct is a way, too.

~~~
lmm
> Well by that reasoning there's never any need for function purity since you
> can't eliminate global state on current computer designs.

If you care about purity you don't use spawn-like functionality at all.

> C easily allows you to have multiple functions. exec_with_current_state or
> something.

That doesn't help. You'd need to have one variant where you just want to
change the working directory, one variant where you just wanted to set an
environment variable, one variant where you want to do both, one variant where
you want to run as a different UID and change working directory but not touch
the environment variables...

------
xroche
fork() is not per se "dangerous" in such a context. You just have to be
careful enough and only use async-signal-safe functions, as you would in a
signal handler (see
[https://www.securecoding.cert.org/confluence/display/c/SIG30...](https://www.securecoding.cert.org/confluence/display/c/SIG30-C.+Call+only+asynchronous-
safe+functions+within+signal+handlers))

Typically calls such as malloc(), printf() etc. are strictly forbidden in the
child after a fork().

~~~
ej_campbell
Only if your app is multi-threaded. Unfortunately, many apps are multi-
threaded "under the covers" due to libraries that spawn threads, so you need
to be careful.

This post does the best job explaining the issue:
[http://www.linuxprogrammingblog.com/threads-and-fork-
think-t...](http://www.linuxprogrammingblog.com/threads-and-fork-think-twice-
before-using-them)

------
hyperbovine
I ran into this exact problem recently in a multithreaded Python app and spent
two days trying to figure out wtf was going on. The multiprocessing spawn
startup mode that was added in 3.4 solves this for most use cases at the
expense of a small performance hit. For 2.x you are SOL however.

~~~
evanj
Part of my motivation for writing these things is that after I've wasted so
much time, I'd love to help others not do the same. Hopefully the article
shows up when you search for the right error message. It's also so _I_
remember what the heck was happening.

Re: multiprocessing in 3.4: I think it is unfortunate that due to backwards
compatibility, the "forkserver" mode can't become the default on Unix, since
it would help avoid this particular issue.

------
known
[https://en.wikipedia.org/wiki/Thrashing_(computer_science)](https://en.wikipedia.org/wiki/Thrashing_\(computer_science\))

------
lsiebert
If there are specific issues with locks from dead child processes that
deadlock, maybe the locks could be addressed specifically.

