
Fork is not my favourite syscall - eadmund
http://sircmpwn.github.io/2018/01/02/The-case-against-fork.html
======
rwmj
He misses the main problem with fork, which is its poor interaction with
threads. Among the things you cannot do between the fork and the exec is
malloc: another thread in the parent might be in the middle of malloc,
holding a lock malloc needs; that thread disappears after fork, but the lock
is still held in the new child process, which then deadlocks in malloc.

This and other issues are a massive cause of trouble for libraries that need
to run subprocesses but must also link into main programs that might use
threads (such as libvirt).
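
A contrived sketch of the hazard (it's a race, so it won't deadlock on every
run, and some allocators -- glibc's included -- now work around this specific
case internally; POSIX guarantees nothing):

    #include <pthread.h>
    #include <stdlib.h>
    #include <sys/wait.h>
    #include <unistd.h>

    static void *alloc_loop(void *arg) {
        (void)arg;
        for (;;)
            free(malloc(4096));   /* constantly takes malloc's internal lock */
        return NULL;
    }

    int main(void) {
        pthread_t t;
        pthread_create(&t, NULL, alloc_loop, NULL);

        pid_t pid = fork();       /* only the calling thread exists in the child */
        if (pid == 0) {
            /* if the fork snapshot caught alloc_loop inside malloc, the lock
               is held forever in the child and this call never returns */
            malloc(16);
            _exit(0);
        }
        waitpid(pid, NULL, 0);
        return 0;
    }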

~~~
nh2
This sounds incredibly bad.

For most nontrivial programs and libraries, you have no control over when
they call malloc(). If you spawn a thread and run any library function in it,
it may malloc().

Doesn't this mean that any program written in any programming language that
supports multiple threads and spawns subprocesses is broken (it can race into
a deadlock if the spawned process also wants to malloc)?

( _EDIT: I misunderstood what the parent said, so it's not so bad. See my
comment below._)

Is this problem discussed anywhere? Are there known ways people handle this
problem?

~~~
FooBarWidget
> Doesn't this mean that any program written in any programming language that
> supports multiple threads and spawns subprocesses is broken (it can race
> into a deadlock if the spawned process also wants to malloc)?

Yes.

> Are there known ways people handle this problem?

The usual way is to ensure that after forking you only execute async-signal-
safe code, which includes exec(). So if you have more complicated logic that
cannot be made async-signal-safe, you exec() an executable which performs
that logic.

This is a massive pain, though. For example, setenv() is not async-signal-
safe, so if I want to exec() something with specific env vars I have to go
through /usr/bin/env. Sigh.
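
Concretely, the workaround looks something like this (a sketch: the argv is
built before fork(), the variables travel as arguments rather than via
setenv(), and only async-signal-safe calls happen after the fork):

    #include <sys/wait.h>
    #include <unistd.h>

    extern char **environ;

    int main(void) {
        /* set up before fork(); no allocation needed afterwards */
        char *argv[] = { "env", "GREETING=hello", "/bin/sh", "-c",
                         "echo $GREETING", (char *)0 };

        pid_t pid = fork();
        if (pid == 0) {
            execve("/usr/bin/env", argv, environ);  /* async-signal-safe */
            _exit(127);                             /* exec failed */
        }
        waitpid(pid, NULL, 0);
        return 0;
    }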

~~~
AdieuToLogic

      if I want to exec() something with specific
      env vars then I have to go through
      /usr/bin/env. Sigh.
    

Why not use execve[0] to set up a specific environment for child processes?

0 -
[https://www.freebsd.org/cgi/man.cgi?query=execve&apropos=0&s...](https://www.freebsd.org/cgi/man.cgi?query=execve&apropos=0&sektion=0&manpath=FreeBSD+11.1-RELEASE+and+Ports&arch=default&format=html)

~~~
0xcde4c3db
Maybe a tangent, but I never really understood how the memory pointed to by
_envp_ is supposed to be managed. Does fork() copy the allocator state such
that you can just malloc(), fork(), then free() in the parent process, while
execve() blows away the whole memory map in the child?

~~~
AdieuToLogic
Yes, the flow you describe is pretty much how it goes. The OS will "overlay"
the forked child process if exec succeeds in loading the specified program.
If not, errno is set and the child process typically reports the problem
(using something like strerror_r[0]) and then exits.

In either case, the parent can free the memory allocated for the execve call
as soon as fork() returns, since the child has its own copy-on-write copy.

0 -
[https://www.freebsd.org/cgi/man.cgi?query=strerror_r&apropos...](https://www.freebsd.org/cgi/man.cgi?query=strerror_r&apropos=0&sektion=3&manpath=FreeBSD+11.1-RELEASE+and+Ports&arch=default&format=html)
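
A sketch of that flow, with illustrative values:

    #include <stdlib.h>
    #include <string.h>
    #include <sys/wait.h>
    #include <unistd.h>

    int main(void) {
        /* all allocation happens in the parent, before fork() */
        char **envp = malloc(3 * sizeof *envp);
        envp[0] = strdup("PATH=/usr/bin:/bin");
        envp[1] = strdup("GREETING=hello");
        envp[2] = NULL;
        char *argv[] = { "sh", "-c", "echo $GREETING", NULL };

        pid_t pid = fork();
        if (pid == 0) {
            execve("/bin/sh", argv, envp);  /* success replaces the child's
                                               whole memory map */
            _exit(127);                     /* exec failed */
        }
        waitpid(pid, NULL, 0);

        /* the parent's copy is untouched by the child, so freeing is safe */
        free(envp[0]);
        free(envp[1]);
        free(envp);
        return 0;
    }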

------
TuringTest
The concept of fork was not designed to "spawn new programs"; it was created
to allow parallelism within a single program, with several processes
collaborating on the same task. It's even in the name (you _fork_ the program
in two, you don't _spawn_ a new process).

For this purpose, the choice to start with the same data structures and the
same execution point is quite sensible; it's essentially what happens when
you create a thread nowadays. So the hypothetical discussion where someone
decided to copy the address space in order to launch new programs would never
have happened.

The day someone decided to _reuse_ the existing fork code to launch new
programs, though, must have been worth watching.

~~~
majewsky
> it's essentially what happens when you create a thread nowadays

I'm not familiar with the guts of threading libs, but AFAIK the syscall that
spawns threads is clone(), which takes a function pointer and a stack pointer,
so it very much does not start "with the same data structures and the same
execution point".

~~~
tialaramex
You're talking about the C library function clone(). The actual system call
"clone" on Linux doesn't take a function pointer (the raw call accepts a
stack pointer, which may even be null); like fork, it simply returns in both
processes, and jumping to your function on the new stack is implemented in
the library by checking the return value.

This is all fine; there are two scenarios, neither of which is the imagined
"race" described above:

1. "Fork-and-exec": our new execution context will actually run a different
program. We immediately exec; the new program doesn't care about most of the
state of the old program and throws that state away, including locks that
were held by the old program's allocator.

2. "New thread": our new context will share all the state. If another thread
has a process-wide lock for some reason, it can and _should_ apply to us too.
If it took a thread-specific lock, we're not that thread, so we already don't
care.

~~~
majewsky
> You're talking about the C library function clone().

Huh? I saw that the manpage for clone() is in section 2, which AFAIK means
it's kernel API, whereas C library functions are usually in section 3. Am I
mistaken?

Even syscalls(2) refers to clone(2) in its list of syscalls.

~~~
tialaramex
Yeah, no.

You are correct that manual section 2 is "for" system calls, but in practice
today these pages document the lowest level of the C library, not the actual
system calls.

My copy of the clone(2) manual page hints at this with the comment /* For the
prototype of the raw system call, see NOTES */ and says so more explicitly in
its body text: "The main text describes the wrapper function; the differences
for the raw system call are described toward the end of this page."

Other modern stuff is the same way: the section 2 documentation describes
what an ordinary C programmer needs to know, but if you're, say, implementing
the C standard library or writing a compiler for a language that runs on the
bare metal, you need to read the kernel documentation, which is not provided
as manual pages.
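
To make the distinction concrete, here is roughly what the wrapper's contract
looks like (a sketch; with only SIGCHLD in the flags it behaves like fork,
except that the library arranges for child_fn to run on the stack you supply):

    #define _GNU_SOURCE
    #include <sched.h>
    #include <signal.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/wait.h>

    static int child_fn(void *arg) {
        printf("child got: %s\n", (char *)arg);
        return 0;
    }

    int main(void) {
        size_t sz = 1024 * 1024;
        char *stack = malloc(sz);

        /* the glibc wrapper takes the function and the (top of the) stack;
           the raw syscall underneath just returns twice, like fork */
        pid_t pid = clone(child_fn, stack + sz, SIGCHLD, (void *)"hello");
        if (pid == -1) { perror("clone"); return 1; }

        waitpid(pid, NULL, 0);
        free(stack);
        return 0;
    }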

------
sitkack
My favorite part of fork+exec is the _temporary_ exhaustion of virtual
address space on a heavily loaded system where the parent process is using a
shitton of memory. 50 GB forked is instantly 100 GB of committed virtual
address space, even if the fork is immediately followed by an exec. Couple
this with sysadmins not configuring swap, and the committable address space
== physical memory. On a heavily loaded system, high-memory processes can't
even launch simple utilities without incurring the wrath of the OOM killer.
Configure swap: even if it is never used, it still increases the amount of
address space you can commit.

~~~
test608481164
That really depends on your overcommit settings. You seem to be describing
the overcommit = 2 case, where the total VIRT across all processes may not
exceed swap + memory * overcommit_ratio%. But that's not the default mode
(the default is overcommit = 0). And you can mitigate this issue in other
ways (since enabling swap can lead to some very bad situations when
_physical_ memory gets exhausted):

- set overcommit = 1; OOM then comes only when sum(RSS) exhausts swap + memory

- just increase overcommit_ratio if you want to stay with overcommit = 2

------
JdeBP
Another person's favourites list is more extensive. (-:

* [https://www.cloudatomiclab.com/antisyscall/](https://www.cloudatomiclab.com/antisyscall/) ([https://news.ycombinator.com/item?id=15998828](https://news.ycombinator.com/item?id=15998828))

* [https://www.cloudatomiclab.com/prosyscall/](https://www.cloudatomiclab.com/prosyscall/) ([https://news.ycombinator.com/item?id=15998344](https://news.ycombinator.com/item?id=15998344))

------
cryptonector
I'm just going to drop this here:
[https://gist.github.com/nicowilliams/a8a07b0fc75df05f684c23c...](https://gist.github.com/nicowilliams/a8a07b0fc75df05f684c23c18d7db234)

------
mcguire
" _Fork is the direct cause of the stupidest component I’ve ever heard of in
an operating system: the out-of-memory (aka OOM) killer._ "

Well, technically, no.

Originally (1st Edition?), IIUC, fork() would fail if there wasn't enough
free memory to copy the process. Then, when fork() was changed (4th Edition?)
to share the address space read-only, that couldn't happen, but the exec()
might fail if there wasn't enough space to set up the child. But those
systems all used segmented memory, so they're completely irrelevant to modern
systems.

The OOM killer comes with (IIRC) BSD's paged virtual memory. Some bright
spark (yes, I have hated this design since I first tripped over it in the
early-to-mid '90s) had the brilliant idea of supporting sparse matrices by
not requiring allocated memory to be backed by valid pages, either in RAM or
on disk.

If you don't require allocated memory to be backed, you can allocate a giant
block of memory; if you never touch most of it, everything will be copacetic.
It's good, I was told ca. 1992, for scientific programs in, like, FORTRAN.
Yay. Unfortunately, lots of programs started using the feature, including the
X server.
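
The effect is easy to demonstrate; whether the allocation "succeeds" depends
on the vm.overcommit_memory mode, and pages only become real when touched:

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    int main(void) {
        size_t sz = (size_t)64 << 30;       /* 64 GiB, likely > physical RAM */
        char *p = malloc(sz);
        printf("malloc(64 GiB) %s\n", p ? "succeeded" : "failed");
        if (p) {
            memset(p, 1, (size_t)1 << 20);  /* touching pages is what actually
                                               consumes memory (and eventually
                                               summons the OOM killer) */
            free(p);
        }
        return 0;
    }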

(Why do I hate with the fiery passion of a thousand suns the perverted genius
who came up with this? AIX 3.2.5/4.1. Running the machine out of memory
(remember, allocations don't fail, everything chugs along until some proggie
touches an address and the OS can't realize the page) would kick off the OOM-
killer, which would immediately hit inetd in the head with its little silver
hammer, resulting in a machine that would still be working, -ish, sort-of, but
unless you were logged in somewhere already to reboot it, you would have to go
visit the physical console.)

Turning off the OOM-killer and overallocation policy was possible, but not
advised, since many programs, including the X server, would suddenly use a
truly atrocious amount of memory.

TL;DR: Memory overcommitment doesn't have anything to do with fork()/exec(),
but rather is a deliberate design decision made with the addition of paged
virtual memory, to support a limited class of applications, that probably
didn't work very well even for that class.

~~~
sitkack
The proper selfish thing to do is lock all of your pages immediately after
allocation.

[https://linux.die.net/man/2/mlock](https://linux.die.net/man/2/mlock)
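
Something like this, early in main() (needs CAP_IPC_LOCK or a generous
RLIMIT_MEMLOCK):

    #include <stdio.h>
    #include <sys/mman.h>

    int main(void) {
        /* lock everything mapped now and everything mapped later into RAM,
           so allocations are physically backed up front instead of lazily */
        if (mlockall(MCL_CURRENT | MCL_FUTURE) != 0) {
            perror("mlockall");
            return 1;
        }
        /* ... rest of the program ... */
        return 0;
    }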

------
ypsu
I see a few people complaining about OOM killing. Allow me to make a case for
it. I think OOM killing is a very pragmatic way to deal with low memory.
Without overcommit you don't need the OOM killer, but simply disabling
overcommit makes your system just as unusable in a low-memory situation.
Imagine that one job ate all your memory. How do you restore your system? You
cannot allocate more memory, so you can no longer start a new shell to kill
the bad process. You are just as stuck with an unusable system. I wrote about
this here:
[https://superuser.com/a/984715](https://superuser.com/a/984715)

What I really want is an always-responsive system. You cannot get this easily
on Linux because it doesn't let you disable all paging (executable pages are
paged out even with swap disabled). This means you can easily lock up your
machine if you use a lot of memory (even without swap -- I don't use swap).
The post above contains a 20-line C program that will break your system in a
way only the OOM killer can cure. I mean, what other options do you have when
you have 0 free memory? Could a kernel hacker reading this add such an option
to Linux for a poor sysadmin like myself?

------
jorangreef
"What if we copied the entire address space of the program into a new process
running from the same spot"

This was actually the cause of 1-2 second blocks in our Node event loop. We
were spawning a clamd scan asynchronously from a process with several
gigabytes of RSS. I never would have imagined that fork would blindly copy
gigabytes of RSS on every invocation. I should have known. It took a few
months to track down.

The kicker is that Node calls fork from the main event loop thread and not
from the thread pool, in order to hack around the GC, if I understand
correctly.

See the relevant Node issue here:

[https://github.com/nodejs/node/issues/14917](https://github.com/nodejs/node/issues/14917)

~~~
rkeene2
fork() doesn't copy the pages -- they're CoW -- rather, the time goes into
setting up the CoW page tables. You can mitigate this by telling clone(2)
that the only thing you'll do right after creating the new process is
execve(2), so it shouldn't bother building those tables. I see that mentioned
at the end of the article, though with the question of whether the setup can
also be combined left open (it probably can).
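
The portable packaging of that idea is posix_spawn(3), which glibc nowadays
implements with CLONE_VM | CLONE_VFORK so the parent's page tables are never
duplicated. A minimal sketch:

    #include <spawn.h>
    #include <stdio.h>
    #include <sys/wait.h>

    extern char **environ;

    int main(void) {
        pid_t pid;
        char *argv[] = { "echo", "hello from the child", NULL };

        /* no fork(): the child borrows the parent's address space
           (vfork-style) just long enough to execve */
        int err = posix_spawnp(&pid, "echo", NULL, NULL, argv, environ);
        if (err != 0) {
            fprintf(stderr, "posix_spawnp: %d\n", err);
            return 1;
        }
        waitpid(pid, NULL, 0);
        return 0;
    }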

~~~
vram22
I'm not quite sure, but I think I remember reading that CoW (copy-on-write)
was not in the earliest Unixes; it might have been implemented at Berkeley
some time later. I seem to remember reading this some years ago while reading
about fork, because without CoW, fork's copying was expensive, particularly
if you were going to exec another program in the child just after forking.

------
znpy
Imho what the author is either missing or voluntarily ignoring is the context
in which the original Unix was developed: as an unofficial, simpler version
of Multics, after the Multics project had failed.

So my guess is that Thompson and co. basically went with a "worse is better"
approach and delivered a model that performed worse but was a lot simpler to
implement and reason about.

Also... it's easy to criticize designs 40 years later. That was a time of
proposals and research. Despite this, the original Unix design, no matter how
flawed, has stood the test of time.

------
srean
I have always wondered about Linux's overcommit feature, which BTW I dislike.
Doesn't ISO C require that if malloc returns a non-null pointer, it has to be
backed physically (not necessarily by RAM; it could be swap on disk or
something else)? I have been told that the OS is allowed to overcommit, but
space obtained through malloc needs to be backed if it is to comply with ISO
C. Apparently AIX and the ISO committee got into an argument about this when
AIX's malloc implementation allowed overcommit. Does anyone remember this?

~~~
benou
You can always disable it using 'sysctl -w vm.overcommit_memory=2', or get
finer control with the various vm.overcommit_* sysctls. In the general case,
I've found that overcommit is usually more useful than strict accounting, but
I agree that for more industrial cases it can be the opposite (though then
you can spend time tuning it).

~~~
srean
Oh! Of course. I was seeking comments on whether it is true that this
behavior is not ISO C compliant.

I understand why this behavior is the default: if you __have__ to support
fork(), this is perhaps the more pragmatic default.

------
DiThi
If it doesn't already exist, there should be a "standard spawn stub": in the
absence of a real spawn, it creates a spawner process at the very beginning
of application startup, so that when you do spawn, the spawner process is the
one that forks, with none of the big parent's memory or open file descriptors
to copy.
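
A rough sketch of such a stub (illustrative only: message framing and error
handling are elided, and the helper still inherits whatever fds were open at
init time):

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/wait.h>
    #include <unistd.h>

    static int spawner_fd = -1;

    /* call this first thing in main(), before the heap gets big */
    void spawner_init(void) {
        int p[2];
        if (pipe(p) != 0) { perror("pipe"); exit(1); }
        if (fork() == 0) {               /* the tiny helper process */
            close(p[1]);
            char cmd[4096];
            ssize_t n;
            while ((n = read(p[0], cmd, sizeof cmd - 1)) > 0) {
                cmd[n] = '\0';
                if (fork() == 0) {       /* cheap: the helper stays small */
                    execl("/bin/sh", "sh", "-c", cmd, (char *)NULL);
                    _exit(127);
                }
                while (waitpid(-1, NULL, WNOHANG) > 0) {}   /* reap */
            }
            _exit(0);
        }
        close(p[0]);
        spawner_fd = p[1];
    }

    /* assumes one command fits in one pipe write */
    void spawner_run(const char *cmd) {
        write(spawner_fd, cmd, strlen(cmd));
    }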

> That this nearly 50 year old crappy design choice has come to this
> astonishes me.

That makes me feel old. For me Unix was always invented 30-something years
ago...

~~~
sitkack
Hopefully Unix will be 100 years old someday.

------
IgorPartola
fork has two uses. One is to run a bunch of child processes that the parent
can control and hand work to; think web workers in apache2, with the parent
accepting connections and the children processing the requests.

The other is to start unrelated programs; think bash starting vim.

I think for the first use case, it is stupid to have to reap the child
process or else it gets inherited by init. Instead, you should be able to
mark the process as a worker from the get-go; it doesn't make sense for
apache2 workers to keep running if the parent process has died.
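
Linux has something close to this knob already: prctl(PR_SET_PDEATHSIG). A
non-portable sketch (spawn_worker is a hypothetical helper):

    #include <signal.h>
    #include <sys/prctl.h>
    #include <unistd.h>

    /* fork a worker that dies along with its parent */
    pid_t spawn_worker(void (*work)(void)) {
        pid_t pid = fork();
        if (pid == 0) {
            /* deliver SIGTERM to this child when the parent exits */
            prctl(PR_SET_PDEATHSIG, SIGTERM);
            if (getppid() == 1)          /* parent died before the prctl? */
                _exit(0);
            work();
            _exit(0);
        }
        return pid;
    }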

In the latter use case, I agree with the OP. I'd rather set up the structure
of what I want to happen, then start that process as a completely independent
thing. spawn seems like the way to go for this.

The good news is that we can totally have all those options as long as we keep
fork as is at least for the next 50 years :)

------
pjc50
Really, POSIX should just acquire CreateProcess and have done with it. It's
the API you generally want when starting a new, unrelated program. For POSIX
you'd probably want to allow specifying which handles become the new
process's stdin/stdout/stderr, since Windows handles consoles in a stranger,
less convenient way.

[https://msdn.microsoft.com/en-us/library/windows/desktop/ms6...](https://msdn.microsoft.com/en-us/library/windows/desktop/ms682425\(v=vs.85\).aspx)
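
For the stdio-handle part, posix_spawn(3)'s file actions are arguably the
closest existing POSIX analogue to the handle fields in CreateProcess's
STARTUPINFO. A sketch:

    #include <fcntl.h>
    #include <spawn.h>
    #include <stdio.h>
    #include <sys/wait.h>

    extern char **environ;

    int main(void) {
        posix_spawn_file_actions_t fa;
        posix_spawn_file_actions_init(&fa);
        /* make fd 1 (stdout) in the child refer to a freshly opened file */
        posix_spawn_file_actions_addopen(&fa, 1, "out.txt",
                                         O_WRONLY | O_CREAT | O_TRUNC, 0644);

        char *argv[] = { "echo", "redirected", NULL };
        pid_t pid;
        int err = posix_spawnp(&pid, "echo", &fa, NULL, argv, environ);
        posix_spawn_file_actions_destroy(&fa);
        if (err != 0) {
            fprintf(stderr, "posix_spawnp: %d\n", err);
            return 1;
        }
        waitpid(pid, NULL, 0);
        return 0;
    }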

~~~
vletrmx
What happens if you want the child to inherit other fds? Or to set other
attributes such as the session, euid, egid, etc.?

The elegance of fork() is in avoiding the complexity of requiring every
attribute of the process to be explicitly stated in order to create a new
process. We simply inherit our parent's attributes, and if that turns out not
to be desirable, it is fixed in the child with the usual syscalls.

EDIT: s/And/Or set/

~~~
jstimpfle
I'm not positive that it's elegant. It may also be considered an ugly hack
(I'm not sure either way), and I'm pretty certain it's been responsible for
many security holes.

It would be interesting to see in what ways Unix (especially shell scripts)
would be different if processes did not inherit all this baggage by default.

~~~
mort96
There's a very, very fine line between an elegant solution and an ugly hack.

~~~
jstimpfle
It may not be fine (your sarcasm is noted), but I still don't see it in this
instance.

