
Teleforking a Process onto a Different Computer - trishume
https://thume.ca/2020/04/18/telefork-forking-a-process-onto-a-different-computer/
======
hawski
Seriously cool. That also reminds me of DragonFlyBSD's process checkpointing
feature that offers suspend to disk. In Linux world there were many attempts,
but AFAIK nothing simple and complete enough. To be fair I don't know if DF's
implementation is that either.

[https://www.dragonflybsd.org/cgi/web-man?command=sys_checkpo...](https://www.dragonflybsd.org/cgi/web-man?command=sys_checkpoint&section=2)

[https://www.dragonflybsd.org/cgi/web-man?command=checkpoint&...](https://www.dragonflybsd.org/cgi/web-man?command=checkpoint&section=ANY)

~~~
jabedude
This article[0] on LWN seems to suggest that Linux has no kernel support for
checkpoint/restore because it's hard to implement (no arguments there). But
hypervisors support checkpoint/restore for virtual machines, e.g. ESXi VMotion
and KVM live migration, so it seems like these technical problems are
solvable. Indeed all the benefits of VM migration seem to also apply to
process migration (load balancing, service uptime, etc).

0\. [https://lwn.net/Articles/293575/](https://lwn.net/Articles/293575/)

~~~
colinchartier
There's also a project called CRIU that is used experimentally by Docker for
container save/load:
[https://www.criu.org/Main_Page](https://www.criu.org/Main_Page)

~~~
jraph
You were faster than me. Yes, CRIU does checkpointing in userspace on Linux.

They contributed a lot of patches to the Linux kernel to support this feature.
So there isn't one monolithic checkpointing feature in the mainline kernel;
rather, there are many kernel features that allow building checkpoint/restore
in userspace. For instance, thanks to CRIU you can set the PID of the next
process you spawn on Linux, because CRIU needs to be able to restore processes
with the same PIDs they had when they were checkpointed [1].
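
The knob in question is /proc/sys/kernel/ns_last_pid (see [1]). A minimal
sketch of the trick, assuming root and no concurrent forks, with a
hypothetical target PID:

    /* Write (target - 1) to ns_last_pid, then fork: the kernel hands the
       next process last_pid + 1. Needs privilege, and it's racy if anything
       else forks in between (CRIU takes care to avoid that). */
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void) {
        pid_t target = 12345;  /* hypothetical PID we want the child to get */
        int fd = open("/proc/sys/kernel/ns_last_pid", O_WRONLY);
        if (fd < 0) { perror("open"); return 1; }
        dprintf(fd, "%d", (int)(target - 1));
        close(fd);
        if (fork() == 0) {
            printf("child pid: %d\n", (int)getpid());  /* 12345, if we won */
            _exit(0);
        }
        return 0;
    }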

CRIU is used by OpenVZ, which, if I remember correctly, is moving (or has
moved) away from a kernel-based approach.

[1] [http://efiop-notes.blogspot.com/2014/06/how-to-set-pid-using...](http://efiop-notes.blogspot.com/2014/06/how-to-set-pid-using-nslastpid.html)

~~~
jabedude
This was immensely informative, thank you for that link. CRIU looks like an
awesome project. I can't believe I haven't heard of them prior to today.

------
synack
This reminds me of OpenMOSIX, which implemented a good chunk of POSIX in a
distributed fashion.

MPI also comes to mind, but it's more focused on the IPC mechanisms.

I always liked Plan 9's approach, where every CPU is just a file and you
execute code by writing to that file, even if it's on a remote filesystem.

~~~
AstroJetson
For people who want to try OpenMOSIX out, take a look at this site:
[http://dirk.eddelbuettel.com/quantian.html](http://dirk.eddelbuettel.com/quantian.html)
He has a distro called Quantian with a big collection of science tools added.
Shame it's Sunday; I'll need to wait a week to pull it down and see how well
it flies.

~~~
AstroJetson
Found the developer's page; 'quantian 0.7.9.2 iso' leads to
[http://sunsite.rwth-aachen.de:3080/ftp/pub/Linux/quantian/](http://sunsite.rwth-aachen.de:3080/ftp/pub/Linux/quantian/)
This was the last release.

------
ISL
What's old is new again -- I'm pretty sure QNX could do this in the 1990s.

QNX had a really cool way of doing inter-process communication over the LAN
that worked as if it were local. Used it in my first lab job in 2001. You
might not find it on the web, though. The API references were all (thick!)
dead trees.

Edit: Looks like QNX4 couldn't fork over the LAN. It had a separate "spawn()"
call that could operate across nodes.

[https://www.qnx.com/developers/docs/qnx_4.25_docs/qnx4/sysar...](https://www.qnx.com/developers/docs/qnx_4.25_docs/qnx4/sysarch/proc.html)

~~~
checker659
> I'm pretty sure QNX could do this in the 1990s.

Plan 9, Smalltalk

~~~
imglorp
Erlang of course. All spawns and messages can be local or remote.

~~~
russellbeattie
Any time I see something about remote processes, I immediately think, "Erlang
could probably do this."

I think Erlang would have been the programming language of the 21st century...
If only the syntax wasn't like line noise and a printer error code had a baby,
and raised it to think like Lisp.

~~~
sosodev
Elixir is pretty cool. You get all the power of Erlang wrapped up in a much
more developer friendly language.

------
peterwwillis
It's nice to see people re-discover old school tech. In cluster computing this
was generally called "application checkpointing"[1] and it's still in use in
many different systems today. If you want to build this into your app for
parallel computing you'd typically use PVM[2]/MPI[3]. SSI[4] clusters tried to
simplify all this by making any process "telefork" and run on any node (based
on a load balancing algorithm), but the most persistent and difficult
challenge was getting shared memory and threading to work reliably.

It looks like CRIU support has been in mainline kernels since 3.11[5], and it
works for me on Ubuntu 18.04, so you can basically do this now without custom
apps (a minimal sketch follows the links below).

[1]
[https://en.wikipedia.org/wiki/Application_checkpointing](https://en.wikipedia.org/wiki/Application_checkpointing)
[2]
[https://en.wikipedia.org/wiki/Parallel_Virtual_Machine](https://en.wikipedia.org/wiki/Parallel_Virtual_Machine)
[3]
[https://en.wikipedia.org/wiki/Message_Passing_Interface](https://en.wikipedia.org/wiki/Message_Passing_Interface)
[4]
[https://en.wikipedia.org/wiki/Single_system_image](https://en.wikipedia.org/wiki/Single_system_image)
[5]
[https://en.wikipedia.org/wiki/CRIU#Use](https://en.wikipedia.org/wiki/CRIU#Use)
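
As for driving CRIU from code: a hedged sketch using libcriu, CRIU's C client
library. The function names follow my reading of the criu.org C API docs, so
treat the exact signatures and return conventions as assumptions and check
criu.h on your system.

    /* Dump a running process tree with libcriu, then restore it (possibly
       on another machine after copying the image directory over). */
    #include <criu/criu.h>
    #include <fcntl.h>
    #include <stdbool.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char **argv) {
        int dirfd = open("/tmp/ckpt", O_DIRECTORY);  /* image directory */
        criu_init_opts();
        criu_set_images_dir_fd(dirfd);
        criu_set_shell_job(true);     /* target was started from a shell */
        criu_set_pid(atoi(argv[1]));  /* root of the process tree to dump */
        if (criu_dump() < 0)
            fprintf(stderr, "dump failed\n");
        /* later, from the same image directory: */
        if (criu_restore() < 0)
            fprintf(stderr, "restore failed\n");
        return 0;
    }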

------
fitzn
Really cool idea! Thanks for providing so much detail in the post. I enjoyed
it.

A somewhat related project is the PIOS operating system, written 10 years ago
at Yale but still used there today to teach the operating systems class. The
OS has different goals than your project, but it does support forking
processes to different machines and then deterministically merging their
results back into the parent process. Your post reminded me of it. There's a
handful of papers on the different things they did with the OS, including the
one that won a best-paper award at OSDI 2010.

[https://dedis.cs.yale.edu/2010/det/](https://dedis.cs.yale.edu/2010/det/)

------
dekhn
Condor, a distributed computing environment, has done IO remoting (where all
IO calls on the target machine get sent back to the source) for several
decades. The origin of Linux containers was process migration.

I believe people have found other ways to do this; personally, I think the ECS
model (like k8s, but the cloud provider hosts the k8s environment), where the
user packages up all the dependencies and clearly specifies the IO mechanisms
through late binding, makes a lot more sense for distributed computing.

~~~
vidarh
I clicked through to mention Condor too... I first came across it in the 90's,
and it seems like one of those obvious hacks that keeps being reinvented.

~~~
dekhn
I was actually channeling the creator of Condor, Miron Livny, who has a
history of going to talks about distributed computing and pointing out that
"Condor already does that" for nearly everything that people try to tout as
new and cool.

He's not wrong, but few people use Condor.

~~~
iaresee
Few people outside academia, maybe? But inside it still seems to dominate in
areas like physics computation. CERN uses a worldwide Condor grid. LIGO too.
It's excellent for sharing cycles for those slow-burn, highly parallel,
massive-data-scale problems.

I spent more than a decade bringing Condor to semiconductor, financial and
biomedical institutions. It was always a fight to show them there was a better
way to utilize their massive server farms that didn't require paying the LSF
tax. Without a shiny sales or marketing department, Condor was hard to pitch
to IT departments.

Still, to this day, I see people doing things with “modern” platforms like
Kubernetes and such and I chuckle. Had that in Condor 15 years ago in many
cases. :)

~~~
0xdeadbeefbabe
> Still, to this day, I see people doing things with “modern” platforms like
> Kubernetes and such and I chuckle. Had that in Condor 15 years ago in many
> cases. :)

I'm reading the docs and it seems to be used mostly for solving long-running
math problems, like protein folding or SETI@home?

Can it be used for scaling a website too? I think that's k8s's "killer"
feature heh.

~~~
dekhn
Most systems like Condor have a concept that a job or task is something that
"comes up, runs for a while, writes to log files, and then exits on its own,
or after a deadline". I've talked to the various batch-queue providers and I
don't think they really consider "services" (like a webserver, app server, RPC
server, or whatever) in their purview.

In fact, that was what struck me the most when I started at Google (a looong
time ago): at first I thought of Borg as a batch-queue system, but really
it's more of a "service-keeper-upper" that does resource management, and batch
jobs are just one type of job laid on top of the RM system (webserving etc.
are examples of "service jobs").

Over time I've really come to prefer the Google approach; for example, when a
batch job is running, it still listens on a web port, serving a status page
and numerous other introspection pages that are great for debugging.

TBH I haven't read the Condor, PBS, LSF manuals in a while so it's very well
possible they handle service jobs and the associated problems like dynamic
port management, task discovery, RPC balancing, etc.

~~~
iaresee
But in a world where you're continuously deploying on a cadence that's
incredibly quick, how do things differ? I contend the batch and online worlds
start to get pretty blurry at this stage. We're not in a world where bragging
about uptime on our webserver in years is a thing any more.

I was routinely using Condor in semiconductor R&D, running batches of jobs
where each job ran for many days -- that's probably far longer than any
single instance of a service exists at Google in this day and age, right?

None of the batch stuff does the networking management though. No port
mapping, no service discovery registration, no load balancer integration, etc.
That's Kubernetes sugar they lack. But... it has never struck me as overly
hard to add, especially if you use Condor's Docker runner facilities.

Edit: I should say that I don't _really_ think you could swap out Kubernetes
for Condor. Not easily. But it's always been on my long list of weekend
projects to see what running a cluster of online services on Condor would be
like. I don't think it'd be awful or all that hard.

The other killer Condor tech is ClassAds, their matchmaking data model. The
double-evaluation approach of ClassAds is so fantastic for non-homogeneous
environments, where loads have needs and computational nodes have needs and
everyone can end up mostly happy: jobs and machines each publish attributes
plus Requirements and Rank expressions, and the matchmaker evaluates each
side's expressions against the other's.

------
userbinator
_This can let you stream in new pages of memory only as they are accessed by
the program, allowing you to teleport processes with lower latency since they
can start running basically right away._

That's what "live migration" does; it can be done with an entire VM:
[https://en.wikipedia.org/wiki/Live_migration](https://en.wikipedia.org/wiki/Live_migration)

~~~
dmitrygr
Hey what's the best way to contact you? Have a question. If possible my email
is in my profile.

~~~
sitkack
[https://www.usenix.org/node/170864](https://www.usenix.org/node/170864)

------
Animats
That goes back to the 1980s, with UCLA Locus. This was a distributed UNIX-like
system. You could launch a process on another machine and keep I/O and pipes
connected. Even on a machine with a different CPU architecture. They even
shared file position between tasks across the network. Locus was eventually
part of an IBM product.

A big part of the problem is "fork", which is a primitive designed to work on
a PDP-11 with very limited memory. The way "fork" originally worked was to
swap out the process, and instead of discarding the in-memory copy, duplicate
the process table entry for it, making the swapped-out version and the in-
memory version separate processes. This copied code, data, and the process
header with the file info. This is a strange way to launch a new process, but
it was really easy to implement in early Unix.

Most other systems had some variant of "run" - launch and run the indicated
image. That distributes much better.

~~~
wrs
No need to use the past tense — CreateProcess is the primitive in Windows NT.
(It's been 20 years...can we just call that Windows now?)

~~~
zozbot234
posix_spawn is a thing as well.
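
A minimal example of that "launch and run an image" primitive, for the
curious (error handling kept skimpy):

    /* posix_spawn: start a fresh program image directly, rather than
       fork()ing a copy of the parent and exec()ing over it. */
    #include <spawn.h>
    #include <stdio.h>
    #include <sys/wait.h>

    extern char **environ;

    int main(void) {
        pid_t pid;
        char *args[] = { "echo", "hello from a spawned process", NULL };
        /* NULL file actions/attributes: inherit the parent's setup */
        int err = posix_spawnp(&pid, "echo", NULL, NULL, args, environ);
        if (err != 0) {
            fprintf(stderr, "posix_spawnp failed: %d\n", err);
            return 1;
        }
        waitpid(pid, NULL, 0);
        return 0;
    }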

~~~
nwmcsween
posix_spawn is horrendous

------
dreamcompiler
Telescript [0] is based on this idea, although at a higher level. I wish we
could just build Actor-based operating systems and then we wouldn't need to
keep reinventing flexible distributed computation, but alas...[1]

[0]
[https://en.wikipedia.org/wiki/Telescript_(programming_langua...](https://en.wikipedia.org/wiki/Telescript_\(programming_language\))

[1] Yes I know Erlang exists. I wish more people would use it.

~~~
peterwwillis
We'll keep poorly reinventing distributed computing features until we have a
real distributed operating system. We're actually not that far off from it,
but good luck convincing a mainline kernel dev to accept your patches.

------
systemBuilder
I think the problem with a lot of these ideas is that the value of fork() is
only marginally higher than the value of starting a fresh process with
arguments on a remote machine, while the complexity of moving a full process
to another machine is 10 times higher than just starting a new process on a
remote machine that already has all the binaries present.

Quite frankly, fork only exists and gets used because it's so damned cheap to
copy the page-table entries and use copy-on-write, to save RAM. Take away the
cheapness by copying the whole address space over a network, adding slowness,
and nobody will be interested any more.

And both techniques are inferior to having a standing service on the remote
machine that can accept an RPC and begin doing useful work in under 10
microseconds.

RPC is how we launch mapshards at Google: the worker process is a long-running
server that receives a job spec over the network and can execute against it
right away.

------
abotsis
Also of interest might be Sprite, a Berkeley research OS developed "back in
the day" by Ken Shirriff and others. It boasted a lot of innovations, like a
logging filesystem (not just metadata) and a distributed process model and
filesystem allowing live migration between nodes.
[https://www2.eecs.berkeley.edu/Research/Projects/CS/sprite/s...](https://www2.eecs.berkeley.edu/Research/Projects/CS/sprite/sprite.html)

~~~
kens
To be clear, John Ousterhout was the creator of the Sprite operating system. I
was one of the grad students working on it.

------
TazeTSchnitzel
In essence this is manually implementing forking — spawning a new process and
copying the bytes over without getting the kernel to help you — except over a
network too.

It reminds me a bit of when I wanted to parallelise the PHP test suite but
didn't want to (couldn't?) use fork(), yet I didn't want to substantially
rewrite the code to be amenable to cleanly re-initialising the state in the
right way. But conveniently, this program used mostly global variables, and
you can access global variables in PHP as one magic big associative array
called $GLOBALS. So I moved most of the program's code into two functions
(mostly just adding the enclosing function declaration syntax and indentation,
plus `global` imports), made the program re-invoke itself NPROCS times mid-
way, sending its children `serialize($GLOBALS)` over a loopback TCP
connection, then had the spawned children detect an environment variable to
receive the serialized array over TCP, unserialize() it and copy it into
`$GLOBALS`, and call the second function… lo and behold, it worked perfectly.
:D (Of course I needed to make some other changes to make it useful, but they
were also similar small incisions that tried to avoid refactoring the code as
much as possible.)

PHP's test suite uses this horrible hack to this day. It's… easier than
rewriting the legacy code…

~~~
trishume
Indeed! I was talking to someone about their attempt to fork a Graal Java
process and recreate all the compiler and GC threads, and I said if I had that
task I'd be tempted to just use my new knowledge to implement a fork that also
recreated those threads rather than trying to understand how to shut them down
and restore them properly.

------
vladbb
I implemented something similar ten years ago for a class project:
[https://youtu.be/0am-5noTrWk](https://youtu.be/0am-5noTrWk)

------
new_realist
See [https://criu.org/Live_migration](https://criu.org/Live_migration)

------
jka
This reminds me a little bit of the idea of 'Single System Image'[1]
computing.

The idea, in abstract, is that you login to an environment where you can list
running processes, perform filesystem I/O, list and create network
connections, etc -- and any and all of these are in fact running across a
cluster of distributed machines.

(in a trivial case that cluster might be a single machine, in which case it's
essentially no different to logging in to a standalone server)

The wikipedia page referenced has a good description and a list of
implementations; sadly the set of {has-recent-release && is-open-source &&
supports-process-migration} seems empty.

[1] -
[https://en.wikipedia.org/wiki/Single_system_image](https://en.wikipedia.org/wiki/Single_system_image)

~~~
macintux
That was the original concept that led Ian Murdock and John Hartman to found
Progeny. The idea was that overnight, while no one was working at their desks,
companies could reboot their Windows boxes into a SSI network of Linux nodes
to run parallel compute tasks.

Roughly, anyway, I got the sales pitch 20 years ago so my memories are fuzzy.
I wasn’t remotely sold on it but was so anxious to work for a Linux R&D
company in Indianapolis of all places that I accepted the job anyway.

Sadly we didn’t get far on the concept before the dot-com crash. Absent more
venture capital we pivoted to focus on something we could sell, Progeny Linux,
and tried to turn that into a managed platform for companies who wanted to run
Linux on their appliances.

------
londons_explore
Bonus points if you can effectively implement the Linux kernel's "copy on
write" ability over the network: only send a page to the remote machine once
it is changed in either the local or remote fork, or read in the remote fork.

An rsync-like diff algorithm might also substantially reduce copied pages if
the same or a similar process is teleforked multiple times.

Many processes have a lot of memory which is never read or written, and
there's no reason that should be moved, or at least no reason it should be
moved quickly.

Using that, you ought to be able to resume the remote fork in milliseconds
rather than seconds.

userfaultfd() or mapping everything to files on a FUSE filesystem both look
like promising implementation options.
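
To make the userfaultfd option concrete, here's a hedged, minimal sketch of
demand paging, following the general flow of the userfaultfd(2) man page: a
handler thread services faults by copying a page in (from a local buffer
here; a telefork would fetch it over the network). Kernel support and the
privileges needed for unprivileged use vary.

    /* Register an anonymous region with userfaultfd, then resolve the
       first touch of each missing page from a handler thread. */
    #include <fcntl.h>
    #include <linux/userfaultfd.h>
    #include <pthread.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/ioctl.h>
    #include <sys/mman.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    static void *fault_handler(void *arg) {
        int uffd = *(int *)arg;
        long pgsz = sysconf(_SC_PAGESIZE);
        char *src = mmap(NULL, pgsz, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        memset(src, 'A', pgsz);             /* stand-in for a network fetch */
        for (;;) {
            struct uffd_msg msg;
            if (read(uffd, &msg, sizeof msg) <= 0)
                break;
            if (msg.event != UFFD_EVENT_PAGEFAULT)
                continue;
            struct uffdio_copy cp = {
                .dst = msg.arg.pagefault.address & ~(pgsz - 1),
                .src = (unsigned long)src,
                .len = pgsz,
            };
            ioctl(uffd, UFFDIO_COPY, &cp);  /* fills the page, wakes faulter */
        }
        return NULL;
    }

    int main(void) {
        long pgsz = sysconf(_SC_PAGESIZE);
        int uffd = syscall(SYS_userfaultfd, O_CLOEXEC);
        struct uffdio_api api = { .api = UFFD_API };
        ioctl(uffd, UFFDIO_API, &api);
        char *region = mmap(NULL, 4 * pgsz, PROT_READ | PROT_WRITE,
                            MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        struct uffdio_register reg = {
            .range = { .start = (unsigned long)region, .len = 4 * pgsz },
            .mode = UFFDIO_REGISTER_MODE_MISSING,
        };
        ioctl(uffd, UFFDIO_REGISTER, &reg);
        pthread_t t;
        pthread_create(&t, NULL, fault_handler, &uffd);
        printf("first byte: %c\n", region[0]);  /* faults; handler fills it */
        return 0;
    }

(Compile with -pthread.)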

~~~
mlyle
If you just pull things on demand, you're going to get a lot of round-trip-
time penalties to page things in.

I think you should still be pushing the memory as fast as you can, but maybe
you start the child while this is still in progress, and prioritize sending
stuff the child asks for (reorder to send that stuff "next"), if you've not
already sent it.

~~~
trishume
Yah, that is indeed a super important optimization for avoiding round trips.
CRIU does this and calls it "pre-paging"; their wiki also mentions that they
adapt their page streaming to try to pre-stream pages around ones that have
faulted: [https://en.wikipedia.org/wiki/Live_migration#Post-copy_memor...](https://en.wikipedia.org/wiki/Live_migration#Post-copy_memory_migration)

edit: lol, I didn't realize that isn't CRIU's wiki, since they just linked to
a Wikipedia page and both run MediaWiki. This is the actual CRIU wiki page,
and it's way harder to tell whether they do this, although I suspect they do
and it's in the "copy images" step of the diagram:
[https://criu.org/Userfaultfd](https://criu.org/Userfaultfd)

------
rapjr9
There was a lot of work on mobile agents 20 years ago: Java programs that
could jump from machine to machine over the network and continue executing
wherever they landed. The field stagnated because there were some really
difficult security problems (how can you trust the code you execute on your
machine? How can the code trust whatever machine it lands on and use its
services?). I think later work resolved the security issues, but the field
has not resurged. It might be a good place to start to see what the issues
and risks of mobile task execution are.

------
carapace
"Somebody else has had this problem."

Don't get me wrong, this is great hacking and great fun. And this is a good
point:

> I think this stuff is really cool because it’s an instance of one of my
> favourite techniques, which is diving in to find a lesser-known layer of
> abstraction that makes something that seems nigh-impossible actually not
> that much work. Teleporting a computation may seem impossible, or like it
> would require techniques like serializing all your state, copying a binary
> executable to the remote machine, and running it there with special command
> line flags to reload the state.

------
lachlan-sneff
Wow, this is really interesting. I bet that there's a way of doing this
robustly by streaming wasm modules instead of full executables to every server
in the cluster.

------
saagarjha
It's touched on at the very end, but this kind of work is somewhat similar to
what the kernel needs to do on a fork or context switch, so you can really
figure out what state you need to keep track of from there. Once you have
that, scheduling one of these network processes isn't really all that
different from scheduling a normal process, except that, of course, syscalls
on the remote machine will possibly go to a kernel that doesn't know what to
do with them.

------
p4bl0
Very cool :). Apart from Plan 9, which many people here have already talked
about, it also made me think of Emacs' `unexec` [0].

[0]
[http://git.savannah.gnu.org/cgit/emacs.git/tree/src/unexelf....](http://git.savannah.gnu.org/cgit/emacs.git/tree/src/unexelf.c)

------
peterkelly
There's been a bunch of interesting work done on this over the years. Here's a
literature survey on the topic:
[https://dl.acm.org/doi/abs/10.1145/367701.367728](https://dl.acm.org/doi/abs/10.1145/367701.367728)

------
tpetry
It's a really nice idea. But while reading it I came to the conclusion that
web workers are a really ingenious idea that could work equally well for
C-like software: everything that should run on a different server is a
separate executable, so that executable would be shipped to the destination
and started, and then the two processes can talk by message passing. This
concept is so generic that there could be dozens of "schedulers" to start a
process in a remote location: connect over ssh, start a cloud VM, ...

------
zozbot234
The page states that CRIU requires kernel patches, but other sources say that
the kernel code for CRIU is already in the mainline kernel. What's up with
that?

~~~
en4bz
CRIU can use two mechanisms to detect page changes. One is the soft-dirty
kernel feature, which is mainlined and can be accessed via /proc/PID/pagemap
[1]. The other is userfaultfd, which is only partially merged in the newest
kernels; userfaultfd still lacks the write detection the article mentions. My
understanding is that using pagemap requires the entire process to be frozen
while it is scanned for memory changes and the memory is copied, while uffd
allows a more streaming/on-demand approach that doesn't require stopping the
entire process.

[1] [https://www.kernel.org/doc/Documentation/vm/soft-dirty.txt](https://www.kernel.org/doc/Documentation/vm/soft-dirty.txt)
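
For concreteness, a minimal sketch of the soft-dirty mechanism from [1]:
clear the bits via clear_refs, dirty a page, then check bit 55 of its pagemap
entry.

    /* Soft-dirty tracking: writing "4" to /proc/PID/clear_refs resets the
       bits; bit 55 of a pagemap entry then says if the page was written. */
    #include <fcntl.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(void) {
        long pgsz = sysconf(_SC_PAGESIZE);
        char *p = mmap(NULL, pgsz, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        p[0] = 1;                           /* fault the page in */

        int cr = open("/proc/self/clear_refs", O_WRONLY);
        write(cr, "4", 1);                  /* clear soft-dirty bits */
        close(cr);

        p[0] = 2;                           /* dirty the page again */

        int pm = open("/proc/self/pagemap", O_RDONLY);
        uint64_t entry;
        pread(pm, &entry, sizeof entry, ((uintptr_t)p / pgsz) * sizeof entry);
        printf("soft-dirty: %d\n", (int)((entry >> 55) & 1));  /* expect 1 */
        close(pm);
        return 0;
    }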

------
bamboozled
I could imagine having a build system which produces a process as an artifact
and then just forks it in the cloud without distributing those pesky archives!

~~~
keeganpoppen
that... is a really fucking cool idea. i mean, it's of limited utility in the
sense that builds really should generally work most of the time & be
reproducible, so even though there is a bit of wasted effort, bandwidth, etc.
in running the same build in multiple places at once, it's not like the
_worst_ thing in the world... but it just feels like there have to be cool /
"useful" scenarios where you want to distribute a _thing_ , but not the
environment, etc. required to _build_ the thing. oh wait, that already exists:
it's called "a binary"... lol. iono.

i mean i guess if you have a process byte-for-byte frozen in time, you're a
bit more certain (provably so? maybe?) that when it is resurrected on some
remote host, the remote process and the local process (as of t_freeze) are in
~"exactly" the same state?

or if you continuously snapshotted a process while it was running, and then
when it crashes you (hopefully) have a snapshot from right before when it
crashed that can be replayed, modified, etc. at will? but that's a bit of a
stretch, and there are simpler ways of accomplishing that for the most part...

the only other thing i could think of is that this could allow you to run
completely diskless servers, because you can just beam all the processes in,
without ever having to "install" anything, beyond copying all the shared libs,
etc., which i guess could get the "I have one of these and i want 100 of
them, in the cloud" latency down by an order of magnitude or two maybe.

------
anthk
What I'd love is easily binding remote directories as local. Not NFS, but a
braindead 9p. If I don't have a tool, I'd love to bind-mount a directory from
a stranger and run a binary from within it (or pipe to it), without them
being able to trace the I/O.

If the remote FS is on a different arch, I should be able to run the same
binary remotely as a seamless fallback option.

------
touisteur
I wonder whether the effort by the syzkaller people (@dvyukov) could help
with actually describing all the syscalls (which the author says people have
given up on for now, because it's too complex), since they need those
descriptions to be able to fuzz efficiently...

------
YesThatTom2
Condor did this in the early 90s.

------
cecilpl2
This is similar to what Incredibuild does. It distributes compile and compute
jobs across a network, effectively sandboxing the remote process and
forwarding all filesystem calls back to the initiating agent.

------
crashdelta
This is one of the best side projects I've ever seen, hands down.

~~~
trishume
<3

------
sharno
I think Unison [0] is going to make this trivial

[0] [https://www.unisonweb.org/](https://www.unisonweb.org/)

------
justicezyx
There is some company using CRIU to implement general-purpose live migration
of processes (as opposed to VMs).

------
concernedctzn
side note: take a look at this guy's other blog posts, they're all very good

------
pcr910303
This makes me think of Urbit[0]: since Urbit OS represents the entire OS
state as a simple tree, this would be very simple to implement there.

[0]: [https://urbit.org/](https://urbit.org/)

------
rhabarba
I love how people reimplement Plan 9 (poorly).

------
cjbprime
Amazing work.

~~~
ignoramous
Yep. Reminds me of u/xal praising u/trishume:
[https://news.ycombinator.com/item?id=20490964](https://news.ycombinator.com/item?id=20490964)

------
crashdelta
THIS IS REVOLUTIONARY!

~~~
systemBuilder
Hardly. See my rebuttal above.

------
totorovirus
teleforkbomb..

------
anticensor
hoard() would be a better name.

~~~
jabedude
I don't understand, could you explain? `tfork(2)` seems more in line with
Linux naming conventions.

~~~
anticensor
Telefork as a verb does not map to a real-world concept. Hoard fits better,
as in hoarding another computer for your process.

~~~
carapace
Ha! I thought you had misspelled _horde!_ ;-)

~~~
buckminster
I think you mishurd.

