
Snappy-start: Tool for launching a Linux process from a snapshot - ingve
https://github.com/google/snappy-start
======
vidarh
Condor [1] has done this since sometime in the 90's. This [2] paper contains
references to a paper about it from at least 97. An interesting part about the
Condor approach is that it can proxy file and network access, so the file
descriptors etc. can remain open (of course this may not always be what you
want).

[1]
[http://www.sysproject.info/condor.php](http://www.sysproject.info/condor.php)

[2] [http://research.cs.wisc.edu/htcondor/doc/condor-
practice.pdf](http://research.cs.wisc.edu/htcondor/doc/condor-practice.pdf)

------
Animats
This goes way back, to 1960s and 1970s transaction systems such as CICS and
Univac's TIP. The idea is that you have a copy of an initialized transaction
program in a ready to run state, and when a transaction comes in, it's
attached to a copy of the program and executed.

This approach comes in two flavors - in one, each transaction gets a fresh
copy of the program. In the other, the program is used over and over unless it
aborts, in which case a new clean copy is loaded and used. This corresponds to
CGI and FCGI under Apache.
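
A rough sketch of the two flavors, assuming a Unix system where `os.fork()` stands in for "attach the transaction to a copy of the initialized program". The names `expensive_init`, `handle`, `run_fresh_copy` and `ReusedWorker` are made up for illustration and have nothing to do with how CICS, TIP or snappy-start actually work:

```python
import os

def expensive_init():
    # Stand-in for slow startup work (hypothetical).
    return {"table": [n * n for n in range(1000)], "handled": 0}

def handle(state, request):
    # Mutates state so the difference between the two flavors is visible.
    state["handled"] += 1
    return f"{request}:{state['table'][request]}:{state['handled']}"

def run_fresh_copy(state, request):
    """Flavor 1: every transaction runs in a forked copy of the ready state."""
    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:  # child: copy-on-write clone of the initialized parent
        status = 1
        try:
            os.close(read_fd)
            os.write(write_fd, handle(state, request).encode())
            status = 0
        finally:
            os._exit(status)  # never fall back into the parent's code path
    os.close(write_fd)
    with os.fdopen(read_fd, "rb") as pipe:
        reply = pipe.read().decode()
    os.waitpid(pid, 0)
    return reply

class ReusedWorker:
    """Flavor 2: one instance is reused until it aborts, then reloaded."""
    def __init__(self):
        self.state = expensive_init()

    def run(self, request):
        try:
            return handle(self.state, request)
        except Exception:
            self.state = expensive_init()  # load a clean copy after an abort
            raise
```

With the fresh-copy flavor the parent's state never changes between transactions; with the reused flavor it accumulates until something fails.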

Programs which run as transactions have to be prepared for some unusual
events. Time, for example, suddenly jumps forward when a stored transaction
program starts. File and network connections have to be connected to the right
things. Stateful connections, as with databases, may not play well with this
approach. This can work well when the program doesn't have that much live
state when the transaction program is frozen. A program that has a lot of
startup overhead related to getting external resources may not be a good
candidate. Unfortunately, those are the ones for which this is the biggest
win.

If the main problem is program loading overhead, as in Python and Java, maybe
the solution is to go hard-compiled. C, C++, Go, and Rust are hard-compiled,
and you can even hard-compile Java with GCC. Python and JavaScript don't have
that option.

(Nightmare of the future - precompiled JavaScript snapshot blobs loaded into
browsers.)

------
falcolas
Fascinating. Imagine being able to (re)start Java, Ruby and Python processes
nearly instantly... No more interpreter startup overhead, just right into the
process. There are some interesting implications for deployment as well;
snapshot a containerized process on one machine, reproduce the disk elsewhere,
and restart the already partitioned process.

I would think you'd need rather robust error handling in the program being
snapshotted: there would be several machine states which would change from run
to run.

~~~
stcredzero
What Smalltalk did to enable all of the above you mention: Certain facilities
like networking had to be quiesced just prior to pickling, and then specially
revived on return from snapshot. There were also special hooks for return from
snapshot. (IIRC, there is a Smalltalk method of that name.)
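
The quiesce-then-revive pattern could be sketched roughly like this. It is not Smalltalk's actual API; `SnapshotHooks`, `FakeConnection`, `quiesce` and `revive` are hypothetical names, and the "connection" is just a flag rather than a real socket:

```python
import time

class FakeConnection:
    """Stand-in for a network resource that cannot survive a snapshot."""
    def __init__(self, name):
        self.name = name
        self.is_open = True
        self.opened_at = time.time()

    def quiesce(self):
        # Shut down cleanly so no half-open state gets frozen into the image.
        self.is_open = False

    def revive(self):
        # Reconnect and re-read the clock, which may have jumped forward.
        self.is_open = True
        self.opened_at = time.time()

class SnapshotHooks:
    """Registry of resources that need care around a snapshot."""
    def __init__(self):
        self._resources = []

    def register(self, resource):
        self._resources.append(resource)

    def before_snapshot(self):
        # Quiesce in reverse registration order, like unwinding a stack.
        for resource in reversed(self._resources):
            resource.quiesce()

    def after_restore(self):
        # Revive in the original order.
        for resource in self._resources:
            resource.revive()
```

The snapshotting code would call `before_snapshot()` right before freezing and `after_restore()` as the first thing on wake-up.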

Snapshots become much more powerful when combined with a transactional log of
all changes. This gets around certain problems with de-synchronization of
source files with the image. Smalltalks that abandoned the change-log and
returned to conventional source files suffered for it. (Enfin/ObjectStudio)

------
gnufied
Getting this error while running ./make.sh:

    
    
    Hello, world!
    Hello, world!
    * Running test: ./out/example_loader
    Traceback (most recent call last):
      File "tests/run_tests.py", line 57, in <module>
        Main()
      File "tests/run_tests.py", line 40, in Main
        RunTest(['./out/example_loader'], -1, use_elf_loader=False)
      File "tests/run_tests.py", line 34, in RunTest
        subprocess.check_call(['./out/restore'])
      File "/usr/lib64/python2.7/subprocess.py", line 540, in check_call
        raise CalledProcessError(retcode, cmd)
    subprocess.CalledProcessError: Command '['./out/restore']' returned non-zero exit status -11
    

Interestingly the project has Github issues disabled.
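
For reading the trace: Python's `subprocess` reports a negative return code when the child was killed by a signal, so -11 should mean signal 11, which is SIGSEGV on Linux. A small helper to decode it (`describe_returncode` is just a name picked for this sketch):

```python
import signal

def describe_returncode(returncode):
    """Turn a subprocess return code into a human-readable description."""
    if returncode < 0:
        # Negative values mean the child died from signal number -returncode.
        return f"killed by {signal.Signals(-returncode).name}"
    return f"exited with status {returncode}"
```

So `./out/restore` most likely segfaulted; this says nothing about why.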

~~~
gboudrias
> Interestingly the project has Github issues disabled.

This is pretty weird. Certainly goes against the spirit of free software in my
book. It's not like they _have_ to answer any issues.

(Oh right it's Google. Well that explains that.)

~~~
patrickaljord
Actually there is nothing in the definition of free software that requires a
bug tracker, even by gnu.org and rms standards and they're pretty high.

~~~
teddyh
gboudrias said “spirit”, not “definition”.

------
peterwwillis
This is also sometimes called 'checkpointing', and has been a long-studied
feature necessary for distributed processing systems. There are a lot of ways
to skin this cat: kernel modules, application libraries, userland tools, and
features to support various kinds of applications (single-threaded,
multi-threaded, preserving TCP connections, etc.). Find the one that fits your
use best.

Descriptions of how checkpointing works:
[https://en.wikipedia.org/wiki/Application_checkpointing](https://en.wikipedia.org/wiki/Application_checkpointing)
[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.117...](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.117.4869&rep=rep1&type=pdf)
[http://www.anchor.com.au/blog/2013/02/overview-of-
checkpoint...](http://www.anchor.com.au/blog/2013/02/overview-of-checkpoint-
and-restore-live-migrating-processes-on-a-linux-system/)
[https://lwn.net/Articles/478111/](https://lwn.net/Articles/478111/)
[https://www.usenix.org/legacy/event/usenix01/freenix01/full_...](https://www.usenix.org/legacy/event/usenix01/freenix01/full_papers/dieter/dieter_html/paper.html)

Individual solutions:
[https://ckpt.wiki.kernel.org/index.php/Main_Page](https://ckpt.wiki.kernel.org/index.php/Main_Page)
[https://code.google.com/p/cryopid/](https://code.google.com/p/cryopid/)
[http://criu.org/Main_Page](http://criu.org/Main_Page)
[http://crd.lbl.gov/departments/computer-
science/CLaSS/resear...](http://crd.lbl.gov/departments/computer-
science/CLaSS/research/BLCR/)

An almost-full list of solutions here:
[http://checkpointing.org/](http://checkpointing.org/)

------
endlessvoid94
I've always wanted the ability to take a snapshot of a running process, say a
ruby repl, that's thrown an unhandled exception, waiting for user input. Send
that snapshot to a friend, who can debug the issue, send back, etc.
Collaboration on a live system over the internet.

Yes, exactly like Smalltalk. Sigh.

------
ivank
I used [http://criu.org/](http://criu.org/) a while ago to do this for Clojure
REPLs

~~~
gnufied
Looks really nice, except for having to compile custom kernels, I guess.

~~~
sdsk8
But it is in user space, so why the need to compile a kernel?

~~~
x7f
Some distros (e.g. Arch) don't have CONFIG_CHECKPOINT_RESTORE enabled by
default.
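
One way to check a running kernel, as a sketch: it assumes the config is exposed at `/proc/config.gz` (needs CONFIG_IKCONFIG_PROC) or at `/boot/config-<release>`, which many but not all distros provide. The function names are made up here:

```python
import gzip
import os
import platform

def option_enabled(config_text, option="CONFIG_CHECKPOINT_RESTORE"):
    """Return True if the kernel config text has the option built in (=y)."""
    return any(line.strip() == f"{option}=y" for line in config_text.splitlines())

def read_kernel_config():
    """Return the running kernel's config text, or None if it is not exposed."""
    candidates = ["/proc/config.gz", f"/boot/config-{platform.release()}"]
    for path in candidates:
        if os.path.exists(path):
            opener = gzip.open if path.endswith(".gz") else open
            with opener(path, "rt") as config_file:
                return config_file.read()
    return None

if __name__ == "__main__":
    text = read_kernel_config()
    if text is None:
        print("kernel config not found in the usual places")
    else:
        print("CONFIG_CHECKPOINT_RESTORE enabled:", option_enabled(text))
```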

------
mwcampbell
Very cool. Would be even nicer if the snapshotting process created a self-
contained ELF executable, rather than a snapshot file that needs a separate
executable to start it back up. Seems like it should be feasible.

Since the README mentions kentonv, I wonder if this would be useful within
Sandstorm.

~~~
kentonv
Sandstorm is the motivating use case. :)

------
vezzy-fnord
This is a nice PoC tool, but as a checkpointing solution the idea of using
ptrace(2) just reeks of being a huge hack.

If you want to do syscall replaying, check out the Approximate Replay-Trace
Compiler (ARTC).

For a full checkpointing suite on modern Linux, see CRIU or DMTCP.

------
jbverschoor
Sounds like an ideal tool for debugging.

Trace while running, take a snapshot when there's something wrong.

~~~
sanxiyn
If you want a debugging tool, try [http://rr-project.org/](http://rr-
project.org/)

~~~
mateuszf
I'd imagine that it has some overhead, so running it in production all the
time would not make much sense. Sometimes restarting a process makes it very
hard to reproduce the bug, but snapshot / restore would handle that case. The
problem is: what to do with file / network descriptors?
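
A first step would be just finding out what a process has open. A Linux-only sketch that reads `/proc/<pid>/fd` (function names are made up; the categories are a rough guess at what matters for restore):

```python
import os

def classify_fd_target(target):
    """Roughly classify a /proc/<pid>/fd symlink target."""
    if target.startswith("socket:"):
        return "socket"
    if target.startswith("pipe:"):
        return "pipe"
    if target.startswith("/dev/pts/") or target == "/dev/tty":
        return "terminal"
    if target.startswith("/"):
        return "file"
    return "other"  # e.g. anon_inode entries such as eventfd or epoll

def open_descriptors(pid="self"):
    """Map each open fd number to (target, kind) for the given process."""
    fd_dir = f"/proc/{pid}/fd"
    result = {}
    for name in os.listdir(fd_dir):
        try:
            target = os.readlink(os.path.join(fd_dir, name))
        except OSError:
            continue  # fd was closed between listing and readlink
        result[int(name)] = (target, classify_fd_target(target))
    return result
```

Plain files can probably be reopened by path and seeked back to a saved offset; sockets and pipes are the hard part, since the other end has to cooperate.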

------
haddr
This is a really great tool! I wonder if it could work with Java programs...
Can't wait to try it...

------
borkabrak
I'm personally excited to see if I can get this to allow me to persist tmux
sessions over a reboot.

Thoughts?

------
curiousjorge
How does this compare to DigitalOcean's loading from snapshot? It takes 30
seconds or more to load your snapshot. Is this any different? I'd love to not
rely on DO because this is one of the key selling points of DO: saving a
snapshot and loading it.

~~~
sciurus
This is about snapshots of a single program. What you are describing at DO is
for an entire virtualized computer.

~~~
curiousjorge
If I wanted something like DO, what can I use?

If this works on a single program, what sort of use cases would it come in
handy for? Perhaps a Java program that has a bit of startup time?

~~~
scintill76
VirtualBox has snapshots. Here's a page with an image:
[http://ehikioya.com/moving-virtualbox-vm-
snapshots/](http://ehikioya.com/moving-virtualbox-vm-snapshots/) The page is
talking about cloning, but you can save/restore snapshots within the same VM
too.

