Hacker News
Unfork() (github.com)
568 points by salgernon 7 months ago | 114 comments

> Permission to read from or write to another process is governed by a ptrace access mode PTRACE_MODE_ATTACH_REALCREDS check

I presume that is necessary for this in addition to belonging to the same UID?

> As far as I know process_vm_readv isn't even detectable if the agent process is more privileged than the examinee process—so you're free to manipulate your private copy of the application in the comfort of your own address space.

Interesting. This would be really useful in debugging. Many issues don't reproduce except for in specific configurations. Having access to the memory dump of the live process "streamed" to the debugger would be great!

"a PTRACE_MODE_ATTACH_REALCREDS check" and "same UID" are roughly synonyms. ("Roughly" because the actual check also permits privileged processes and denies some special cases such as processes that have done a setuid. REALCREDS, if I remember right, is in contrast to the check done on certain files in /proc.)

It's the same check as ptrace itself, so the intuition is "can I strace or gdb this process."

The big exception to this is inside containers where it is often disabled by default.

AIUI Docker containers by default deny the ptrace syscall (and presumably process_vm_readv/writev), they don't change the permission check. So /proc/$pid/mem, which uses the same permission check, ought to work.
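As a minimal sketch of the /proc/$pid/mem route mentioned here, the Python below reads a buffer back out of the process's own address space (Linux only). Reading your own process always passes the access check; pointing the same helper at another PID would be subject to the ptrace access check discussed above. The helper name `read_process_memory` is made up for illustration.

```python
import ctypes
import os


def read_process_memory(pid, address, length):
    """Read `length` bytes at virtual `address` from /proc/<pid>/mem (Linux only)."""
    fd = os.open(f"/proc/{pid}/mem", os.O_RDONLY)
    try:
        # The file offset is interpreted as a virtual address in the target.
        return os.pread(fd, length, address)
    finally:
        os.close(fd)


# Demo on our own process, so no extra privileges are needed.
secret = ctypes.create_string_buffer(b"hello from my own address space")
data = read_process_memory(os.getpid(), ctypes.addressof(secret), len(secret.value))
print(data)
```

For a foreign process you would first need to know which addresses are mapped, for example by parsing /proc/<pid>/maps.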

(This also means that you don't want or need CAP_SYS_PTRACE to get gdb/strace working in Docker: that capability lets you ptrace anything and, coincidentally, also turns off the syscall filter. Just turn the filter off; that works without privileging the processes in the container.)

Or if you have an LSM such as Yama or SELinux set to deny ptrace globally.

Yama lets you ptrace a child process but not others.

By default, yes.

You can also disable ptrace() completely.

(Grep for "ptrace_scope" in the ptrace(2) man page for details.)

> Having access to the memory dump of the live process "streamed" to the debugger would be great!

This is also possible with standard debuggers such as GDB: it can attach to a running process and not only examine its memory, but also inspect the stack and control execution (stop, step, continue, ...). Usage is as simple as gdb -p $(pidof my_running_program)

You might also find https://rr-project.org/ interesting -- it lets you step backwards too.

Interesting. Without letting storage explode, I can't think of an easy way to do this since computation isn't really reversible.

You can keep track of the before and after states whenever you do something nonreversible, like a syscall.

And even then you can probably just store the diff instead of a full image. And even then if you run out of memory you can just start evicting the oldest snapshots.
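To make the "store diffs, evict the oldest" idea concrete, here is a toy sketch in Python. It is not how rr or gdb record execution; the `DiffJournal` class and its dictionary "memory" are invented purely for illustration.

```python
from collections import deque

_MISSING = object()  # marks keys that did not exist before a change


class DiffJournal:
    """Toy undo log: remember only what each step changed, bounded in size."""

    def __init__(self, max_entries):
        # deque(maxlen=...) silently drops the oldest diff when full.
        self.diffs = deque(maxlen=max_entries)

    def apply(self, state, changes):
        """Apply `changes` to `state`, recording the old values first."""
        undo = {key: state.get(key, _MISSING) for key in changes}
        self.diffs.append(undo)
        state.update(changes)

    def step_back(self, state):
        """Undo the most recent step; return False if nothing is left."""
        if not self.diffs:
            return False
        for key, old in self.diffs.pop().items():
            if old is _MISSING:
                state.pop(key, None)
            else:
                state[key] = old
        return True


memory = {"x": 1}
journal = DiffJournal(max_entries=2)
journal.apply(memory, {"x": 2})
journal.apply(memory, {"y": 10})
journal.apply(memory, {"x": 3})   # evicts the oldest diff (x: 1 -> 2)

journal.step_back(memory)         # undoes x: 2 -> 3
journal.step_back(memory)         # undoes the creation of y
print(memory, journal.step_back(memory))
```

Once the oldest diff has been evicted, stepping back stops at `x == 2`; the original `x == 1` is no longer reachable, which is the price of bounding the history.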

There's a lot of nonreversibility in the way a processor interacts with memory though.

gdb supports reverse debugging too

Right. But the difference is modifying any memory under GDB will be seen by the process. It's not copy-on-write

Isn't that a feature for debugging tho?

Absolutely. I was arguing for "unforking" into, say, a patched version of the process and verifying a fix without actually modifying the live process.

I know Windows doesn't get too much love here. But we have to admit that Win32 has had this kind of feature for ages: process access routines such as OpenProcess() [1] coupled with ReadProcessMemory() [2] will do the job in a clean way.

Taking a snapshot of other processes is also a basic use case of this family of functions [3].

[1] https://docs.microsoft.com/en-us/windows/win32/api/processth...

[2] https://docs.microsoft.com/en-us/windows/win32/api/memoryapi...

[3] https://docs.microsoft.com/en-us/windows/win32/toolhelp/taki...

I appreciate win32 much more than a typical HN user. But let's not fall into the trap of seeing it as the only way or that certain features haven't also been on Unix for decades.

> that Win32 has had this kind of feature for ages: process access routines such as OpenProcess() [1] coupled with ReadProcessMemory() [2]

And Unix has had ptrace(2) for a very long time too, which will accomplish the same thing. Also you can read another process's memory through /proc. So there are multiple paths.

Also keep in mind, the most "legit" use case for this stuff is for writing a debugger. If you have used a debugger you are already relying on this functionality being there.

Edit: I am not aware of a win32 equivalent of this thing that lets you easily handle another thread's page faults in user mode though. That seems a little wacky. You can use debugger APIs to handle "STATUS_IN_PAGE_ERROR" and "STATUS_ACCESS_VIOLATION", which might get you there.

> And Unix has had ptrace(2) for a very long time too

A flaw in the Linux implementation, though, prevents one from running ptrace on a process that is itself using ptrace.

As more programs use ptrace, this flaw is becoming quite annoying.

The APIs are broadly equivalent, see https://nullprogram.com/blog/2016/09/03/

What unfork does is more complicated than a mere read, though. I'm still not entirely clear on its use case, but it does all sorts of tampering that the code comments describe as "cursed." It also seems to be specifically targeting applications which have anti-debug measures.

"Debuggers" implemented mostly through ReadProcessMemory / mach_vm_read / process_vm_readv / pread are all intended to defeat anti-debug mechanisms, though; I'm not clear how unfork is meant to make the process simpler, but it looks intriguing.

process_vm_readv and its equivalents allow you to read data whose layout you already know. unfork allows you to read data whose layout you don't know, or that is partially generated, by calling the accessors that the application already has for that data instead of reverse-engineering them and doing the transformation yourself.

unfork appears to be unique in that it creates the illusion of mapping the target process's memory into the source. This is achieved through the use of userfaultfd, which allows a Linux process to mark memory as missing, to receive notifications when other threads attempt to access missing memory, and to provide the contents of that memory in response to such faults. This mechanism is quite powerful and flexible, and Windows does not have a direct equivalent of this.

The closest equivalent I can think of in Windows would be to mark pages as no-access and use vectored exception handling to trap access faults. During a fault, the exception handler would fill in the page (e.g. via ReadProcessMemory) and flip the page protection to read or read/write.

Since you wouldn't want to flip the page protection until after the memory had been updated, you would probably have to use a pagefile-backed section to update the memory at a separate virtual address with independent page protections. And unlike the userfaultfd approach, this mechanism would not help for cases where the mirrored memory was being passed to a syscall.

I think Linux could do this too, via a signal handler, but AFAIK the Linux memory manager does not efficiently support per-page access protection (unlike Windows). In the worst case, each page would get its own vma structure in the kernel, which would be quite expensive. So absent userfaultfd, the Windows memory manager probably has the edge.

I mean, /proc/foo/mem has also exposed that feature forever on Linux.

How clean is it? Can you simply exec the result?

Glibc used to have unexec(), which is fairly old, but it was removed because nobody used it (except Emacs, and there were better solutions to the problem it was solving).

> How clean is it?

It's as clean as any official Win32 API which uses their privilege system to restrict/allow accesses to each and any bit of information on the process state and/or memory.

> Can you simply exec the result?

This is possible using CreateRemoteThread() [1], which creates a thread inside another process's execution context.

[1] https://docs.microsoft.com/en-us/windows/win32/api/processth...

> Glibc used to have unexec()

My understanding is that unexec() was more about making a snapshot of the whole process state to an executable on disk.

That is my understanding too. Solaris had a flag for dldump (https://docs.oracle.com/cd/E19455-01/806-0627/6j9vhfmop/inde...). Emacs moved to a portable dumper (maybe inspired by XEmacs).

Emacs had its own unexec().

And for something completely different - but in the same vein - a stand-alone 'cd' binary - https://github.com/robertswiecki/extcd - enjoy!

For those who aren't familiar with the significance of this, "cd" is a shell builtin because the working directory is per-process state. So while it's perfectly valid to write a program that does a chdir(2) and then exits, it's only changing its own working directory, which is pretty useless.
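A small sketch of that per-process behaviour, using Python in place of a shell: the child process changes its own working directory, and the parent's is left untouched. The temporary directory is just a convenient target for the demonstration.

```python
import os
import subprocess
import sys
import tempfile

before = os.getcwd()
target = tempfile.gettempdir()

# The child changes *its own* working directory and reports it.
child = subprocess.run(
    [sys.executable, "-c",
     "import os, sys; os.chdir(sys.argv[1]); print(os.getcwd())", target],
    capture_output=True, text=True, check=True,
)
child_cwd = child.stdout.strip()
after = os.getcwd()

print("child:", child_cwd)
print("parent unchanged:", before == after)
```

This is why an external `cd` has to reach into the shell's process (via ptrace or a debugger) rather than simply calling chdir(2) itself.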

isn't this more or less the same as

    gdb -batch -n -ex 'call chdir("whatever")' -p $$

That does seem to be conceptually pretty much what it's doing. Except your version works on more architectures.

It's a simple example of ptrace() though.

Totally off topic, but it's funny to think about what kind of features we would get if we put "un-" in front of each syscall:






Emacs somewhat famously uses "unexec" in its build process: you build a skeletal Emacs in C (mostly the Emacs Lisp implementation), run it, load and compile and process a bunch of Lisp that implements the editor itself, and dump the resulting process memory back out to disk. The result of this eldritch process is the final emacs binary. When you exec emacs, you get an environment that consists of the editor code ready to go.

I'm given to understand that the macOS implementation of malloc had to have special-cased code in it just to support emacs due to this approach.


Because linkers are too easy?

It really looks like some overly clever college student's weird trick that somehow managed to survive for decades in an established product.

It comes from a time of machines executing instructions thousands of times slower than they do now. Literally – thousands. Memory access was about as fast as an instruction execution, so the amount of compute you can justify per unit of data was hundreds of times less than it is now. They did however have virtual memory systems with on demand page fetching.

Also, that machine was being time shared with a dozen or more users.

Launching emacs or TeX on this machine might take tens of seconds without access to unexec(), but only 3 seconds for the freeze dried version.

unexec() was easier at the time. There were no shared libraries, no address space layout randomization. One memory region grew up from the bottom, one down from the top. There was no mmap() jamming mysterious stuff in the middle. Just copy the bottom, copy the top, do magic to adjust the stack for your unexec() call, and write the thing out as an executable.

(Yeah, I excised unexec() from BibTeX back in the ‘80s to port it to a 68k Mac for a coworker, then later implemented unexec() for a Motorola 88k based multilevel secure SysV system in the early ‘90s because launching emacs was driving me insane. I prefer our shiny new future of stupidly fast computers.)

"Removing support for Emacs unexec from Glibc" -- https://lwn.net/Articles/673724/

Interestingly, even if Emacs removes this I see Apple being forced to keep their hack in place as they're not likely to update their version of Emacs anytime soon…

squints Sooo... Elisp has AOT compilation!

The idea is somewhat similar to Android's zygotes, isn't it?


unselect(2)

Description: Put thread to sleep as long as there is activity on any fd, wake up only when all fds are inactive.

Useful for: Scheduling work to be performed only when server is idle.


Description: Select a random file, load it into the buffer cache, and remove it from file system.

Useful for: Freeing up some disk space in a pinch.


Description: Resurrects the previous child process.

Useful for: Implementing the !! operator in bash.


unsignalfd(2)

Description: Invoke signal handler whenever a given fd activates.

Useful for: User space interrupts.


unopen(2)

Description: Create a file which refers to an open fd.

Useful for: Implementing /proc/self/fd functionality.

> unselect(2) > Useful for: Scheduling work to be performed only when server is idle.

Yeah that's nice.

> unsignalfd(2) > Useful for: User space interrupts.

There is libfam. At least on my system, it doesn't have a manual page.

> unopen(2) > Description: Create a file which refers to an open fd.

That does sound useful, and I don't know any library that does it.

The "unopen" syscall actually exists, albeit under a different name: linkat


I don't think it's the same thing. Linkat is still starting with a file that exists in the filesystem.

In my silly world, unopen() would just take any fd (socket, file, pipe, etc.) and create a file system binding which anyone could open. Kind of like how /proc works on Linux today.

Surely, "unseek" should be called "lhide".

ungetc does actually exist.

Or `unlink`.

ununlink would be pretty useful. Then again, this kind of exists; I have fond memories of a panicky younger me struggling with undelete in DOS.

ununlink is link() on a /proc/<pid>/fd/<fd> entry. Assuming some process still has the file open, that is.
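The premise here, that an unlinked file stays reachable while some process holds it open, can be checked with a short Linux-only Python sketch. It only reads the contents back through /proc/self/fd; it does not attempt to re-link the file, and whether linking succeeds depends on the details discussed in the replies.

```python
import os
import tempfile

fd, path = tempfile.mkstemp()
os.write(fd, b"precious data")
os.unlink(path)                      # the name is gone...
name_exists = os.path.exists(path)

# ...but the inode lives on while the descriptor stays open (Linux /proc).
with open(f"/proc/self/fd/{fd}", "rb") as f:
    recovered = f.read()
os.close(fd)

print(name_exists, recovered)
```

Copying the data out this way is the fallback when re-linking is not possible: it recovers the contents, not the original inode.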

Does that mean that one could implement "undelete" by changing open() so that some central process (let's call it "recycle-bin") also opens a copy at the same time a process opens them (but keeps it open until you send it a "empty" signal), and then calling that link()?

You could use LD_PRELOAD to replace unlink with a version that passes an fd to a running recycle-bin daemon. That approach doesn't really have much, or any, advantage over just moving files to a recycle-bin folder and keeping track of items in there.

It would "automatically empty" the recycle bin at each shutdown though.

Doesn't it have the advantage that "rm" and co are then made automatically undo-able?

(Whereas moving to recycle-bin is a manual process you need to remember to do).

  alias rm trash

Also a manual process -- and per user at that...

link() doesn't work for this use case.

You need linkat() with the AT_SYMLINK_FOLLOW flag enabled.





unbowed unbent unbroken

> Nevertheless, I think that with some effort two allocators or even dynamic linkers could survive together.

Famous last words.

> How limited is this approach?

> A: It's true that meshing address spaces is much harder than copying them. ... [truncated] ... 64-bit systems with ASLR are far more forgiving. Nevertheless, I think that with some effort two allocators or even dynamic linkers could survive together.

That is a really cool side effect of ASLR!

[1]: https://en.wikipedia.org/wiki/ASLR

Does it pause the process whose memory it is copying?

Freezing the process can affect its correct operation. (Sometimes when I need a memory dump of a production Java app, I can't take it because I cannot afford to freeze a production app.)

Without the freeze, the memory copy we get can be inconsistent.

I have no idea but the FAQ says it's CoW.

If I understand it correctly, it's more of a Copy-on-Read/Write, and as opposed to fork, it's only one-sided: read/write is only detected and results in a copy on the unfork side; if the original process changes memory, it doesn't result in a copy as nothing is monitoring this (the userfaultfd only monitors the unfork side).
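A toy model of that one-sided behaviour, written in plain Python rather than with userfaultfd: pages are copied from the "original" only when the mirror first touches them, and nothing watches the original for later changes. The `LazyMirror` class and the 4-byte page size are invented for illustration.

```python
PAGE_SIZE = 4


class LazyMirror:
    """Toy model of a one-sided, fault-driven copy of another address space."""

    def __init__(self, source):
        self.source = source      # the "original process" memory (a bytearray)
        self.pages = {}           # pages already copied to our side

    def _fault_in(self, page_no):
        # Analogue of handling a missing-page fault: copy on first touch.
        if page_no not in self.pages:
            start = page_no * PAGE_SIZE
            self.pages[page_no] = bytearray(self.source[start:start + PAGE_SIZE])
        return self.pages[page_no]

    def read(self, addr):
        return self._fault_in(addr // PAGE_SIZE)[addr % PAGE_SIZE]

    def write(self, addr, value):
        self._fault_in(addr // PAGE_SIZE)[addr % PAGE_SIZE] = value


original = bytearray(b"AAAABBBB")
mirror = LazyMirror(original)

first = mirror.read(0)        # page 0 is copied now
original[0] = ord("X")        # original changes a page we already copied
original[4] = ord("Y")        # original changes a page we have not touched yet
mirror.write(1, ord("Z"))     # our write stays private to the mirror

print(chr(mirror.read(0)), chr(mirror.read(4)), original)
```

The mirror ends up with a stale page 0 and a fresh page 1, which is exactly the kind of inconsistency raised above when the original process is not frozen.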

The ideal approach would be if it turned the original process memory into "copy on write", and created a paused exact copy of that process. This would give a consistent, immutable, snapshot of the target process memory, without freezing for the duration of actual memory copying.

One could then take a core dump, java heap dump, or similar, of the paused copy process.

I'm curious, why does the tool try to copy the original process memory into the memory of the tool itself, risking a collision? Is it impossible to create a third process - an exact copy of the original process?

The FAQ says:

> all while leaving no ptrace and sending no signal

If this is a design goal, I'm afraid it is indeed impossible to take a snapshot of the original process. As far as I know (I researched the status quo 2 years ago when I needed copy-on-write for VM cloning/forking), the only way to make a snapshot of a process' address space is to invoke the clone (fork) system call. If you need to take a snapshot of another process, then you need ptrace.

But you're absolutely right that the unfork functionality itself can be implemented more robustly by doing this ptrace/fork trick.

I've read the quirky FAQ and would now be interested in what this really does. The Readme mentions some demo code that's not in the repository. It also instructs us to run this on cat and enjoy, but... what would we observe and enjoy?

I wonder if 64 (well, 48) bits of address space is enough to glom together every process on a normal Linux boot without collisions…

If you assume each process needs, say, 16MB of contiguous space, then you get 48-24 bits left, which by the birthday paradox implies you can have up to 2^(24/2 = 12) ~= 4k processes before you start colliding about half the time.
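That back-of-the-envelope estimate can be checked with the standard birthday approximation. The sketch below makes a simplifying assumption that is not in the comment: each process lands in exactly one of 2^24 aligned 16 MB slots, chosen uniformly at random.

```python
import math


def collision_probability(processes, slots):
    """Chance that at least two of `processes` pick the same one of `slots`."""
    # Standard birthday approximation: 1 - exp(-n(n-1) / (2N)).
    return 1.0 - math.exp(-processes * (processes - 1) / (2.0 * slots))


ADDRESS_BITS = 48
REGION_BITS = 24                      # 16 MB = 2**24 bytes per process
slots = 2 ** (ADDRESS_BITS - REGION_BITS)

for n in (1024, 4096, 8192):
    print(n, round(collision_probability(n, slots), 3))
```

Under this model 4096 processes collide with probability of roughly 0.39, so "about half the time" is the right order of magnitude; the 50% point sits a little higher, near 1.18 * sqrt(2^24).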

> If you assume each process needs, say, 16MB of contiguous space

Unfortunately I’m not sure that’s a good assumption, due to the stack and heap needing to exist even for statically-linked binaries.

Yes, but simply replace the number of processes with 'average number of allocations * number of processes' and the 16MB with 'average size of allocation'.

Isn't that a bit similar to what debuggers typically do when you ask them to attach to a given process?

Debuggers touch other processes from afar. This merges the debuggee into yourself.

No, it merges a copy-on-write clone of the debuggee into yourself. That's quite different and, indeed, you can do similar things with it that a debugger could.

If I understand this right, the process being unforked into you won't notice a thing and will happily chime on.

> No, it merges a copy-on-write clone of the debugger into yourself.

But you are the debugger…

> If I understand this right, the process being unforked into you won't notice a thing and will happily chime on.

Right, whereas when running an actual debugger you need to deal with signals and making sure you don't touch memory.

> But you are the debugger…

Ah thanks. My autocomplete didn't like the word "debuggee". Edited!

This should go in the FAQ! Thanks for the concise explainer for folks like me who aren't familiar with the domain.

The inverse of fork is called join, right?

In the git context it's a merge. But Unfork() seems more like a rebase.

In the git context, fork doesn’t really mean anything. That’s just a github thing.

You could build an entire new array of malware with this :D

userfaultfd is an extremely intriguing hammer :)

The write protected mode[1], if it ever gets merged could have some interesting uses for GCs.

[1] https://lore.kernel.org/patchwork/cover/1033856/

I first learned about userfaultfd's utility to GC from https://medium.com/@MartinCracauer/generational-garbage-coll...

I could imagine it being interesting for VMs and DBs too. Imagine a VM whose memory "looks" like virtual memory but under the hood is transparently persisted between invocations.

That's kind of one of the main uses of the API so far by QEMU for live migration of VMs by streaming memory on demand over a network https://wiki.qemu.org/Features/PostCopyLiveMigration

Seems like this would be useful for re-attaching to a shell or process you have either disowned or otherwise lost control of. Would be neat to see some common use cases on the FAQ page.

Is it just me, or is this something that would make Linux even more vulnerable to cyber attacks? What protections are there? Would OpenBSD's pledge prevent something like this?

This is a hack.

I like it.


if only "hackernews" had more hacks like this xd

Since it brings two processes together, maybe spoon would be a better name.

Or, along the same lines, another four-letter word sharing two of its letters with _fork_. But that might be too distracting.


I was thinking Fnnl (pronounced funnel)

But that's probably the name of a startup and would confuse people.

I don't know why spoon would be a better name, but I massively approve of it

Because merging two address spaces together is akin to the act of spooning.

Whitequark is a Wizard. They should team up with Sammy.

Sorry for my ignorance: who's Sammy? Surely not: https://news.ycombinator.com/user?id=sammy ?

This Wizard https://www.samy.pl/

Whitequark is better than Sammy in SW. He's also the maintainer of SolveSpace. https://m-labs.hk/software/solvespace/

According to her Twitter, Whitequark prefers feminine pronouns: https://twitter.com/whitequark

From the parent comment I thought Whitequark was a group/team of people.

English is also not my native language.

"They" can mean either a group of people, or a single person of unknown gender (or neutral gender).

FYI, whitequark is a woman.

Sounds like alternatives for Git :) But seriously, need to read more on this.

alternative for git? i don't follow. can you elaborate?

I think they just read the title and assumed it's a repository fork? No clue either.

Yes, apologies just having fun. At first glance, I thought it looked like Linux Alternatives.
