Hacker News
The memory remains: Permanent memory with systemd and a Rust allocator (darkcoding.net)
132 points by petercooper on Jan 10, 2024 | hide | past | favorite | 76 comments



Without specifying a particular representation[0] on the struct you want to put in "permanent memory", this seems liable to break any time you recompile the code with different optimization flags, or with a newer or older compiler that may optimize differently, or even if you just change seemingly unrelated code that then triggers or de-triggers some optimization.

[0] https://doc.rust-lang.org/reference/type-layout.html#represe...


The rkyv crate has stable in-memory representations for many types in the Rust standard library: https://rkyv.org/

I feel like it is the missing piece here because you could do hash table lookups in shared memory, too!


Yup, at a minimum they should apply #[repr(C)] to avoid this optimization.


Worse, it breaks all pointers since you can't guarantee `mmap` can give you the same address.


You absolutely can, with `MAP_FIXED`. But it's not trivial to do because you'd have to handle the case where the rtld and/or the kernel has already put something in memory where you want to put down your map. So it'd be tricky to implement, but definitely possible.


That can be fixed with a linker script, to reserve the range using ELF headers.

But then you're going to hit other problems. This isn't a new idea. Presumably one of the reasons you're restarting your server is to upgrade it ...


I think you could fixup the pointers if you tried. It's more or less the same problem as compacting GC.


Certainly there are solutions, but they aren't transparent, and at some point you have to ask "would serialization be easier?"


This is exactly what I thought: why not just save state to a file?


You can detect when it doesn't, though, I think. If memory serves, you can request a specific memory address to map it at and it'll do it if it's not already mapped. It should (in theory?) return an error in such a case, no?



I believe on conflict it will just fall back to placing it where it can by default. It will return the address it actually used, so you can catch the mismatch.


Good point, thank you! I added `repr(C)`.


At the end, std::thread:: is rendered as std:<thread emoji>:

Cool article, I learnt about a new systemd feature and custom allocators in Rust!


Ah, it's "thread", thanks! I probably need my morning coffee ... I don't know Rust in any detail and was wondering what std::spool is.


I had no idea systemd can do that. It sounds kinda exploitable. Is the ability to do that even a good idea?


Systemd is awesome, but at this point it has so much functionality that it's hard to know what it can do.

It needs to _seriously_ improve discoverability and documentation.


FWIW, there are other (portable) services which implement an fd store too.


Fd stores are an awesome way to do this too. We used to start up a new version of the process and then they would have to find each other and pass the fds. It's pretty complicated and it's a weird model for the supervisor.


The functionality is key for socket-activated daemons, which is a Big Idea behind systemd: it can launch daemons on demand instead of at boot, allowing for faster boot times.


> The functionality is key for socket-activated daemons which is a Big Idea behind systemd

https://en.wikipedia.org/wiki/Launchd#Socket_activation_prot...


It is not a launchd innovation; it's from the late 1970s: https://en.wikipedia.org/wiki/Inetd


I don’t fully understand the relevance of that paragraph, but launchd and systemd both work this way, both for similar reasons

I do believe that launchd came first: https://news.ycombinator.com/item?id=2565780


I only linked it because launchd is the source of the "Big Idea", predating systemd by ~5 years.

Perhaps I misunderstood your comment in interpreting it to say socket activation's an original systemd concept.


https://en.wikipedia.org/wiki/Inetd came out in 1978, if we're keeping score.


True! But the launchd style of socket activation integrating into the dependency graph of service startup is what's novel and alleviates some significant challenges to concurrent starting of interdependent services...


The person who raised it just said socket-activated daemon is "a Big Idea" behind systemd. They didn't say or imply that systemd invented it, just that systemd is based on that pattern.


Yes. But the Iron Law of systemd is that you should never mention it unless you’re prepared for a tangential argument about systemd and other init systems.


Yes! I should have explicitly spoken to that, because it wasn't the source of the idea.

I was trying to respond to "Why does systemd have to store file descriptors?" with "That's actually the entire point of its existence"


I don't understand why faster boot times was ever a priority in the first place. I know it's a metric that linux ricers sometimes like to compete on, but normal users just suspend their computers when not in use 99 times in 100.


Linux use in computers that can suspend is negligible compared to the number of instances of Linux in random gadgets that you expect to work as soon as you turn them on.


More than once I've known I have an early meeting and got to my desk just before the clock turned, but by the time I got into the meeting several minutes had passed. I work with people around the world; sometimes you need someone in Germany, India, Australia, and Brazil to all meet at the same time. There is no way to pick a good time for everyone, and the closer you get for one person the worse things get for someone else, so you end up with barely acceptable times for everyone as the best compromise.


It's not faster boot that matters, but that the system can have many more services configured and pending activation. It makes a much more dynamic OS possible, which matters in a world of mobile computers that dock and connect to new hardware and networks all the time.


I was thinking along the same lines---doesn't it allow for "permanent memory leaks"? I mean, if someone has just bugs in their (possibly non-Rust) code, not necessarily trying to exploit anything.


Why jump through hoops with this special systemd fd when you can just create a regular file somewhere and mmap that?


I can’t say I understand that, either. If the goal is to prevent the mmap’ed file from persisting on disk, using tmpfs handles that. If the goal is to prevent other processes from accessing the mmap’ed file, the systemd fd store would have the same issue.


I guess they don't want something that's visible through the file system. This method gives you anonymous storage that is presumably only accessible to something launched via the same systemd service unit.


Deleted files are not visible through the file system either. Deleted files in tmpfs are not stored on disk.

The biggest reason to use memfd_create is that you can seal the memfd so it cannot be resized, and then it's safe to mmap on another side of a trust boundary; if you mmap an arbitrary attacker-controlled FD, the attacker(/poorly written client) can make you segfault. Wayland uses sealed memfds to pass framebuffers.


Yeah, I really don't see it. I think it's just because systems don't come with tmpfs.

Of course tmpfs storage does get written to disk via swap.


It's the "fancy new thing" and rust people having to reinvent the wheel a few dozen times.


Just wanted to observe that there's no need to stress about allocating exactly the amount of physical memory you'll need in memfd_create(). Linux will always allocate pages on-demand (regardless of overcommit settings), so any unused virtual memory costs you essentially nothing. Given that, you should feel free to allocate at least several GB in a memfd (I've actually done this with 32TB, but that large an amount can conflict with overcommit policy, which you may or may not have control over). In general (for 64-bit address spaces) you should never need to worry about reallocating a memfd based on the amount of physical memory you need.


Just curious: why not simply use SysV IPC shared memory, which has been part of Unix-like kernels for a long time already?


Because it is Rust, and Rust people have a severe case of the not-invented-here syndrome


Just represent the data in Arrow memory format, then you can operate on it directly, no ser/de required.


This seems like a very convoluted way to avoid saving a string to a file?


The article is more of a "see this crazy thing you can do" than a "that's a good idea".

It also introduces some neat little known tech.

Even if you "serialize" state to carry it over a controlled restart, the File Descriptor Store might still be an interesting use case (in some cases; in others it might be a problem). Also, you would probably dump bincode or similar for such "carry-over state" serialization; there is very little reason to go with strings.

Though there are some major differences:

- PRO: This approach works with types which do not implement any serialization mechanism, or e.g. cyclic data structures. At least theoretically you could carry in-progress futures over a restart with this (practically it's a horror story to try, and it likely won't work most of the time).

- CON: This approach has issues with types containing any form of pointers, references, and similar. They all also need to use the special allocator, and then there are issues if after the restart the memfd isn't mapped to the exact same address range. This makes it much harder to use; bugs can be subtly introduced and lead to errors you normally can't have in Rust without unsafe code.

- PRO: You don't have to move/copy anything around, so if you e.g. have some huge splatted (i.e. flat in memory) structures this can save quite a bit of startup time (idk, let's say some 1GiB bit mask for some kind of caching).

- CON: Debugging the carried-over state, or e.g. copying it between systems, is not possible or at least much more complex. Reboots might also be an issue (I'd have to check).

In general, for many use cases this is pointless, and for the times where it's not pointless it's often either brittle or too limited and in turn better avoided. Though for a very few use cases it can be a huge boon.


PS:

- PRO (& CON): You don't need to take an explicit action to "save" the state; you do the setup once at the beginning of the program. It's also a CON because, depending on usage, a hard stop can leave you with a state containing partial writes (which can't happen if it's only used on clean shutdown).

PPS:

- Using mmap with MAP_FIXED to get a fixed address every execution to make pointers work _is a potential security vulnerability_ (it should have been called MAP_REPLACE IMHO). On some platforms you can use MAP_FIXED_NOREPLACE to safely do so, but I have no idea how to choose an address as there is always a risk of address layout randomization getting in your way.

- on platforms where MAP_PRIVATE is implemented using a copy-on-write mechanic you could have a pattern where on a start without a fd you create the fd and load some huge data into it and then restart, when started with fd you mmap it with MAP_PRIVATE allowing you to change the data but also to always go back to the clean state by restarting. No idea for what use-case that makes sense but I found it interesting (also haven't tried it out).


> I have no idea how to choose an address as there is always a risk of address layout randomization getting in your way.

Where are you using mmap(MAP_FIXED) without first calling mmap(PROT_NONE) to get the base address?


Then you'd have to save and load the file. Here you can just write code at runtime that works as usual. You can kill the process, come back in a day, and start a new process that has the same objects saved and ready for use. (Ed: ah, systemd limitation, it will only hold data while restarting, alas, but there's no concrete reason it has to be so short.)

This seems like way less work, seems way less convoluted than writing a whole intermediary layer to serialize objects out into some whatever format on disk, & read them back. The objects are just there, primed, ready to go. It harkens back to single level store days, when disk-backing memory was done by the OS (multics). https://gunkies.org/wiki/Single-level_store

Another way of looking at this is as a return to the old days before we made everything convoluted.


Not really - you have to completely change your data model to avoid pointers, which won't save correctly (e.g. using a byte array for the name in the article), compared to just slapping `#[derive(Serialize, Deserialize)]` on your structs. Not to mention all the extra unsafe code.


Easy to cast stones. (Unsafe code especially seems like a very low-quality dig to me, for an allocator: there's going to be some low-level base to a system. And it's not like glibc was authored in a memory-safe language either.)

Sure there's problems. Unix won and multics lost, and we have half a century of accrued complexity around that, and a couple weekends of hacking isn't going to get us all the way back to a perfect Single Level Store.

Is this something we can and should all switch to? No. The costs are high, it's not practical. But I think over time, especially with the rise of CXL and RDMA resurgent, we'll see zero-copy zero-serialize techniques like this grow back and be huge performance wins for us, and be much simpler to handle, even with the adaptation we'll have to do.


If you changed the global allocator to this... and put the stack on it... any reason you couldn't use pointers mostly as normal. You'd run into issues as soon as a library used mmap, but other than that...


If you put the global allocator on this there is a huge load of other issues you can have.

Like needing to allocate a fixed amount of memory for your whole program upfront or juggling multiple fds.

Or that there is some state you can not carry over even with that and if you try you can easily run into soundness issues.

Or that the state saved might be "in between" operations in a very unexpected way and in turn unsound.

etc.


The common place to use something like this would be to mmap an existing external data structure. There are a number of existing mmap-able 0copy k/v library/db formats that fit the bill here.


> Like needing to allocate a fixed amount of memory for your whole program upfront

Can I not do the normal virtual memory thing and allocate far more memory than I actually have hardware for?

> Or that there is some state you can not carry over even with that and if you try you can easily run into soundness issues.

I mean, fair... you're going to have to be careful what you save either way but making saving things the default does seem like more of a footgun.

> Or that the state saved might be "in between" operations in a very unexpected way and in turn unsound.

This applies even if you only use it for some structs? This whole mechanism seems like it would only be sound if it was only used after clean exits.


> sound if it was only used after clean exits.

Yes, except maybe if a Drop implementation is run unexpectedly on exit or similar.

Generally I think using it with anything which has pointers and/or runs Drop is brittle and prone to bugs.

In turn, most things which do not have pointers should be fine with a clean exit.

And anything which consists only of memory where any bit combination is always valid should be sound even on an abort (e.g. a `[u8]` allocated directly in that memory region, or a `[T]` where `T` only has primitive non-allocating types).


You are better off with CRIU at this point, IMO.


I love CRIU and think it's way under appreciated. Both Node and JVM have some cool snapshotting of their stdlibs too that is somewhat similar, and if you really want to optimize your load times, you can go bundle more stuff into these ready to go snapshots, which is so super cool to me. Checkpoint Restore In Vm-space! The patterns here rock.

I do think though that these patterns of having file descriptors of stuff that can be passed around, and having semi-seamless "just map in the object whole" have a ton of super powered ultra-high performance wins to it that deserve a lot of exploration & effort that justify their pursuit.

But Checkpoint Restore in Userspace/criu just works & is great for fast loading a thing. Still, we need at least some fd passing for apps like no-downtime-restart (or smart draining + SO_REUSEPORT); that's a real need. Going further and passing around memfd's is cool.

It's gotten dropped, but for example Apache Arrow had a Plasma object store for this stuff for a while, which I wish were still on the scene (or something like it). Arrow is a particular data format with its own libraries, versus what we have here, which is coding-as-usual(-ish); two different patterns. The second might be harder but I still think it has potential.


Except that a lot of us use third-party code with structs that aren't serializable.

I'm working on a project now where I'm maintaining half a dozen forks just to add a single line.


I think I'm misunderstanding what this can be used for, but something I'd love to see become "perfect" is preserving a GUI application state across reboots/logins/etc. Especially with flatpak GUI apps.


This is called image computing and Lisp/Smalltalk did it. Emacs has a form of it called undump.

I dislike it because it violates the "rule of least power". Basically, if you never have to validate your internal data, you can never tell it's corrupt.


This is going to be possible with CRIU and Wayland applications. I saw a PoC with Qt that did this.


No part of systemd FD store is gonna survive a reboot.


I imagine it would also be an interesting way to recover/continue on from a panic/segfault


The problem is that this can "carry over" structures which are in an unsound state. Especially if the abort was done due to runtime sanity/soundness checks failing, or a segfault due to unsound structures, etc.

As far as I can tell you should only use it for "splatted" data, maybe allowing gaps.

E.g. using it to avoid reloading some kind of multi-1GiB bit mask, which is just a 1GiB blob of bits, would probably work great. Using it on structs which do not contain any pointers (i.e. no Strings, no Vecs, no &str, etc.) would work okay. But the moment any pointers are stored in data carried over that way, things become a huge mess and are basically fundamentally unsafe, as they can easily become unsound.


Sure..assuming the cause of your segfault was not somewhere in that data...as it surely was


Yes, you'd need a "safe mode" restart to avoid repeatedly failing and restarting in a tight loop.


So sort of like…copying a good safe state of .data from an ELF file to RAM, clearing .bss in RAM, and jumping to a good safe instruction address, like the entry point as specified by the ELF file?

</sarc>


This works when no writeable file system is available.


Then mount a tmpfs and use a normal file like a sane person.


It involves systemd. 'nuff said.

Edit: systemd is a religion now, and it's blasphemy to speak against it?


> systemd is a religion now, and it's blasphemy to speak against it?

No, yours is just a low-effort comment that doesn't meaningfully add to the discussion.

I strongly disliked systemd (and its proponents) for a long time. Until I realized it just doesn't matter, and all the hate it gets is just boring and takes time away from useful things.


No, and yes. Systemd is needed to put linux under the thumbs of glowies. The people defending it are paid by your tax dollars


Ah, you simply create a file in memory and map it back to memory.


Systemd can store files / descriptors.. just give it a desktop environment and release it as SystemdOS and be done with it!

build all the functionality into systemd!


Wow, this is very interesting - I can imagine all sorts of caches etc.


Just a few GB of RAM and the feds can audit your state when their zero-day crashes your machine, what a deal.



