Systemd service sandboxing and security hardening (2020) (ctrl.blog)
293 points by capableweb on Jan 19, 2022 | 78 comments



This is a pretty lax policy IMHO, you can go much farther. These days I usually start with this, it's much more strict:

https://news.ycombinator.com/item?id=29976096

Or simply follow whatever `systemd-analyze security` recommends, just make sure you run it on a system with recent systemd.


This is a getting started/101 introduction; it also talks about and recommends systemd-analyze security. There’s a link to part two at the bottom of the article that goes deeper into things.


Which distro has the best out-of-the-box output for:

  systemd-analyze security
Is there a tool like `audit2allow` for systemd units? selinux/python/audit2allow/audit2allow: https://github.com/SELinuxProject/selinux/blob/master/python...

https://stopdisablingselinux.com/


> Which distro has the best out-of-the-box output

I haven't seen any difference between distributions with the same systemd version. Anything with a recent one should do fine. More recent than RHEL8, mind you (which is on systemd 239): for example, a syscall allow/deny analysis is buggy there and asks you to enable some protections, and then disable them. The same unit is analyzed correctly on my desktop with v250 (I use the popular rolling release distribution).

I haven't seen anything like audit2allow. It's probably not especially necessary because of the difference in philosophies: SELinux is deny by default, while in systemd you're playing whack-a-mole anyway, and are expected to add directives one by one until the application stops working. Unit logs usually make it obvious if something was denied.


The usual way I've seen (and do myself) is to just let the process be killed and have its coredump taken, then `coredumpctl gdb $process_name -A '-ex "print $rax" -ex "quit"'` to get the syscall number, then check `systemd-analyze syscall-filter` for whether I want to allow just that one syscall or the whole group it's in.


> The usual way I've seen (and do myself) is to just let the process be killed and have its coredump taken, then `coredumpctl gdb $process_name -A '-ex "print $rax" -ex "quit"'` to get the syscall number, then check `systemd-analyze syscall-filter` for whether I want to allow just that one syscall or the whole group it's in.

Another approach would be to set SystemCallLog= to be the opposite of SystemCallFilter= (negate each group with ~) and then you'll see the call (and caller) in the journal.


At least on my distro (OpenSUSE Tumbleweed, systemd 249, kernel 5.16.0), SystemCallLog doesn't fire for calls that are filtered; the process is killed first. Even if I set SystemCallErrorNumber=EPERM I don't see the audit log generated. The log only gets generated if the syscall wasn't filtered.


How strange. If I take a service and set

  SystemCallFilter=@basic-io
  SystemCallLog=~@basic-io
and then restart it, I get a bunch of audit SECCOMP log messages (and the service fails to start):

  audit[1185972]: SECCOMP auid=4294967295 uid=63946 gid=63946 ses=4294967295 subj==unconfined pid=1185972 comm="(t-online)" exe="/usr/lib/systemd/systemd" sig=0 arch=c000003e syscall=59 compat=0 ip=0x7f5c632f66c7 code=0x7ffc0000
This is Debian bullseye and systemd 247. I wonder if your audit logs are going somewhere else?
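For pulling the syscall number out of records like the one above, a small shell helper is enough. This is a minimal sketch: the function name `seccomp_syscall_nr` is made up for illustration, and the `ausyscall` lookup mentioned in the comments assumes the audit userspace tools are installed.

```shell
# Hypothetical helper: print the numeric syscall from audit SECCOMP
# records read on stdin (one number per matching line).
seccomp_syscall_nr() {
  sed -n 's/.*[[:space:]]syscall=\([0-9][0-9]*\).*/\1/p'
}

# Usage (not run here, it needs journal access):
#   journalctl -a _TRANSPORT=audit --grep SECCOMP | seccomp_syscall_nr | sort -n | uniq -c
# On x86_64 (arch=c000003e) syscall 59 is execve. `ausyscall x86_64 59`
# resolves numbers to names, assuming the audit tools are installed.
```

The resulting names can then be checked against `systemd-analyze syscall-filter` to decide whether to allow a single call or its whole group.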


FWIU, e.g. sysdig is justified atop whichever MAC system.

In the SELinux MAC system on RHEL and Debian, in /etc/selinux/config, you have SELINUXTYPE=minimal|targeted|mls. RHEL (CentOS and Rocky Linux) and Fedora have SELINUXTYPE=targeted out-of-the-box. The compiled rulesets in /etc/selinux/targeted are generated when [...].

With e.g. gnome-system-monitor on a machine with SELINUX=permissive|enforcing, you can right-click the column header in the process table to also display the 'Security context' column, which is also visible with e.g. `ps -Z`. The stopdisablingselinux video is a good SELinux tutorial.

I'm out of date on Debian/Ubuntu's policy set, which could also probably almost just be sed'ed from the current RHEL policy set.

> SELinux is deny by default, while in systemd you're playing whack-a-mole anyway, and are expected to add directives one by one until the application stops working. Unit logs usually make it obvious if something was denied.

DENY if not unconfined is actually the out-of-the-box `targeted` config on RHEL and Fedora. For example, Firefox and Chrome currently run as unconfined processes. While decent browsers do do their own process sandboxing, SELinux and/or AppArmor and/or 'containers' with a shared X socket file (and drop-privs and setcap and cgroups and namespaces fwtw) are advisable atop really any process sandboxing?

Given that the task is to generate a hull of rules that allow for the observed computational workload to complete with least-privileges, if you enable like every rule and log every process hitting every rung on the way down while running integration tests that approximate the workload, you should end up with enough rule violations in the log to even dumbly generate a rule/policy set without the application developer's expertise around to advise on potential access violations to allow.

From https://github.com/draios/sysdig :

> "Sysdig instruments your physical and virtual machines at the OS level by installing into the Linux kernel and capturing system calls and other OS events. Sysdig also makes it possible to create trace files for system activity, similarly to what you can do for networks with tools like tcpdump and Wireshark.

Probably also worth mentioning: "[BETA] Auditing Sysdig Platform Activities" https://docs.sysdig.com/en/docs/developer-tools/beta-auditin...

A bit of SELinux:

  # /etc/selinux/config
  SELINUXTYPE=targeted
  SELINUX=permissive

  $# touch /.autorelabel  # `restorecon /` at boot
  $# reboot
  $# setenforce 1  # redundant 
 
  $ sudo aureport --avc 
  $ journalctl --system -u auditd
  $ journalctl --system  -o json-seq --reverse 
  $ journalctl --system --grep "AVC" --reverse

  journalctl -fa _TRANSPORT=audit
  journalctl -fa _TRANSPORT=audit --grep AVC
  journalctl -a _TRANSPORT=audit --grep 'AVC avc:  denied' -o json | pyline -m json 'json.dumps(json.loads(l), indent=2, sort_keys=True)'


Debian does a lot of sandboxing.


To the point where it breaks logind on NIS setups...


Any system that starts security by blacklisting instead of whitelisting tends to be doomed by upcoming changes.


Alas, no whitelisting option. A service should start in an empty filesystem root without network access - and if we had something as convenient as pledge() also without any allowed syscalls - and then you could only add what is needed.

firejail does this a bit better but it also started out with a blacklist approach and it's more geared towards desktop application use, not system services.


pledge is excellent, but it protects programmers against writing security bugs that have large impact, it doesn’t protect you against the software they write. It’s those programmers who restrict what their tools can do, and who decide when to throw the switch to enable those restrictions.

If you trust those programmers, it’s indeed way more convenient than other tools, if only because it removes the need for configuring things twice. For example, instead of configuring your web server to serve files from /foo/bar/ _and_ telling SELinux that your web server is allowed to read from /foo/bar, you only configure the web server, and it will tell the OS “I shouldn’t read from anything but /foo/bar, starting … now”.

You’ll have to trust the web server to do that, though.


That's what it is intended for. But pledge has nice properties beyond that which are also useful for external sandboxing, such as easy-to-understand syscall groups that the kernel maintains as new syscalls are introduced. If Linux had that, we could, for example, grant stdio+rpath and not worry about the kernel introducing preadv3 and programs compiled against it breaking or performing poorly when isolated. It would also automatically cover the equivalent io_uring implementations and block the equivalent SQEs too.


The problem with using all-encompassing filters to secure applications is that they are crap. Take, for example, something using Bernstein chaining, where permissions are properly separated for each process: with pledge you can restrict access per process, while a global filter still has to allow the highest privilege.


What's the problem with firejail? Start with an empty profile, blacklist everything, and whitelist only the stuff you need. It works just fine for server applications, and unlike systemd isolation flags you can set up a proper separate firewall with the `netfilter` option.


Firejail has had multiple sandbox escape vulns in the past. Firejail is an SUID executable in which sandbox escapes can lead to privilege escalation. In contrast, Systemd allows you to run services as unprivileged users, and even create users on demand.

Systemd also supports firewalling: it supports IP address allow/deny policies, ports, etc. For more advanced firewall policies you're probably better off using an actual firewall daemon like firewalld or ufw.


> blacklist everything,

That isn't a whitelist approach.


> blacklist everything, and whitelist only the stuff you need

That _is_ a whitelist approach.


Not really. For example, if you have a construct like "blacklist *" and that wildcard is evaluated at construction time on some overlay filesystem, then additional entries may sneak in later from the lower layer of the overlay, because the wildcard expansion doesn't get updated.

On the other hand if you start with a blank slate filesystem root and only bind exactly the whitelisted paths then there is nothing to leak through.

There are other ways in which blacklist-all can fail to be equivalent to whitelisting.


That is exactly an allowlist approach.


One of my favorite podcasts, Risky Business[0] regularly plugs Airlock[1]. They seem like they might be the one out front, at least as a paid service.

[0] https://risky.biz/netcasts/risky-business/ [1] https://www.airlockdigital.com/


If you want to test these settings, I can recommend `sudo systemd-run -p "DynamicUser=yes" -p "ProtectSystem=yes" -p "ProtectHome=yes" --shell`, but run it from a readable directory like /tmp or you will get an error.


This is a very handy command in day-to-day work, actually. For example, I use it to limit the total amount of memory available to an application, including page cache:

  $ systemd-run --user --scope --property=MemoryHigh=1G qbittorrent
It works just as you'd expect — if qbittorrent's working set goes above 1024 MiB, it pushes the least recently used page out of the page cache. Doesn't really have any effects on upload or download speeds, while helping to keep more useful data in memory.

Many isolation flags are not available in `systemd-run --user`, though, so if you'd like to have some protection you either have to combine `sudo systemd-run` with `su -c`, or wrap the command in firejail.

https://github.com/netblue30/firejail/


I have a bash alias for `make` and `ninja` to do something similar. Just having all the spawned processes in a cgroup helps with system interactivity while building. This works because the kernel will then schedule the whole build as a single unit against the other work on the system, rather than scheduling each process that the build spawns against every other process that I'm running.
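A sketch of what such a wrapper could look like. The function name `in_scope` and the `DRY_RUN` switch are hypothetical, and the real alias may well differ; only the `systemd-run --user --scope` invocation comes from this thread.

```shell
# Hypothetical wrapper: run a command in its own transient scope (cgroup)
# under the user manager. With DRY_RUN set, only print the command line.
in_scope() {
  ${DRY_RUN:+echo} systemd-run --user --scope --quiet -- "$@"
}

# Route build tools through the wrapper. systemd-run executes the real
# binaries, so these shell functions do not call themselves recursively.
make()  { in_scope make "$@"; }
ninja() { in_scope ninja "$@"; }
```

With this in a shell profile, `make -j16` lands in a single scope unit, and resource properties such as `--property=MemoryHigh=...` could be added to the `systemd-run` call in the same way as above.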


Interesting, a few months ago I tried using systemd-run to implement unprivileged memory limits for a process and I'm pretty sure it didn't work with the user manager. Is this a recent addition? (I'm not sure what version of systemd I had at the time.)


Ooh, is this a good way to sandbox execs like ImageMagick or stuff like that?


Use firejail, it's a "one click" solution with prepackaged profiles.

https://github.com/netblue30/firejail/

It uses the same kernel knobs as systemd does, but is more user-friendly and has more features.

I use it for every application that handles data received from other machines: books, images, documents, whatever.


You can also use Bubblewrap, but getting it up and running requires a lot more fiddling around. For example, this is what I use to isolate Zoom from the rest of my system: https://gitlab.com/yorickpeterse/dotfiles/-/blob/0a0492c78b6...

In my case I'm using Bubblewrap because Firejail was only used for Zoom, and this felt a bit of a waste considering Bubblewrap was already installed.


Apache needs to start as `root` but then drops to a non-privileged user. systemd's `User=<user>` can't really express that. Perhaps there could be an option that says a unit runs as root until the first fork, after which it has to be a specified user, e.g. `ForkUser=apache`.


I don't know about httpd specifically, but many applications want root only to be able to bind to a privileged port (like :80). This can be circumvented in one of a few ways:

1. add this to .service

  AmbientCapabilities=CAP_NET_BIND_SERVICE
2. or listen on :8080 and use NAT:

  iptables -t nat -I OUTPUT -p tcp -o lo --dport 80 -j REDIRECT --to-ports 8080
3. or make the port unprivileged

  sysctl -w net.ipv4.ip_unprivileged_port_start=80
It may work for httpd too, I haven't tested it.


Apache also uses the start user to read stuff like TLS private keys, that its normal user does not have access to.


It's possible to remove the root requirement for this through systemd's credentials mechanisms:

https://www.freedesktop.org/software/systemd/man/systemd.exe...
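A minimal sketch of how that can look. The unit fragment is shown as comments, and the user name, the credential ID `tls.key`, the key path, and the helper `read_credential` are all made up for illustration. `LoadCredential=` and `$CREDENTIALS_DIRECTORY` are the documented mechanism, available in systemd 247 and later.

```shell
# Unit side (config, shown as comments; names and paths are hypothetical):
#   [Service]
#   User=www-data
#   LoadCredential=tls.key:/etc/ssl/private/example.key
#
# The service manager reads the root-only key file and exposes a copy to
# this service only, in the directory named by $CREDENTIALS_DIRECTORY.

# Hypothetical helper for the service's start script: print one credential.
# Aborts with a message if the service was not started with credentials.
read_credential() {
  cat -- "${CREDENTIALS_DIRECTORY:?not started with credentials}/$1"
}
```

The service itself never needs root: it only reads `$CREDENTIALS_DIRECTORY/tls.key` as its own unprivileged user.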


I seriously spent a few hours trying to find a way to do this and didn't once come across this mechanism. Now that I've read it, I'm still not sure how to use it. Why is systemd so averse to writing examples?


I find this blogpost is a great example of code paired with systemd configuration for the certs:

https://mgdm.net/weblog/systemd/


Which is the right way to have it link to a proper credential management system via tokens.


And I think it's common for the log files to be in /var/log/httpd owned by root, but I suppose they could be moved and chown-ed.


Using systemd's LogDirectory= directive will fully take care of ensuring the required directory is present and permissions match the defined User=/Group= of the unit.


The correct systemd solution would be to create a socket unit, but your solution works without modifying the service code.


I can't find any officially supported way for Apache or Nginx to use inetd/systemd socket activation, but it certainly would be nice.


I think this requires support from the service, no?

Not everything that wants to open up a port seems to support socket activation. I tried with 6tunnel and couldn't get it to work.


systemd provides a workaround in the form of systemd-socket-proxyd [1]. Granted, it copies data, but when maximum performance is not required, it works. Some services can be configured to listen on a UNIX socket path; systemd-socket-proxyd then lets you drop their IP listener entirely by forwarding the network socket to that path.

[1] https://www.freedesktop.org/software/systemd/man/systemd-soc...
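A sketch of that setup, modeled on the example in the systemd-socket-proxyd man page. The unit names (`app-proxy.socket`, `app-proxy.service`), the backend `app.service`, and the socket path `/run/app/app.sock` are hypothetical, and the location of the proxyd binary can differ between distributions.

```shell
# Print the socket unit (hypothetical name: app-proxy.socket).
# systemd itself binds port 80 and hands the listener to the proxy.
print_proxy_socket() {
  cat <<'EOF'
[Socket]
ListenStream=80

[Install]
WantedBy=sockets.target
EOF
}

# Print the matching service (hypothetical name: app-proxy.service).
# It forwards every accepted connection to the backend's UNIX socket.
print_proxy_service() {
  cat <<'EOF'
[Unit]
Requires=app.service
After=app.service
Requires=app-proxy.socket
After=app-proxy.socket

[Service]
ExecStart=/usr/lib/systemd/systemd-socket-proxyd /run/app/app.sock
EOF
}
```

The two functions only print unit text; writing it under /etc/systemd/system/ and running `systemctl daemon-reload` is left to the reader.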


Some applications like nginx or php-fpm want to run worker processes under a different user account to isolate them from the main process. In those cases user namespaces are the best option. Too bad PrivateUsers= in systemd unit just does not cover this case.


These are workarounds and not good practices


It only needs root to bind to privileged ports, I believe. You should be able to use a non-root user and give it CAP_NET_BIND_SERVICE:

[Service]

AmbientCapabilities=CAP_NET_BIND_SERVICE


Cool! But then I suppose the forked processes could then bind to a low numbered port - something they can't do now. So Apache would have to make sure to revoke that capability when forking.


You could combine it with something like this

  SocketBindDeny=any
  SocketBindAllow=tcp:80
  SocketBindAllow=tcp:443
These ports should be denied by the kernel because they're already taken by httpd, and all others will be denied by the BPF filters installed by systemd.

It feels like plugging holes in a dam, but that's what you do with popular operating systems.
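Put together with the capability approach from earlier in the thread, a drop-in could look like the sketch below. The user name `www-data` and the function name `print_bind_dropin` are placeholders, and `SocketBindAllow=`/`SocketBindDeny=` need systemd 249 or later.

```shell
# Print a hypothetical drop-in that lets an unprivileged httpd bind only
# to ports 80 and 443.
print_bind_dropin() {
  cat <<'EOF'
[Service]
# Run unprivileged from the start (placeholder user name).
User=www-data
# Grant only the capability needed for ports below 1024.
AmbientCapabilities=CAP_NET_BIND_SERVICE
CapabilityBoundingSet=CAP_NET_BIND_SERVICE
# Deny bind() everywhere except the two web ports.
SocketBindDeny=any
SocketBindAllow=tcp:80
SocketBindAllow=tcp:443
EOF
}
```

The output could be saved as a drop-in for the service (for example via `systemctl edit`), after which forked workers inherit the same bind restrictions as the main process.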


hello.

i added support for httpd to support systemd socket activation in 2013: https://svn.apache.org/viewvc?view=revision&revision=1511033

httpd can start as non-root, assuming other configurations like the access / error logs are writable by the non-root user.


Many applications don't need to bind the port themselves but will happily accept one passed to them during process invocation.

This lets systemd manage ports using socket units, which also stay up and buffer requests while a service restarts, and allows service activation on demand (on incoming requests) or per-connection service instances, e.g. for better isolation of sshd per connection/user.


This is one of the main problems with "whole program sandboxes". Many times a program only needs permissions right at the start and then never again. From the outside though there's no way to signal "OK, I'm done, lock me down" for most sandboxing systems.

One approach that may work with systemd is to have two processes. One would be a broker, running as root. It would grab a port, for example. The other process would be spawned by the broker as a limited service and inherit that port from the parent, with no permissions of its own to open it, only to inherit.

IDK how to express that in systemd-land though. At that point you might be better off just writing the code to sandbox things yourself.


Not meant to be a snarky comment, but a serious question: how does this differ from SELinux?


They are completely different things, and where available should be used together.

SELinux is a policy system where policy is enforced via labels.

Labels are applied to processes which classify what the process is.

Labels are applied to files which define what classification of process can access the file.

The application of labels happens automatically based on policy. Such policy would include the location of the file or the label of the parent process.

As an example, the default policy for httpd would prevent httpd from accessing /etc/passwd even though the process is running as (or can be) the root user. I believe you could also do interesting things like prevent httpd from opening a socket on a non-standard port if you wanted to.

SELinux is very powerful but complicated. Ideally you use this with distro packages which should have policies already configured for you.

Critically it is not one vs the other. Use both if you have it.


It's vastly simpler, for one thing. SELinux is basically a weird DSL/ programming language for describing system interactions whereas systemd is providing a very basic interface for common restrictions.

I would pretty much never ask a human being to write SELinux policies unless that was explicitly part of their job whereas I can pretty much point any developer to what systemd is providing and they'll be able to work with it.


It seems to be using mostly the linux capabilities: https://man7.org/linux/man-pages/man7/capabilities.7.html

So the overlap choice seems to be more around SELinux versus Capabilities. Where SELinux is more fine-grained and tunable, but more complicated also.


It's not just Linux capabilities; on their own Linux capabilities actually suck majorly and are very limited (AKA "crapabilities"). But systemd also makes extensive usage of cgroups and namespacing facilities to back it up e.g. preventing runaway memory/CPU quotas and stopping applications from accessing paths they shouldn't, restricting network access, stuff like that. Some of this overlaps with SELinux (e.g. restricting file access) but the mechanism is fairly different.

The overlap/comparison between capabilities, systemds features, and selinux features isn't really well defined in any meaningful way IMO. It's really like 5 different features being used in various ways.


I’m curious what you mean by SELinux features not being well-defined? While poorly documented, they are extraordinarily precisely defined, allowing fine-grained control of pretty much everything, all enforced by the kernel with no workarounds, at least in enforcing mode.


SELinux is designed as "mandatory access control," meaning that it is not normally disabled.

The normal filesystem permissions of read/write/execute for user/group/other are among those known as "discretionary access controls," meaning that they can be relaxed.

The systemd unit security options are discretionary, at the control of the administrator.


Is SELinux not also in the administrator's control?


An administrator can disable it completely with "setenforce 0" and restore it with "setenforce 1" if necessary.

The rules can also be adjusted, and there are a number of tunable parameters.

The intent is that it is never disabled.


These days, systemd is better/easier to sandbox _services_ than SELinux. SELinux/AppArmor is still the best way to protect individual GUI and user apps (anything not run from systemd, basically).


I don't have much experience with SELinux, but at least in my org the base policy is to run anything started interactively by the user (or root) in unconfined_t, i.e. with protections disabled.

That is, the same command that gets denied by SELinux through systemd will run fine (and unprotected) when started from a shell.

Do you write your own policies for individual end-user programs?


Why not use both? They are not mutually exclusive.


Why would you use SELinux along with systemd? Systemd can do filesystem permissions declaratively vs SELinux having to label the files individually, e.g.:

[Service]

ProtectSystem=strict

ReadWritePaths=/some/path

ReadOnlyPaths=/some/otherpath

InaccessiblePaths=/etc


One can write extraordinarily short FC files using regexp to apply specific SELinux labels as desired, and control access to those labels with only a few rules.

Unlike systemd, they then apply to everything.


Easier, maybe. Better, nope. The breadth and detail available just don't compare, and not in the way where systemd can even touch the scope available to SELinux.


SELinux is more capable in theory but so much less usable/discoverable in practice that I suspect anybody who isn't truly dedicated to doing SELinux right will end up averaging better security via the systemd route.

(and I say this based on both observation and personal experience, I have some stuff to harden later this year and I'm really hoping I'll be able to involve somebody who -has- that level of SELinux knowledge but plan B is almost certainly going to be 'mst does his best with the unit configs')


As someone who does a fair amount of SELinux professionally, I’d mostly agree with this: getting started can be daunting, so one could likely get far more value from a short time focusing on systemd security.

But if one can spare the time, SELinux can secure everything, not just systemd services.

It all depends on the threat vectors one faces.


That's why I won't even try to suggest SELinux is easier. It's definitely easier to apply some sandboxing through systemd, but it's pretty coarse grained and mostly seems to hit some relatively easy wins involving capabilities dropping and stuff that is often hidden deep inside PAM. Good start, but I wouldn't call it "better" ultimately.


Can you expand on that? In my opinion, systemd has far more controls for process security over SELinux (networking, cgroups, nspawn sandboxing, etc).


Out of those, the only things that aren't covered by SELinux are things that would be expected to be set by wrapper/launcher process (modifying namespaces - which covers nspawn and setting cgroups). Everything else, i.e. actual run-time access decisions, is more fine grained and controllable through SELinux, including level of access control like whether a program can listen on a socket or bind a socket, while still permitting it to connect.


Who's gonna write THE holy-grail analyzer that examines many executables to determine which Linux capabilities, cgroups, and syscalls are actually being referenced?

Caveat: it has to dig into ALL the linked libraries as well.


Is there any advice for working with older systemd versions? Right off the bat, systemd 237 is out because that version of systemd-analyze has no security command.


Use the same config you'd use for the latest systemd version. It will ignore flags it doesn't know (and warn you in unit logs).


Great article :) thank you!


can you limit outbound network access to specified masks/ports/devices on a per-service level?


Does your box still touch local dns before connecting to VPN? No?

Then anything with systemd and security can stuff it.


What?



