This is a getting started/101 introduction; it also talks about and recommends systemd-analyze security. There’s a link to part two at the bottom of the article that goes deeper into things.
I haven't seen any difference between distributions with the same systemd version. Anything with a recent one should do fine. More recent than RHEL8, mind you (which is on systemd 239): for example, a syscall allow/deny analysis is buggy there and asks you to enable some protections, and then disable them. The same unit is analyzed correctly on my desktop with v250 (I use the popular rolling release distribution).
I haven't seen anything like audit2allow. It's probably not especially necessary because of the difference in philosophies: SELinux is deny by default, while in systemd you're playing whack-a-mole anyway, and are expected to add directives one by one until the application stops working. Unit logs usually make it obvious if something was denied.
The usual way I've seen (and do myself) is to just let the process be killed and have its coredump taken, then `coredumpctl gdb $process_name -A '-ex "print $rax" -ex "quit"'` to get the syscall number, then check `systemd-analyze syscall-filter` for whether I want to allow just that one syscall or the whole group it's in.
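The last step (finding which group a syscall belongs to) can be scripted. A minimal sketch, assuming the text printed by `systemd-analyze syscall-filter` is laid out as unindented `@group` headers followed by indented member lines, with `#` description lines in between. The helper name `groups_containing` is made up for illustration, the sample text is a hand-abbreviated excerpt, and nested `@group` references are not expanded.

```python
def groups_containing(syscall, listing):
    """Return the @groups whose member list names `syscall` directly.

    `listing` is text in the assumed layout: an unindented "@group"
    header followed by indented member lines.
    """
    found = []
    current = None
    for line in listing.splitlines():
        if not line.strip():
            continue  # blank separator between groups
        if not line[0].isspace():
            # An unindented line starts a new group (or is unrelated text).
            current = line.strip() if line.startswith("@") else None
            continue
        member = line.strip()
        if member.startswith("#"):
            continue  # description line, not a syscall
        if current and member == syscall:
            found.append(current)
    return found


# Hand-abbreviated sample in the assumed layout, not full tool output.
SAMPLE = """\
@aio
    # Asynchronous IO
    io_cancel
    io_submit

@basic-io
    # Basic IO
    close
    read
    write
"""

print(groups_containing("read", SAMPLE))
```

In practice `listing` would probably come from something like `subprocess.run(["systemd-analyze", "syscall-filter"], capture_output=True, text=True).stdout`. Mapping the number in `$rax` to a syscall name is a separate step that this sketch does not cover.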
> The usual way I've seen (and do myself) is to just let the process be killed and have its coredump taken, then `coredumpctl gdb $process_name -A '-ex "print $rax" -ex "quit"'` to get the syscall number, then check `systemd-analyze syscall-filter` for whether I want to allow just that one syscall or the whole group it's in.
Another approach would be to set SystemCallLog= to be the opposite of SystemCallFilter= (negate each group with ~) and then you'll see the call (and caller) in the journal.
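A rough sketch of deriving the log list from the filter list, assuming `SystemCallLog=` accepts the same list syntax as `SystemCallFilter=` and that a leading `~` inverts the whole list. The function name `inverted_list` is hypothetical, and this only builds the directive text; whether the kernel actually logs filtered calls is a separate question.

```python
def inverted_list(spec):
    """Toggle the leading "~" of a systemd syscall list."""
    spec = spec.strip()
    if spec.startswith("~"):
        # Already inverted: drop the marker to get the plain list back.
        return spec[1:].lstrip()
    return "~" + spec


print("SystemCallLog=" + inverted_list("@system-service"))
```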
At least on my distro (OpenSUSE Tumbleweed, systemd 249, kernel 5.16.0), SystemCallLog doesn't fire for calls that are filtered; the process is killed first. Even if I set SystemCallErrorNumber=EPERM I don't see the audit log generated. The log only gets generated if the syscall wasn't filtered.
FWIU, e.g. sysdig is justified atop whichever MAC system.
In the SELinux MAC system on RHEL and Debian, in /etc/selinux/config, you have SELINUXTYPE=minimum|targeted|mls. RHEL (CentOS and Rocky Linux) and Fedora have SELINUXTYPE=targeted out-of-the-box. The compiled rulesets in /etc/selinux/targeted are generated when [...].
With e.g gnome-system-monitor on a machine with SELINUX=permissive|enforcing, you can right-click the column header in the process table to also display the 'Security context' column that's also visible with e.g. `ps -Z`. The stopdisablingselinux video is a good SELinux tutorial.
I'm out of date on Debian/Ubuntu's policy set, which could also probably almost just be sed'ed from the current RHEL policy set.
> *SELinux is deny by default, while in systemd you're playing whack-a-mole anyway, and are expected to add directives one by one until the application stops working. Unit logs usually make it obvious if something was denied.*
DENY if not unconfined is actually the out-of-the-box `targeted` config on RHEL and Fedora. For example, Firefox and Chrome currently run as unconfined processes. While decent browsers do do their own process sandboxing, SELinux and/or AppArmor and/or 'containers' with a shared X socket file (and drop-privs and setcap and cgroups and namespaces fwtw) are advisable atop really any process sandboxing?
Given that the task is to generate a hull of rules that allow for the observed computational workload to complete with least-privileges, if you enable like every rule and log every process hitting every rung on the way down while running integration tests that approximate the workload, you should end up with enough rule violations in the log to even dumbly generate a rule/policy set without the application developer's expertise around to advise on potential access violations to allow.
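The "dumb" generation step could look roughly like this. It is only a sketch: the `(unit, resource)` event tuples and the `syscall:`/`path:` prefixes are a made-up format for illustration, not what any real audit log emits, and turning real log lines into such tuples is the hard part left out here.

```python
from collections import defaultdict


def allow_list(events):
    """Group observed (unit, resource) pairs into a per-unit allow-list."""
    policy = defaultdict(set)
    for unit, resource in events:
        policy[unit].add(resource)  # duplicates collapse in the set
    return {unit: sorted(resources) for unit, resources in policy.items()}


# Hypothetical observations collected while running integration tests.
observed = [
    ("web.service", "syscall:openat"),
    ("web.service", "path:/srv/www"),
    ("web.service", "syscall:openat"),
    ("db.service", "path:/var/lib/db"),
]

for unit, rules in allow_list(observed).items():
    print(unit, rules)
```

The result is only as complete as the test run: anything the workload did not exercise ends up denied.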
> "Sysdig instruments your physical and virtual machines at the OS level by installing into the Linux kernel and capturing system calls and other OS events. Sysdig also makes it possible to create trace files for system activity, similarly to what you can do for networks with tools like tcpdump and Wireshark.
Alas, no whitelisting option. A service should start in an empty filesystem root without network access - and if we had something as convenient as pledge() also without any allowed syscalls - and then you could only add what is needed.
firejail does this a bit better but it also started out with a blacklist approach and it's more geared towards desktop application use, not system services.
pledge is excellent, but it protects programmers against writing security bugs that have large impact, it doesn’t protect you against the software they write. It’s those programmers who restrict what their tools can do, and who decide when to throw the switch to enable those restrictions.
If you trust those programmers, it’s indeed way more convenient than other tools, if only because it removes the need for configuring things twice. For example, instead of configuring your web server to serve files from /foo/bar/ _and_ telling SELinux that your web server is allowed to read from /foo/bar, you only configure the web server, and it will tell the OS “I shouldn’t read from anything but /foo/bar, starting … now”.
You’ll have to trust the web server to do that, though.
That's what it is intended for. But pledge has nice properties beyond that which are also useful for external sandboxing, such as easy-to-understand syscall groups that the kernel maintains as new syscalls are introduced. If Linux had that, we could for example grant stdio+rpath and not worry about the kernel introducing preadv3 and programs compiled with it breaking or performing suboptimally when isolated, and it would automatically cover the equivalent io_uring operations, blocking the corresponding SQEs too.
The problem of using all-encompassing filters to secure applications is they are crap. Take for example something using Bernstein chaining where permissions are properly separated for each process, with pledge you can restrict access, with global filters it still allows the highest privilege.
What's the problem with firejail? Start with an empty profile, blacklist everything, and whitelist only the stuff you need. It works just fine for server applications, and unlike systemd isolation flags you can set up a proper separate firewall with the `netfilter` option.
Firejail has had multiple sandbox escape vulns in the past. Firejail is an SUID executable in which sandbox escapes can lead to privilege escalation. In contrast, Systemd allows you to run services as unprivileged users, and even create users on demand.
Systemd also supports firewalling: it supports IP address allow/deny policies, ports, etc. For more advanced firewall policies you're probably better off using an actual firewall daemon like firewalld or ufw.
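To make the allow/deny policy concrete, here is a small model of the evaluation order as I understand it from the `IPAddressAllow=`/`IPAddressDeny=` documentation: an address matching the allow list is granted, otherwise one matching the deny list is denied, otherwise it is granted. The function name is made up, and this is a model of the rule, not systemd's implementation.

```python
import ipaddress


def ip_access_granted(addr, allow, deny):
    """Model the assumed order: allow list first, then deny list, else grant."""
    ip = ipaddress.ip_address(addr)

    def matches(prefixes):
        # Membership is False when the IP version and network version differ.
        return any(ip in ipaddress.ip_network(p) for p in prefixes)

    if matches(allow):
        return True
    if matches(deny):
        return False
    return True


# Deny everything over IPv4 except a private range.
allow = ["10.0.0.0/8"]
deny = ["0.0.0.0/0"]
print(ip_access_granted("10.1.2.3", allow, deny))
print(ip_access_granted("8.8.8.8", allow, deny))
```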
Not really. For example, if you have a construct like "blacklist *" and that wildcard is evaluated at construction time on some overlay filesystem, then additional entries may sneak in later from the lower layer of the overlay, because the wildcard expansion doesn't get updated.
On the other hand if you start with a blank slate filesystem root and only bind exactly the whitelisted paths then there is nothing to leak through.
There are other ways in which blacklist-all can fail to be equivalent to whitelisting.
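A toy model of the wildcard problem described above, with made-up paths and function names: the blacklist is a snapshot of what existed at setup time, so anything that appears later is not covered, while an allow-list never exposes entries it did not name.

```python
def visible_after_blacklist(entries_now, blacklist_snapshot):
    """Entries still reachable when only the snapshot is blocked."""
    return sorted(set(entries_now) - set(blacklist_snapshot))


def visible_after_allowlist(entries_now, allowlist):
    """Entries reachable when only named paths are bound in."""
    return sorted(set(entries_now) & set(allowlist))


at_setup = ["/etc", "/home", "/usr"]   # what "blacklist *" expanded to
later = at_setup + ["/mnt/secret"]     # entry that shows up afterwards

print(visible_after_blacklist(later, at_setup))
print(visible_after_allowlist(later, ["/usr"]))
```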
if you want to test these settings I can recommend `sudo systemd-run -p "DynamicUser=yes" -p "ProtectSystem=yes" -p "ProtectHome=yes" --shell`
but be in a readable directory like /tmp or you receive an error.
This is a very handy command in day-to-day work, actually. For example, I use it to limit the total amount of memory available to an application, including page cache:
It works just as you'd expect — if qbittorrent's working set goes above 1024 MiB, it pushes the least recently used page out of the page cache. Doesn't really have any effects on upload or download speeds, while helping to keep more useful data in memory.
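The exact command is not shown in the comment above, so the following is only a guess at its shape: a small helper that builds a `systemd-run` invocation for a memory-capped scope. The choice of `MemoryHigh=` (rather than `MemoryMax=`), the `--user --scope` flags, and the helper name are all assumptions.

```python
def memory_limited_command(argv, limit="1G", prop="MemoryHigh"):
    """Build an argv list that runs `argv` in a memory-limited scope.

    Assumption: the limit is applied via a resource-control property
    passed with -p; swap in prop="MemoryMax" for a hard cap.
    """
    return ["systemd-run", "--user", "--scope", "-p", f"{prop}={limit}", *argv]


print(" ".join(memory_limited_command(["qbittorrent"])))
```

The returned list can be handed to `subprocess.run` as-is. Whether the limit takes effect under the user manager likely depends on the systemd version and on the memory controller being delegated to the user session.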
Many isolation flags are not available in `systemd-run --user`, though, so if you'd like to have some protection you either have to combine `sudo systemd-run` with `su -c`, or wrap the command in firejail.
I have a bash alias for `make` and `ninja` to do something similar. Just having all the spawned processes in a cgroup helps with system interactivity while building. This works because the kernel will then schedule the whole build as a single unit against the other work on the system, rather than scheduling each process that the build spawns against every other process that I'm running.
Interesting, a few months ago I tried using systemd-run to implement unprivileged memory limits for a process and I'm pretty sure it didn't work with the user manager. Is this a recent addition? (I'm not sure what version of systemd I had at the time.)
Apache needs to start as `root` but then drops to a non-privileged user. systemd's `User=<user>` can't really express that. Perhaps an option that says a unit needs to be root until the first fork, when it has to be a specified user: `ForkUser=apache`
I don't know about httpd specifically, but many applications want root only to be able to bind to a privileged port (like :80). This can be circumvented in one of a few ways:
I seriously spent a few hours trying to find a way to do this and didn't once come across this mechanism. Now that I've read it, I'm still not sure how to use it. Why is systemd so averse to writing examples?
Using systemd's LogDirectory= directive will fully take care of ensuring the required directory is present and permissions match the defined User=/Group= of the unit.
systemd provides a workaround in the form of systemd-socket-proxyd [1]. Granted, it copies data, but when max performance is not required, it works. Some services can be configured to listen on a UNIX socket path; systemd-socket-proxyd then lets you disable direct IP address access by forwarding the network socket to that path.
Some applications like nginx or php-fpm want to run worker processes under a different user account to isolate them from the main process. In those cases user namespaces are the best option. Too bad PrivateUsers= in systemd unit just does not cover this case.
Cool! But then I suppose the forked processes could then bind to a low numbered port - something they can't do now. So Apache would have to make sure to revoke that capability when forking.
Many applications don't need to bind the port themselves but will happily accept one passed to them during process invocation.
This lets systemd manage ports using socket units, which also stay up and buffer requests while a service restarts, allow service activation on demand/incoming requests, or allow per-connection service instances, e.g. for better isolation of sshd per connection/user.
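On the service side, accepting a passed socket is a small amount of code. A minimal sketch of the environment-variable convention used by socket activation (`LISTEN_PID`, `LISTEN_FDS`, descriptors starting at 3), written by hand here rather than using a systemd binding; the function name `listen_fds` is made up.

```python
import os

SD_LISTEN_FDS_START = 3  # first inherited descriptor under this convention


def listen_fds(environ=None, pid=None):
    """Return the file descriptor numbers passed in by the service manager."""
    environ = os.environ if environ is None else environ
    pid = os.getpid() if pid is None else pid
    # The variables only apply to the process they were set for.
    if environ.get("LISTEN_PID") != str(pid):
        return []
    try:
        count = int(environ.get("LISTEN_FDS", "0"))
    except ValueError:
        return []
    return list(range(SD_LISTEN_FDS_START, SD_LISTEN_FDS_START + max(count, 0)))


print(listen_fds({"LISTEN_PID": "42", "LISTEN_FDS": "2"}, pid=42))
```

Each returned number can then be wrapped with `socket.socket(fileno=fd)`, so the service itself never needs permission to bind the port.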
This is one of the main problems with "whole program sandboxes". Many times a program only needs permissions right at the start and then never again. From the outside though there's no way to signal "OK, I'm done, lock me down" for most sandboxing systems.
One approach that may work with systemd is to have two processes. One would be a broker, running as root. It would grab a port, for example. The other process would be spawned by the broker as a limited service and inherit that port from the parent, with no permissions of its own to open it, only to inherit.
IDK how to express that in systemd-land though. At that point you might be better off just writing the code to sandbox things yourself.
They are completely different things, and where available should be used together.
SELinux is a policy system where policy is enforced via labels.
Labels are applied to processes which classify what the process is.
Labels are applied to files which define what classification of process can access the file.
The application of labels happens automatically based on policy. Such policy would include the location of the file or the label of the parent process.
As an example, the default policy for httpd would prevent httpd from accessing /etc/passwd even though the process is running as (or can be) the root user.
I believe you could also do interesting things like prevent httpd from opening a socket on a non-standard port if you wanted to.
SELinux is very powerful but complicated. Ideally you use this with distro packages which should have policies already configured for you.
Critically it is not one vs the other. Use both if you have it.
It's vastly simpler, for one thing. SELinux is basically a weird DSL/ programming language for describing system interactions whereas systemd is providing a very basic interface for common restrictions.
I would pretty much never ask a human being to write SELinux policies unless that was explicitly part of their job whereas I can pretty much point any developer to what systemd is providing and they'll be able to work with it.
It's not just Linux capabilities; on their own Linux capabilities actually suck majorly and are very limited (AKA "crapabilities"). But systemd also makes extensive use of cgroups and namespacing facilities to back it up, e.g. enforcing memory/CPU quotas to stop runaway usage, stopping applications from accessing paths they shouldn't, restricting network access, stuff like that. Some of this overlaps with SELinux (e.g. restricting file access) but the mechanism is fairly different.
The overlap/comparison between capabilities, systemd's features, and SELinux features isn't really well defined in any meaningful way IMO. It's really like 5 different features being used in various ways.
I’m curious what you mean by SELinux features not being well-defined? While poorly documented, they are extraordinarily precisely defined, allowing fine-grained control of pretty much everything, all enforced by the kernel with no workarounds, at least in enforcing mode.
SELinux is designed as "mandatory access control," meaning that it is not normally disabled.
The normal filesystem permissions of read/write/execute for user/group/other are among those known as "discretionary access controls," meaning that they can be relaxed.
The systemd unit security options are discretionary, at the control of the administrator.
These days, systemd is better/easier for sandboxing _services_ than SELinux. SELinux/AppArmor is still the best way to protect individual GUI and user apps (anything not run from systemd, basically).
I don't have much experience with SELinux, but at least in my org the base policy is to run anything started interactively by the user (or root) in unconfined_t, i.e. with protections disabled.
That is, the same command that gets denied by SELinux through systemd will run fine (and unprotected) when started from a shell.
Do you write your own policies for individual end-user programs?
Why would you use SELinux along with systemd? Systemd can do filesystem permissions declaratively vs SELinux having to label the files individually, e.g.:
One can write extraordinarily short FC files using regexp to apply specific SELinux labels as desired, and control access to those labels with only a few rules.
Easier, maybe. Better, nope. The breadth and detail available just don't compare, and not in the way where systemd can even touch the scope available to SELinux.
SELinux is more capable in theory but so much less usable/discoverable in practice that I suspect anybody who isn't truly dedicated to doing SELinux right will end up averaging better security via the systemd route.
(and I say this based on both observation and personal experience, I have some stuff to harden later this year and I'm really hoping I'll be able to involve somebody who -has- that level of SELinux knowledge but plan B is almost certainly going to be 'mst does his best with the unit configs')
As someone who does a fair amount of SELinux professionally, I’d mostly agree with this: getting started can be daunting, so one could likely get far more value from a short time focusing on systemd security.
But if one can spare the time, SELinux can secure everything, not just systemd services.
That's why I won't even try to suggest SELinux is easier. It's definitely easier to apply some sandboxing through systemd, but it's pretty coarse grained and mostly seems to hit some relatively easy wins involving capabilities dropping and stuff that is often hidden deep inside PAM. Good start, but I wouldn't call it "better" ultimately.
Out of those, the only things that aren't covered by SELinux are things that would be expected to be set by wrapper/launcher process (modifying namespaces - which covers nspawn and setting cgroups). Everything else, i.e. actual run-time access decisions, is more fine grained and controllable through SELinux, including level of access control like whether a program can listen on a socket or bind a socket, while still permitting it to connect.
Who's gonna write THE holy-grail analyzer that goes through many executables to determine which Linux capabilities, cgroups, and syscalls are actually being referenced?
Caveat: it has to dig into ALL the linked libraries as well.
Is there any advice for working with older systemd versions? Right off the bat, systemd 237 is out because there is no security feature for that version of systemd-analyze.
https://news.ycombinator.com/item?id=29976096
Or simply follow whatever `systemd-analyze security` recommends, just make sure you run it on a system with recent systemd.