
Troubleshooting Systemd with SystemTap - lelf
https://blog.janestreet.com/troubleshooting-systemd-with-systemtap/
======
jka
There's a chance that there could be a useful intersection between 'mature,
useful but risky troubleshooting tool' (SystemTap) and 'modern, resource-
bounded tracing interface' (BPF) here.

SystemTap has implemented prototype support[1] for BPF, although I'd expect
it's not yet widely used or tested.

BPF itself has had a bit of a history of security issues[2] but it appears to
be stabilizing, and can theoretically be a lot safer than approaches like
SystemTap which require runtime kernel-level code to be inserted.

[1] - https://sourceware.org/git/?p=systemtap.git;a=blob;f=README;h=e14cdf97abe13a470d9fa674a34fb7bc5f49ba3c;hb=HEAD#l110

[2] - https://cve.mitre.org/cgi-bin/cvekey.cgi?keyword=BPF

~~~
znpy
The problem is probably Ubuntu.

SystemTap works beautifully under RHEL/CentOS, but it's a pain to make it work
on Ubuntu.

This has slowed down adoption by a lot.

We could have had dynamic tracing in GNU/Linux a lot earlier if Ubuntu had
done a better job at supporting it.

This is no surprise though: Canonical has often refused to adopt stuff that
has been thoroughly tested and vetted in the RHEL world: think of firewalld
vs ufw. Or SELinux vs AppArmor. Or Cockpit, which has no equivalent in the
Canonical world afaik.

~~~
simion314
>This is no surprise though: Canonical has often refused to adopt stuff that
has been thoroughly tested and vetted in the RHEL world: think of firewalld
vs ufw. Or SELinux vs AppArmor. Or Cockpit, which has no equivalent in the
Canonical world afaik.

Ubuntu is based on Debian. If Red Hat stuff is not in Debian, it is hard to
blame Canonical rather than Debian, but do explain the mental gymnastics
behind this blame shifting.

~~~
IntelMiner
Ubuntu is based on Debian, but /is/ not Debian

It's worth noting that Canonical also has a pretty well-known history of being
a "bad player" in the wider community. They rarely upstream patches, break
things and very much try to "do their own thing"

The more cynical of us think it's Canonical attempting to segment Ubuntu off
from the wider Linux ecosystem to force some kind of "vendor lock-in" by
virtue of being the "newbie friendly" Linux option

Some of the more egregious examples of things they forked were "small" pieces
of the stack. Like the entire fucking desktop environment (Unity) and display
server (Mir). These relied on things like Ubuntu-specific forks of GTK and
other critical components, which is why those technologies never appeared in
other distributions

Anecdotally I've also been told by people who work at Red Hat that a lot of
Canonical employees were picked up about a year ago after being terminated
(allegedly) for attempting to upstream the code they worked on

Feel free to wave it off as "mental gymnastics" or "blame shifting" but I do
genuinely believe that Canonical has an intentionally poisonous effect on the
wider Linux ecosystem

~~~
simion314
Unity was not a fork but a new project, so use a bit of logic and less emotion
and honestly answer this: "Who forked GNOME 2 and GNOME 3?" There are a ton of
GNOME forks, yet if Canonical builds its own DE it is evil, but if someone
forks GNOME into MATE, Elementary, Cinnamon, etc. it is cool.

I think that unless Canonical removed a lot of Red Hat tech from Debian and
replaced it with their own, you are just complaining that Debian is not a big
fan of RH tech.

------
qplex
Has anyone had positive experiences with systemd? Success stories because of
feature x?

I have a pretty grim personal opinion of it, based on experiences similar to
this story. I'd rather just use a simpler approach (another init) that _works_
and doesn't involve hours of debugging. I've never had any trouble with
sysv-style init, for example.

~~~
jandrese
When Systemd is working it is a lot more regular than the old init systems.
You have nice tools like nmcli that combine the functionality of literally
dozens of programs into a single interface.

The problem comes when you discover that the Systemd version of a service is
missing one crucial feature you need or is buggy, so you need to run an
external service instead and it starts fighting you tooth and nail. Or when
something breaks and tracking down the issue is hindered by it being hidden
behind a bunch of layers of indirection.

For example, try configuring an Ubuntu 16 box to randomize its MAC address
before searching for WiFi access points.

~~~
qplex
>randomize its MAC address before searching for WiFi access points

Indeed, things like start order and dependencies can be completely broken on
systemd. Many problems arise from incompatibilities with legacy systems and
interfaces.

I had a big headache last time I had to configure systemd to wait for dhcp
before running other stuff because the systemd targets for networking were not
working - part of the problem was a legacy interface for network
configuration.
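
For reference, the conventional way to ask for this ordering is a drop-in that pulls in and orders a unit after network-online.target. Below is a minimal Python sketch that only renders the drop-in text; the function name is hypothetical, and it assumes a wait-online service (for example NetworkManager-wait-online.service or systemd-networkd-wait-online.service) is enabled, since otherwise the target is reached without anything actually waiting for the network. It may well not help with a legacy network configuration like the one described above.

```python
def network_online_dropin() -> str:
    """Render a drop-in that orders a unit after network-online.target.

    Hypothetical helper for illustration. Wants= pulls the target into the
    boot transaction; After= makes the unit wait until the target is reached.
    """
    lines = [
        "[Unit]",
        "Wants=network-online.target",
        "After=network-online.target",
    ]
    return "\n".join(lines) + "\n"
```

The rendered text would typically be saved as a .conf file under /etc/systemd/system/<unit>.d/ and picked up after `systemctl daemon-reload`.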

------
pas
Before bringing in the big guns, I'd recommend just whipping out strace -p 1
-s 1000 -tt -T (or something like that). Chances are it'll show something
that'll get you closer to solving whatever issue you have.

SystemTap and other kernel instrumentation stuff are of course invaluable
sometimes, but I just find strace easier, even if it requires looking at and
divining meaning from a lot of mystical output lines. :)
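
As a rough illustration of taming that output, here is a minimal Python sketch that totals the time spent per system call. It assumes the -tt and -T flags from the command above, so each completed line starts with a wall-clock timestamp and ends with the call duration in angle brackets. The name summarize_strace and the regular expression are illustrative assumptions, not part of strace.

```python
import re
from collections import defaultdict

# Matches completed strace lines produced with -tt -T, which look roughly like
#   12:00:00.000001 read(3, "", 4096) = 0 <0.000012>
# Unfinished/resumed calls, signals and exit lines do not match and are skipped.
LINE_RE = re.compile(
    r"^(?:\[pid\s+\d+\]\s+|\d+\s+)?"   # optional pid prefix (seen when following children)
    r"\d{2}:\d{2}:\d{2}\.\d+\s+"       # -tt timestamp
    r"(?P<name>\w+)\(.*\)\s+=\s+.*"    # syscall name, arguments, return value
    r"<(?P<secs>\d+\.\d+)>\s*$"        # -T duration in seconds
)


def summarize_strace(lines):
    """Return (syscall, count, total_seconds) tuples, slowest total first."""
    totals = defaultdict(lambda: [0, 0.0])
    for line in lines:
        match = LINE_RE.match(line)
        if match is None:
            continue  # not a completed syscall line
        entry = totals[match.group("name")]
        entry[0] += 1
        entry[1] += float(match.group("secs"))
    rows = [(name, count, secs) for name, (count, secs) in totals.items()]
    return sorted(rows, key=lambda row: row[2], reverse=True)
```

One way to use it is to add -o trace.log to the strace command, stop tracing after a while, and then call summarize_strace(open("trace.log")) to see which calls dominate.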

That said, the number of processes, DBus messages, and system calls that
happen even on an idle system these days is crazy. (On Windows it's even
worse, something is doing I/O all the time. Diagnostic that, ETW this [event
tracing for windows], you can't stop most of them.) There is just too much
broadcasting going on. And it's especially maddening on platforms that provide
subscription for events. (Don't broadcast the same signal over and over. Let
new clients get the current state and wait.)

~~~
jandrese
Strace is of no help in discovering why your bootup is stalled out because one
of the processes didn't receive a message it was expecting. Who was supposed
to emit the message? Who knows? Why wasn't it sent? Sucks to be you. Unlike
the old init days you can't just look at the previous lines in the script to
figure out what went wrong, you have to understand the entire system before
you can start debugging it.

~~~
pas
Have you encountered that? Since systemd I have more confidence in issuing
reboots, but of course if it goes sideways it seems harder to debug, exactly
because I have not much experience with that part.

I enabled various debug levels for it (it can spew out an ungodly amount of
data, but exactly what you mentioned is missing, the "what are we waiting for"
info) - I had a problem with shutdown taking too long.
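
One place where part of that "what are we waiting for" information may be visible is systemd's job queue, which `systemctl list-jobs` prints. Here is a minimal Python sketch that parses it, assuming the usual four whitespace-separated columns (JOB, UNIT, TYPE, STATE) and the --no-legend flag; the helper names are hypothetical, and it only helps while you can still get a shell on the machine.

```python
import subprocess


def parse_jobs(text):
    """Parse `systemctl list-jobs` output into a list of dicts.

    Assumes four whitespace-separated columns: JOB UNIT TYPE STATE.
    Header, footer and any other non-matching lines are ignored.
    """
    jobs = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 4 or not parts[0].isdigit():
            continue
        job_id, unit, job_type, state = parts
        jobs.append({"id": int(job_id), "unit": unit, "type": job_type, "state": state})
    return jobs


def running_jobs(jobs):
    """Jobs currently executing; 'waiting' jobs are queued behind these."""
    return [job for job in jobs if job["state"] == "running"]


def list_jobs():
    """Query the live job queue (needs a system running systemd)."""
    result = subprocess.run(
        ["systemctl", "list-jobs", "--no-legend"],
        check=True, capture_output=True, text=True,
    )
    return parse_jobs(result.stdout)
```

Calling running_jobs(list_jobs()) during a slow boot or shutdown gives a short list of units that are still in progress, which is usually a better starting point than the full debug log.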

------
nisa
While this writeup is really cool, I'm still asking myself as a sysadmin
whether all the complexity in the wider systemd ecosystem is needed to meet
modern requirements, or whether it would be possible to have a reactive,
event-based system that's easier to debug/introspect. Despite working on Linux
systems for more than 10 years, I wouldn't be able to debug such an issue the
way the author did in the article, and I doubt most other sysadmins have the
time, patience and knowledge to deep dive into the system to debug something
that shouldn't be a problem in the first place...

That being said, is there a good writeup / deep dive into systemd concepts?
Something like a book would be really cool. The official docs are quite
spread out, and it's either rather simple high-level stuff or basically: read
the source.

------
znpy
Systemtap is pure gold.

Too bad it basically doesn't work in Ubuntu.

~~~
tankenmate
don't the instructions at
https://wiki.ubuntu.com/Kernel/Systemtap
work? i'm not asking sarcastically, i haven't tried the instructions but i
think i will try it in the future.

~~~
znpy
Last time I tried (January/February IIRC) no, they did not.

------
xfs
So, accidentally quadratic?

