I am using systemd on my machine and try to configure most things through it. For example, I have a backup job that is triggered by a timer.
I want to know when that job fails so I can investigate and fix it. Over time, I've had multiple solutions for this:
- Send a notification via `notify-send`
- Add `systemctl --failed` to my shell startup script
- Send myself emails
None of these is quite ideal. Notifications disrupt my current workflow and are ephemeral, so I might forget about a failure if I don't deal with it immediately.
Similarly, seeing the output of `systemctl --failed` in every new terminal is disruptive, but at least it keeps me from forgetting.
Neither of these really applies to server systems, either.
Sending myself emails feels a bit wrong but has so far been the best solution.
How are other people solving this? I did some research and I am surprised that there isn't a more well-rounded solution. I'd expect pretty much every Linux user to run into this problem.
Long answer: Whenever I've added alerting and monitoring to a system, I've ended up wanting to monitor more things over time, so I find it valuable to start with an extensible system from the beginning. For me, Prometheus has been the best option: it is easy to configure, lightweight, doesn't even need to run on the monitored host, and can monitor multiple systems. You just have to configure which exporters it should pull data from. In this case, the Prometheus Node Exporter (`node_exporter`) exposes a large number of system metrics, including systemd unit state once its systemd collector is enabled with `--collector.systemd`, and there are ready-made alert rules and dashboards out there that will help you set up basic monitoring in minutes.
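To make the data concrete, here is a minimal Python sketch of what "a failed unit" looks like on the exporter side. It assumes `node_exporter` runs with `--collector.systemd` on its default port 9100 and exposes the `node_systemd_unit_state` gauge with `name` and `state` labels. The function names and the URL are illustrative, not part of any Prometheus tooling, and in a real setup a Prometheus alert rule would do this check instead of a script.

```python
import re
from urllib.request import urlopen

# A sample line in the exporter's text output looks like:
# node_systemd_unit_state{name="backup.service",state="failed",type="oneshot"} 1
_SAMPLE = re.compile(r'^node_systemd_unit_state\{(?P<labels>[^}]*)\}\s+(?P<value>\S+)')
# Simplification: assumes label values contain no escaped double quotes.
_LABEL = re.compile(r'(\w+)="([^"]*)"')


def failed_units(metrics_text):
    """Return the sorted names of units whose "failed" state gauge is 1."""
    failed = set()
    for line in metrics_text.splitlines():
        match = _SAMPLE.match(line)
        if not match:
            continue  # comments, blank lines, and unrelated metrics
        labels = dict(_LABEL.findall(match.group("labels")))
        if labels.get("state") == "failed" and float(match.group("value")) == 1.0:
            failed.add(labels.get("name", "<unknown>"))
    return sorted(failed)


def fetch_failed_units(url="http://localhost:9100/metrics"):
    """Scrape the exporter once (assumed URL) and list the failed units."""
    with urlopen(url, timeout=5) as response:
        return failed_units(response.read().decode("utf-8"))
```

Calling `fetch_failed_units()` on a host with the exporter running should return something like the list that `systemctl --failed` shows, which is the same condition an alert on this metric would fire on.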
You can use Grafana for visualization, and then either Grafana's integrated alerting or Prometheus alerting rules together with Prometheus Alertmanager. I think recent versions of Grafana Alerting basically embed an Alertmanager, so they should have the same features.
Regarding the type of alert itself, I send myself emails for persistence and reminders, plus Telegram messages for instant notifications. I find that the best combination, tbh.
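For illustration, the two channels could be sketched in Python roughly like this. This is not necessarily how the setup above is wired (an alerting component would normally send these for you). It assumes an SMTP server reachable at `smtp_host` and a Telegram bot token and chat ID you already have. All function names and the message wording are hypothetical. The only external API used is the Telegram Bot API `sendMessage` method.

```python
import json
import smtplib
from email.message import EmailMessage
from urllib.request import Request, urlopen


def build_email(unit, sender, recipient):
    """Build the persistent reminder: a plain-text email naming the failed unit."""
    message = EmailMessage()
    message["Subject"] = f"systemd unit failed: {unit}"
    message["From"] = sender
    message["To"] = recipient
    message.set_content(f"{unit} entered the failed state. Check: journalctl -u {unit}")
    return message


def build_telegram_request(token, chat_id, text):
    """Build the instant notification: a POST to the Bot API sendMessage method."""
    payload = json.dumps({"chat_id": chat_id, "text": text}).encode("utf-8")
    return Request(
        f"https://api.telegram.org/bot{token}/sendMessage",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def notify(unit, sender, recipient, token, chat_id, smtp_host="localhost"):
    """Send both notifications; needs network access and a reachable SMTP server."""
    with smtplib.SMTP(smtp_host) as smtp:
        smtp.send_message(build_email(unit, sender, recipient))
    request = build_telegram_request(token, chat_id, f"{unit} failed")
    with urlopen(request, timeout=10) as response:
        response.read()
```

Keeping the message construction separate from the sending makes it easy to check the content without touching the network. The email stays in the inbox as the reminder, while the Telegram message is the one that arrives immediately.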