Do you guys expose Prometheus metrics or how is monitoring supposed to be done (...

superboum · on Feb 8, 2022

Currently, we have no elegant way to achieve what you want.

When failures occur, repair is done by workers that log when they launch, when they repair chunks, and when they exit. We also have `garage status` and `garage stats`. The first command displays healthy and unhealthy nodes; the second displays the queue lengths of our tables and chunks. If those values are greater than zero, the cluster is being repaired. Failure recovery is covered in our documentation: https://garagehq.deuxfleurs.fr/documentation/cookbook/recove...

In the near future, we plan to integrate OpenTelemetry, but we are still discussing the design and which information we want to track and report. These discussions are happening in our issue tracker: https://git.deuxfleurs.fr/Deuxfleurs/garage/issues/111 https://git.deuxfleurs.fr/Deuxfleurs/garage/issues/207

If you have some knowledge/experience on this subject, feel free to share it in these issues.

lxpz · on Feb 8, 2022

Exposing Prometheus metrics is ongoing work that we haven't had much time to advance yet. For now, replication progress and overall cluster health can be checked by inspecting Garage's state from the command line.
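
As a stopgap, that command-line inspection could be scripted. Below is a minimal sketch, not an official tool: it assumes the `garage` binary is on the PATH and already configured to reach the node, and it only collects the raw text of `garage status` and `garage stats` without parsing it, since the output format isn't specified in this thread. The names `run_cli`, `snapshot` and `watch` are made up for illustration.

```python
import subprocess
import time
from typing import Callable, Dict, List


def run_cli(args: List[str]) -> str:
    """Run a command and return its stdout as text (raises on non-zero exit)."""
    result = subprocess.run(args, capture_output=True, text=True, check=True)
    return result.stdout


def snapshot(run: Callable[[List[str]], str] = run_cli) -> Dict[str, str]:
    """Collect the raw output of the two commands mentioned above.

    `run` is injectable so the function can be exercised without a cluster.
    """
    return {
        "status": run(["garage", "status"]),  # healthy / unhealthy nodes
        "stats": run(["garage", "stats"]),    # table and chunk queue lengths
    }


def watch(interval_s: float = 60.0, rounds: int = 10) -> None:
    """Print a timestamped snapshot every `interval_s` seconds, `rounds` times."""
    for _ in range(rounds):
        snap = snapshot()
        print(time.strftime("%Y-%m-%d %H:%M:%S"))
        print(snap["status"])
        print(snap["stats"])
        time.sleep(interval_s)
```

Calling `watch()` prints both outputs once a minute; the queue lengths still have to be read by eye, or with a parser written against the actual output of your Garage version.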