We've been developing the BlueWave Uptime Manager [1] for the past 5 months with a team of 7 developers and 3 external contributors, and until now we have stayed under the radar.
As we move towards expanding from basic uptime tracking to a comprehensive monitoring solution, we're interested in getting insights from the community.
For those of you managing server infrastructure:
- What are the key assets you monitor beyond the basics like CPU, RAM, and disk usage?
- Do you also keep tabs on network performance, processes, services, or other metrics?
Additionally, we're debating whether to build a custom monitoring agent or leverage existing solutions like OpenTelemetry or Fluentd.
- What’s your take: would you trust a simple, bespoke agent, or would you feel more secure with a well-established solution?
- Lastly, what’s your preference for data collection: an agent that the monitoring system pulls data from, or one that pushes data to it?
[1] https://github.com/bluewave-labs/bluewave-uptime
* Network is another basic that should be there
* Average disk service time
* Memory is tricky (even MemAvailable can miss important anonymous memory pageouts with a mistuned vm.swappiness), so also monitor swap page out rates
* TCP retransmits as a warning sign of network/hardware issues
* UDP & TCP connection counts by state (for TCP: established, time_wait, etc.) broken down by incoming and outgoing
* Per-CPU utilization
* Rates of operating system warnings and errors in the kernel log
* Application average/max response time
* Application throughput (both total and broken down by the error rate, e.g. HTTP response code >= 400)
* Application thread pool utilization
* Rates of application warnings and errors in the application log
* Application up/down with heartbeat
* Per-application & per-thread CPU utilization
* Periodic on-CPU sampling for a short window, rendered as a flame graph
* DNS lookup response times/errors
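To make two of the items above concrete (swap page-out rate and TCP retransmits), here is a minimal sketch of how an agent on Linux might derive them from `/proc/vmstat` and `/proc/net/snmp`. The function names (`parse_vmstat`, `parse_snmp_tcp`, `rate`, `sample_rates`) and the 5-second interval are illustrative assumptions, not anything from BlueWave or an existing agent; only the `pswpout` and `RetransSegs` counter names come from the kernel.

```python
from __future__ import annotations

import time


def parse_vmstat(text: str) -> dict[str, int]:
    """Parse /proc/vmstat-style 'name value' lines into a dict of counters."""
    counters = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            counters[parts[0]] = int(parts[1])
    return counters


def parse_snmp_tcp(text: str) -> dict[str, int]:
    """Parse the pair of 'Tcp:' lines (header row, value row) in /proc/net/snmp."""
    rows = [line.split()[1:] for line in text.splitlines() if line.startswith("Tcp:")]
    if len(rows) != 2:
        return {}
    names, values = rows
    return {name: int(value) for name, value in zip(names, values)}


def rate(prev: int, curr: int, seconds: float) -> float:
    """Per-second rate between two readings of a monotonically increasing counter.

    Returns 0.0 if the counter went backwards (e.g. after a reboot).
    """
    if seconds <= 0 or curr < prev:
        return 0.0
    return (curr - prev) / seconds


def read_counters() -> tuple[int, int]:
    """Read the current swap page-out and TCP retransmit counters (Linux only)."""
    with open("/proc/vmstat") as f:
        vm = parse_vmstat(f.read())
    with open("/proc/net/snmp") as f:
        tcp = parse_snmp_tcp(f.read())
    return vm.get("pswpout", 0), tcp.get("RetransSegs", 0)


def sample_rates(interval: float = 5.0) -> dict[str, float]:
    """Take two readings `interval` seconds apart and return per-second rates."""
    swap_a, retrans_a = read_counters()
    time.sleep(interval)
    swap_b, retrans_b = read_counters()
    return {
        "swap_pageouts_per_s": rate(swap_a, swap_b, interval),
        "tcp_retransmits_per_s": rate(retrans_a, retrans_b, interval),
    }
```

Calling `sample_rates()` in a loop and shipping the result is roughly what a bespoke agent would do. The point of alerting on rates rather than raw counters is that both values only ever increase, so the absolute number says little on its own.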
> Do you also keep tabs on network performance, processes, services, or other metrics?
Yes, per-process and over time; both are useful for post-mortem analysis
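As a rough illustration of per-process tracking over time, the sketch below snapshots cumulative CPU ticks for every process from `/proc/<pid>/stat` on Linux and turns two snapshots into a utilization percentage. The helper names (`parse_proc_stat`, `cpu_percent`, `snapshot`) are hypothetical; the field positions for `utime` and `stime` (fields 14 and 15) follow the proc(5) man page.

```python
from __future__ import annotations

import os


def parse_proc_stat(text: str) -> tuple[int, str, int]:
    """Return (pid, comm, utime + stime in clock ticks) from a /proc/<pid>/stat line.

    The comm field is wrapped in parentheses and may itself contain spaces or
    parentheses, so split on the LAST ')' rather than on whitespace.
    """
    head, _, rest = text.rpartition(")")
    pid_str, _, comm = head.partition("(")
    fields = rest.split()
    # After comm, index 0 is field 3 (state); utime and stime are fields 14 and 15.
    utime, stime = int(fields[11]), int(fields[12])
    return int(pid_str), comm, utime + stime


def cpu_percent(prev_ticks: int, curr_ticks: int, seconds: float,
                ticks_per_second: int) -> float:
    """CPU utilization of one process between two snapshots, as a percentage of one core."""
    if seconds <= 0 or curr_ticks < prev_ticks:
        return 0.0
    cpu_seconds = (curr_ticks - prev_ticks) / ticks_per_second
    return 100.0 * cpu_seconds / seconds


def snapshot() -> dict[int, tuple[str, int]]:
    """Map pid -> (comm, cumulative CPU ticks) for all currently visible processes."""
    result = {}
    for entry in os.listdir("/proc"):
        if not entry.isdigit():
            continue
        try:
            with open(f"/proc/{entry}/stat") as f:
                pid, comm, ticks = parse_proc_stat(f.read())
        except (FileNotFoundError, ProcessLookupError, PermissionError):
            continue  # process exited or is not readable; skip it
        result[pid] = (comm, ticks)
    return result
```

Taking `snapshot()` at a fixed interval, diffing by pid with `cpu_percent(..., ticks_per_second=os.sysconf("SC_CLK_TCK"))`, and storing the result with a timestamp gives the kind of history that helps in a post-mortem: you can see which process was busy just before an incident, not only that the box was.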