
Ask HN: Any books/resources on logging and monitoring? - hgl
I’ve been running a few distributed web services, but apart from some rudimentary nginx access logs, I have no idea how they’re functioning.

I have the following goals and questions about implementing a logging & monitoring system to get better insight into them:

- What are the best practices for instrumenting source code to collect general logs and exceptions?
- How do I determine whether the services and databases are performing efficiently? More specifically, what can I do to discover whether they are doing unnecessary work or have any hotspots?
- Are the servers they run on overloaded? If so, what is overloading them?
- How do I know if someone is trying to break into the servers?
- How can I be alerted whenever one of the bad things mentioned above happens?

And then there is the business-logic side of things, like how many users are online, how many transactions are currently being processed, etc. I don’t suppose directly querying the production database is a good idea.

My own research online surfaced a great many tools, like Prometheus, the ELK stack, fluentd, Nagios, Bugsnag, New Relic, Datadog, etc., which overwhelmed me, and I reckon that without a good understanding of logging and monitoring in general, I’m likely to pick the wrong tools.

This feels like a really big topic. Are there any books/resources that give a comprehensive introduction?
======
sid-
[https://dzone.com/articles/distributed-tracing-with-zipkin-a...](https://dzone.com/articles/distributed-tracing-with-zipkin-and-elk)
[https://github.com/jaegertracing/jaeger](https://github.com/jaegertracing/jaeger)
A quick search yielded these interesting projects.

~~~
hgl
Thanks, but as mentioned in the post, without a good understanding it's really
difficult to choose tools.

I'm looking for materials that put these tools into context & perspective so
we can make a more informed judgement call.

------
jdale27
The Google SRE book (online here:
[https://landing.google.com/sre/sre-book/toc/index.html](https://landing.google.com/sre/sre-book/toc/index.html))
might be useful. Specifically chapters 6 and 10 on monitoring and alerting.

~~~
hgl
This seems like a pretty comprehensive book. Will check out. Thanks.

