Show HN: Homelab Monitoring Setup with Grafana (randombits.host)
155 points by conor_f on June 7, 2023 | hide | past | favorite | 83 comments



I have been self-hosting about 30 services for years; 3 of them are vital (Bitwarden, Home Assistant and Pi-hole).

I work in IT, I am a geek so I tried a few monitoring systems and wrote two myself.

Then I realized that I have self-sustaining, 24/7 monitoring agents: wife and children.

I gave up trying to have the right stack and just wait for them to yell.

Seriously: it works great and it made me wonder WHY I am trying to monitor. Turns out it is more for the fun of discovering tools than a real need at home.


Reminds me of the (possibly apocryphal) monitoring that was in place when Healthcare.gov was launched: they had a TV tuned to the news, and the news would tell them whenever the site crashed!


Oh hey, that was basically my job! After getting out of grad school, I worked overnight for an HHS contractor writing media summaries for the Secretary. Every night, we would gather all the news stories about the topics HHS told us they wanted to track. At 6:00am, we'd publish the Secretary's briefing and at 7:30am we'd publish the org-wide briefing. To this day, I possess more trivia about the Healthcare.gov launch than I will ever need for any conceivable reason.


I monitor so that when there is a problem I have some data I can use to troubleshoot the problem and identify possible solutions.


This is the first time I've heard a parent refer to their currently-living-at-home child as "self-sustaining".


Usually that means they know how to microwave their own hot pockets and open a can of spaghetti-Os.


Fortunately we live in France so we have fewer of these monstrosities :)


But then how do you fatten your children to keep them docile and slow? :)

Different parenting styles, I guess.


It depends on the age.

When they were young they were definitely not self sustaining.

As teenagers they now live on food (either provided when it meets their standards, or the one they cook themselves), water and wi-fi.


This confirms to me what I suspected when I was trying to determine whether to host my own Grafana stack or use the Grafana Cloud free tier - that I'd end up spending a ton of time fiddling with a constellation of services I didn't actually care about that I could spend on the projects and services I do care about.

I've not found it too hard to stay within the limits of the free tier. The 10 dashboards limit is the main one that actually constrains me, but I just put more stuff on each dashboard and live with the scrolling. The free retention is not great but it's good enough for my purposes.


IIRC Grafana Cloud requires using their importer, which was a non-starter for me.

Also 14 days retention is not useful for home, I want to know temperature and power stats from last winter, not from last 2 weeks.

Even the "first paid" tier contains only 13 months of retention

I just used VictoriaMetrics all-in-one binary for home stuff + grafana as visualisation


I use Grafana Cloud with OpenTelemetry without problems.


What database do you use for storing metrics?


If you are running Kubernetes in your homelab then, for better or for worse, the Prometheus helm chart abstracts all of this away. The default Helm values worked perfectly for me to gather metrics from my cluster and make a quick dashboard in Grafana. Other than increasing the default size of the Prometheus storage volume and configuring the node exporter for a non-Kubernetes host I wanted metrics from, I didn't have to touch anything.


Alternatively the 3 chosen tools to ship metrics and logs (CAdvisor, promtail, node-exporter) could be replaced with an all-in-one tool such as Vector or Telegraf. If you wanted to slim it down further, Netdata accomplishes what those 3 tools and Prometheus can do in a really nice UI.

If the poster hosted those services in a single node k3s or something, the kube-prometheus-stack helm chart is able to deploy a lot of those tools easily.


> This confirms to me what I suspected when I was trying to determine whether to host my own Grafana stack or use the Grafana Cloud free tier - that I'd end up spending a ton of time fiddling with a constellation of services I didn't actually care about that I could spend on the projects and services I do care about.

This. Although it can be fun to learn, I've done that, got the t-shirt (literally from a conference)


From my experience, once set up it's pretty much a "never touch aside from updates" type of deal. I had one stack based on InfluxDB for 5+ years, then changed the backend to VictoriaMetrics (mostly because a lot of stuff supports Prometheus-likes), and again, not much touching after setup.

But I did similar stuff for work so I already had the skills.


It has more moving parts than it should, but it's not so bad - I set up an equivalent locally in one evening. Sure beats keeping my private server logs off-site.


I'm in the process of building out a Grafana stack (Prometheus, Loki, Tempo, Mimir, Grafana) for my day job right now.

...and also for one of my side projects, OSRBeyond.

It's easy to get overwhelmed by all the moving pieces, but it's also a lot of _fun_ to set up.


> It's easy to get overwhelmed by all the moving pieces

Exactly my thoughts! Isn't there something (open source and as good as Prometheus+Grafana) that doesn't have as many moving parts as the stack used by OP? I can imagine there are many use cases for that: from side projects (homelabs) to small startups that don't have huge distributed systems, but still need monitoring (without relying on third-parties).

Ideally, my setup would be:

- install an agent on each server I'm interested in gathering metrics from. In this regard, Prometheus works just fine

- one service to handle logs/metrics/traces ingestion and that allows you to search and visualize your stuff in nice dashboards. Grafana works, but it doesn't support logs and traces out of the box (you need Loki for that)

So, basically 2 pieces of software (if they can be installed by just dropping a binary, even better)


I think there's nothing currently that combines both logging and metrics into one easy package and visualizes it, but it's also something I would love to have.

Vector[1] would work as the agent, being able to collect both logs and metrics. But the issue would then be storing it. I'm assuming the Elastic Stack might now be able to do both, but it's just too heavy to deal with in a small setup.

A couple of months ago I took a brief look at that when setting up logging for my own homelab (https://pv.wtf/posts/logging-and-the-homelab). Mostly looking at the memory usage to fit it on my synology. Quickwit[2] and Log-Store[3] both come with built in web interfaces that reduce the need for grafana, but neither of them do metrics.

- [1] https://vector.dev
- [2] https://quickwit.io/
- [3] https://log-store.com/


Nice experiment.

Side note: it should be possible to tweak some config parameters to optimize the memory usage or cpu usage of quickwit. Ask us on the discord server next time :)


Thanks!

Yeah, I was a little bit surprised it was so close. And I've been using tantivy (the library which powers quickwit afaik) in another side project where it used comparatively less.

Might jump in there then as an excuse to fiddle a bit more with the homelab soon then :)


You should try OpenObserve. It combines logging, metrics and dashboards (and traces) in one single binary/container



Does using vector commit you to DataDog in any way?


Not at all.


I use Telegraf (collector) + Influx (storage) + Grafana (visualization and alerting). Telegraf is amazingly simple to use for collection and has a ton of plugins available.


I also started with that stack, but swapped out InfluxDB for Postgres + TimescaleDB extension, which adds timeseries workflows (transparent partitioning, compression, data retention, continuous aggregates, …).

I found InfluxDB to be lacking in terms of permissions management, flexibility regarding queries (SQL, joins), data retention, ability to debug problems. In Postgres, for example, I can look into the execution plan of a statement, log long running queries, and so on.

Telegraf as an agent is very flexible; it has input plugins for every task I could want, and besides its default "pull workflow" (checks on a defined interval) I also like to push new metrics directly to the Telegraf inputs.socket plugin from my scripts (backup stats, …).
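A minimal sketch of that push workflow from a script, assuming Telegraf is configured with a socket listener input on UDP port 8094 that accepts InfluxDB line protocol. The address, port, and the `push_metric` helper name are assumptions for illustration, not details from the comment above.

```python
import socket

# Assumption: Telegraf listens for line protocol over UDP at this address,
# e.g. via a socket listener input configured with data_format = "influx".
TELEGRAF_ADDR = ("127.0.0.1", 8094)


def push_metric(line: str, addr=TELEGRAF_ADDR) -> int:
    """Send one line-protocol record as a UDP datagram; returns bytes sent."""
    payload = (line.rstrip("\n") + "\n").encode("utf-8")
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        return sock.sendto(payload, addr)
```

A backup script could then finish with something like `push_metric("backup,job=nightly duration_s=42.5")`. UDP is fire-and-forget, so a missing listener will not make the script fail, which may or may not be what you want.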


How do you get data from Telegraf into Postgres/TimescaleDB?

I was interested in swapping out InfluxDB, but it turned out to be somewhat difficult to send data from Telegraf to Postgres. It's not as simple as making an HTTP post, like you can do with InfluxDB.


+1 for Telegraf (with Prometheus and Grafana), rolled out a monitoring stack for our internal network in something like 2 days when a colleague had been manually checking `top` for years each morning. Huge benefit.


My simple-as-dirt solution is generally to use InfluxDB + Grafana. InfluxDB provides a nice HTTP interface that all of my devices simply POST to. I write all the queries myself, because I find that it's a heck of lot easier than to track down individual agents/plugins that actually work.
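For illustration, a rough sketch of what "devices simply POST" can look like, assuming an InfluxDB 1.x-style `/write?db=...` endpoint. The host name, database name, and helper names are hypothetical, and the escaping only covers the common cases of the line protocol.

```python
import urllib.parse
import urllib.request

# Hypothetical endpoint and database name; adjust to your own instance.
INFLUX_URL = "http://influxdb.local:8086/write"
DATABASE = "homelab"


def _escape(value: str) -> str:
    """Escape commas, equals signs and spaces in tag keys/values and field keys."""
    return value.replace(",", "\\,").replace("=", "\\=").replace(" ", "\\ ")


def _format_field(value) -> str:
    """Render a field value: booleans, integers (i suffix), floats, quoted strings."""
    if isinstance(value, bool):
        return "true" if value else "false"
    if isinstance(value, int):
        return f"{value}i"
    if isinstance(value, float):
        return repr(value)
    escaped = str(value).replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def to_line(measurement, tags, fields, timestamp_ns=None) -> str:
    """Build one line-protocol record: measurement,tags fields [timestamp]."""
    name = measurement.replace(",", "\\,").replace(" ", "\\ ")
    tag_part = "".join(
        f",{_escape(k)}={_escape(str(v))}" for k, v in sorted(tags.items())
    )
    field_part = ",".join(f"{_escape(k)}={_format_field(v)}" for k, v in fields.items())
    line = f"{name}{tag_part} {field_part}"
    if timestamp_ns is not None:
        line += f" {timestamp_ns}"
    return line


def post_point(line: str) -> int:
    """POST a single record to the write endpoint and return the HTTP status."""
    query = urllib.parse.urlencode({"db": DATABASE, "precision": "ns"})
    request = urllib.request.Request(
        f"{INFLUX_URL}?{query}", data=line.encode("utf-8"), method="POST"
    )
    with urllib.request.urlopen(request, timeout=5) as response:
        return response.status
```

A sensor script would call `post_point(to_line("temperature", {"room": "attic"}, {"celsius": 21.5}))`. InfluxDB 2.x uses a different write path and token authentication, so this sketch would need adjusting there.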


The closest thing may well be Elasticsearch (and Kibana for visualisation), if you are fine with the Elastic license. As its document format is very flexible, it can be used to store logs, metrics, and traces. It'll be a solution inferior to specialised tools like Prometheus, Mimir, Tempo, though. And some may be put off by the difficulty of running Elasticsearch.

Alternatives could be other general purpose databases.


Telegraf is a single agent that collects a nice amount of metrics and sends them to many databases. I prefer to use Telegraf (and scripts) to collect the metrics into InfluxDB and then Grafana.

Telegraf has some log parsing/extraction functionality, but for something more generic promtail+loki would be better.


> Grafana works, but it doesn't support logs and traces out of the box

Grafana doesn't support anything out of the box by that logic. Before you get any viz in Grafana you have to add a data source cmon.


I've had great success using this helm chart to install the entire stack into my EKS clusters. Even if you're not using Kubernetes, it's still a useful example for how everything should fit together. https://github.com/prometheus-community/helm-charts/tree/mai...

Good luck! It's a lot.


Ditto. I've recently completed a migration from Thanos to Mimir, and I've found that it's much easier to operate and administer. Still stuck on Elastic for logs but I'm slowly convincing developers Loki can be just as effective.


I've found VictoriaMetrics all-in-one binary to be perfect size for home at the very least for metrics gathering.

It supports Prometheus querying and a few other formats for ingesting, so any knowledge about "how to get data into Prometheus" applies pretty much 1:1, and their own vmagent is pretty advanced. Not related to the company in any way, just a happy user.

https://victoriametrics.com/


I’ll never understand how companies with a UI focused product end up with websites that don’t have any screenshots of the UI. I spent over a minute on the site and couldn’t find a screenshot.


Could you explain why you had that thought? AFAIK VictoriaMetrics is a backend product that is positioned as an alternative to Prometheus/Graphite/InfluxDB/OpenTSDB, perfectly works with Grafana. It has its datasource for Grafana and yes, it has integrated VMUI which you can try at https://play.victoriametrics.com.


The article is about Grafana which is a front-end.

> all-in-one binary

In a topic about Grafana, I expected it to include a front-end if it's all-in-one. To me, all-in-one means everything, not just the database or back-end.


It’s a database


I haven’t started using it yet, but I identified VictoriaMetrics as the first time series database I would try as a replacement for our Wonderware Historian, so I won’t have to use AVEVA’s half-baked web dashboard product and can use Grafana or something else sane instead.


Hey everyone, this is a post I've been working on the past few months about setting up my own monitoring stack with Grafana for my home server.

I'd love your feedback on how this process could be easier for me, some resources on learning the Grafana query languages, and general comments.

Thanks for taking the time to read + engage!


What does the monitoring actually do for you? I've seen these setups, even set one up for myself a few times (either Grafana or similar, such as Netdata or Linode's Longview), but I've not really seen what it does for me beyond the "your disk is almost full" warnings.


I recently set up basic monitoring using Telegraf + Influx + Grafana. Here are the alert triggers, in order of importance (imo):

* ZFS pool errors. Motivator: one of my HDDs failed and it took me a few days to notice. The pool (raidz1) kept chugging along of course.

* HDD and SSD SMART errors

* High HDD and SSD temperatures

* ZFS pool utilization

* High CPU temperature. Motivator: one of my case fans failed and it took a while for me to notice.

* High GPU temperatures. Motivator: I have two GPUs in my tower, one of which I don't really monitor (used for transcoding).

* High (sustained) CPU usage. I track this at the server level, rather than for individual VMs.
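The "sustained" qualifier in the last trigger is what keeps short spikes from paging you. A toy sketch of that condition follows; the class name, the threshold, and the window size are made up for illustration, and in the stack described above this logic would live in the alert rule rather than in a script.

```python
from collections import deque


class SustainedThreshold:
    """Fire only when every one of the last `window` samples exceeds `threshold`."""

    def __init__(self, threshold: float, window: int):
        self.threshold = threshold
        self.samples = deque(maxlen=window)

    def observe(self, value: float) -> bool:
        """Record a sample and report whether the alert condition currently holds."""
        self.samples.append(value)
        window_full = len(self.samples) == self.samples.maxlen
        return window_full and all(s > self.threshold for s in self.samples)
```

With a window of 3, the readings 95, 97, 50, 92, 93, 99 only trigger on the final sample, because the dip to 50 resets the streak.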


Setting an email address you actually check in /root/.forward would provide most of this, and all of it with the addition of low-tens of lines of shell script and a cron job or two, no? I get that tastes vary, but adding more services to worry about & keep updated to my home server(s) is not my idea of a good time. I doubt the custom pieces required to get all of those alerts via email would take longer than installing and configuring that stack, and then the maintenance is likely to be zero for so long that you'll probably replace the hardware before it needs to be touched again (... and if you scripted your setup, it'll very likely Just Work on the replacement)
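As a rough sketch of the cron-plus-email approach, here is a disk usage check, written in Python rather than shell. It relies on cron mailing a job's output to the job owner when local mail is configured, which is what the .forward file then redirects. The threshold and mount list are illustrative values, not taken from the thread.

```python
import shutil

# Illustrative values, not taken from the thread.
THRESHOLD_PERCENT = 90.0
MOUNT_POINTS = ["/"]


def percent_used(total: int, used: int) -> float:
    """Return used space as a percentage of total (0.0 for empty filesystems)."""
    return 100.0 * used / total if total else 0.0


def check(mount: str, threshold: float = THRESHOLD_PERCENT):
    """Return a warning string if the mount is at or above the threshold, else None."""
    usage = shutil.disk_usage(mount)
    pct = percent_used(usage.total, usage.used)
    if pct >= threshold:
        return f"WARNING: {mount} is {pct:.1f}% full"
    return None


def main() -> int:
    """Print one line per problem; silence means everything is fine."""
    warnings = [w for w in (check(m) for m in MOUNT_POINTS) if w]
    for warning in warnings:
        print(warning)  # cron mails this output to the job's owner
    return 1 if warnings else 0
```

Saved as a script that ends with `raise SystemExit(main())` and run from an hourly crontab entry, this stays silent until a mount crosses the threshold. SMART, ZFS, and temperature checks would need their own commands and parsing.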


Oh definitely, but only if you are not interested in the visualization side.

I wanted the ability to quickly see the current & historical state of these and other metrics, not just configure alerts.

I’m also omitting the fact that I have collectors running inside different VMs on the same host. For example, I have Telegraf running on Windows to collect GPU stats.


Ah, yeah, that probably won't be enough for you then. Need Windows monitoring, and want the graphs—yeah, much bigger pain to get anything like that working via email.


Continuous performance monitoring of a service, from its inception. I'm building a storage service using SeaweedFS and also a web UI for another project. One thing I'm looking at doing is using k6[1] in order to do performance stress testing of API endpoints and web frontends on a continuous basis under various conditions.[2] For example, I'm trying to lean hard into using R2/S3 for storage offload, so my question is: "What does it look like when Seaweed offloads a local volume chunk to S3 aggressively, and what is the impact of that in a 90/10 hot/cold split on objects?" Maybe 90/10 storage splits are too aggressive or optimistic to hit a specific number. Every so often -- maybe every day at certain points, or a bigger global test once a week -- you run k6 against all these endpoints, record the results, and shuffle them into Prometheus so you can see if things get noticeably worse for the user. Test login flows under bad conditions, when objects they request are really cold or large paginations occur, etc.

You can run numbers manually but I think designing for it up front is really important to keep performance targets on lock. That's where Prometheus and Grafana come in. And I think looking at performance numbers is a really good way to help understand systems dynamics and helps you ask why something is hitting some threshold. On the other hand, there are so many tools and they're often fun to play with, it's easy to get carried away. There's also a pretty reasonable amount of complexity involved in setting it up, so it's also easy to just say fuck it a lot of times and respond to issues on demand instead.

[1] http://k6.io/, it's also a Grafana project.

[2] It can test both normal REST endpoints but also browsers thanks to the use of headless chrome/chromium! So you can actually look at first paint latency and things like that too.


I have been using Zabbix to monitor my servers for the last few years, since I wanted something simple and this Grafana/Prometheus stack always scared me because, as the OP says, of the amount of “moving parts”.

Zabbix has been quite solid and has lots of templates for different servers (Linux, Windows, etc.), triggers, and can also monitor Docker containers (although I never tried that).

The only thing Zabbix can't do well is log file monitoring, so I am considering something like an ELK stack as an addition.


I utterly dislike Zabbix (enough to log in here and complain). I guess that if it fits your needs, it's all good and fine, but as someone who has been in charge of defining and feeding it with LLD rules and registering multi-dimension metrics with Zabbix Sender, I feel scarred by it.

I cannot find my way around the Zabbix web interface either, and most of the templates, rules and macros system confused me deeply.

On the other hand we have a Prometheus + Grafana stack for another system and the model makes all the sense to me. I guess that there is something in time series and graph plotting that just clicks with me.


I've monitored two whole sites with Zabbix, dozens of servers at different companies, and everyone was very satisfied with it, myself included. Zabbix has extensive documentation; no one should be confused after reading the manual, as everything is explained there. I've fed it through zabbix_sender too, and while it can be a complicated setup, if you design it well it will seldom need maintenance.


For a homelab use case, Zabbix is all you need. No Grafana or ELK or Kibana or other overcomplicated solution, just a Zabbix instance. Simple TCP checks will cover most of your services, and there's web monitoring for special cases. Beyond that and CPU/RAM/Storage monitoring, there's nothing much else to do.
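To make "simple TCP check" concrete: it amounts to trying to open a connection and seeing whether it succeeds within a timeout. The sketch below shows the idea in plain Python. It is not how Zabbix implements its checks, and the function name is made up.

```python
import socket


def tcp_check(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be opened within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False
```

A check like this tells you the port is accepting connections, not that the service behind it is healthy, which is why web monitoring is mentioned for the special cases.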


I'm satisfied with Zabbix too. With something like what OP described, I'd always be worried some integration between all these 'moving pieces' could break and my monitoring would be down without me knowing. Definitely appreciate simplicity with regards to monitoring.


I use Zabbix and Grafana. Grafana has a zabbix data source plugin so you can have best of both worlds really.


Mildly related: can anyone recommend a time series database that supports easy aggregation by week (with the ability to configure the start of the week) and month? I'm looking for something to switch from InfluxDB which I'm currently using. The linked article is using Prometheus which also doesn't appear to support this functionality.


You could take a look at Postgres + TimescaleDB extension, which offers a nice time_bucket() function on its hypertables[1]. You can also materialize using continuous aggregates („self updating“ materialized views).

1: https://docs.timescale.com/api/latest/hyperfunctions/time_bu...


Thanks, this looks exactly what I want. Sensible interval origins [1] too (January 1, 2000 for months and years, and January 3 2000, a Monday, for weeks) and also configurable.

[1] https://docs.timescale.com/use-timescale/latest/time-buckets...
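The arithmetic behind a configurable week start is small enough to sketch. This is a plain-Python illustration of bucketing with an origin for fixed-width intervals, not TimescaleDB's implementation; the function here only borrows the `time_bucket` name, and the default origin is the Monday mentioned above.

```python
from collections import defaultdict
from datetime import datetime, timedelta

# Monday, January 3, 2000: the default origin for week buckets mentioned above.
DEFAULT_ORIGIN = datetime(2000, 1, 3)


def time_bucket(width: timedelta, ts: datetime, origin: datetime = DEFAULT_ORIGIN) -> datetime:
    """Return the start of the fixed-width bucket containing ts, aligned to origin."""
    whole_buckets = (ts - origin) // width  # floor division, also correct before origin
    return origin + whole_buckets * width


def sum_by_week(samples, origin: datetime = DEFAULT_ORIGIN) -> dict:
    """Sum (timestamp, value) pairs per week, keyed by the start of each week."""
    totals = defaultdict(float)
    for ts, value in samples:
        totals[time_bucket(timedelta(weeks=1), ts, origin)] += value
    return dict(totals)
```

Passing `origin=datetime(2000, 1, 2)` (a Sunday) shifts every week to start on Sunday. Months are not fixed-width, so they need calendar-aware handling rather than this arithmetic.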


would love an answer to this as well! something with great Python (Flask, maybe even SQLAlchemy) support would be cool too


Is there anything easier for logs? Basically glorified ripgrep?


Don't use Grafana for logs. Its interface (like basically any other, including Kibana or Graylog) sucks, so I use Metabase, which has a much friendlier interface that shows logs in a tabular format instead of throwing raw logs at you like the others.

I collect logs with Vector on each instance and send them to a central ClickHouse, which Metabase reads from.


Interesting. I am already trying that with dBeaver and ClickHouse with docker.

Used this tutorial:

https://clickhouse.com/docs/en/integrations/vector

My services usually produce around 2GB of log data per day. From a quick read on CH I believe it should not be a problem. Not sure how big the database is, but the zip-compressed log data is around that size for an entire month.


I think Loki is pretty much the easiest thing you can find (if you want it to be multi-server, at least). Loki's whole approach comes down to avoiding expensive indexing (compared to Elasticsearch et al.) and relying on "grep" for searching instead.
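A toy model of that trade-off, for illustration only: index just the labels that identify a log stream, and brute-force scan the lines of the matching streams at query time. The class and method names are invented, and this is not Loki's code or API.

```python
import re
from collections import defaultdict


class TinyLogStore:
    """Index only stream labels; line contents are scanned at query time."""

    def __init__(self):
        # Key: frozenset of (label, value) pairs; value: raw lines in arrival order.
        self._streams = defaultdict(list)

    def push(self, labels: dict, line: str) -> None:
        """Append a line to the stream identified by its label set (cheap, no text index)."""
        self._streams[frozenset(labels.items())].append(line)

    def query(self, selector: dict, pattern: str) -> list:
        """Pick streams whose labels include the selector, then grep their lines."""
        wanted = set(selector.items())
        regex = re.compile(pattern)
        return [
            line
            for labels, lines in self._streams.items()
            if wanted <= labels
            for line in lines
            if regex.search(line)
        ]
```

Ingestion stays cheap because nothing inside the line is indexed; the cost moves to query time, where a narrow label selector keeps the scan small.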


What I hate about Elastic is its special grep syntax that I can never get right...

I tried loki around v1.0 and it didn't seem to offer much back then...


check out netdata if y'all haven't already - incredible software


People should realize the best part of Netdata is that it can export data to be stored for Grafana to consume.

I don't like its own UI but no need to use it and it can easily gather metrics from systemd services and containers.


I recently set up packet loss monitoring on a Raspberry Pi, using Prometheus for logging and graphing.

https://video.nstr.no/w/hjTH3Vggn2fvpTrQitMmVP

I would like to set up Grafana and more monitoring as well, on some of my other machines. But for now this is what I have :D


Shameless plug for AppScope (https://github.com/criblio/appscope) which is designed for exactly this. Capturing observability data from processes in your environment without code modification, and shipping the data off to tools like grafana for monitoring.


+1 on AppScope!


Has anyone had lots of trouble configuring Grafana via YAML from the documentation? A lot of it is kind of hard to follow.

I've found that the ability to (pre)configure Grafana without clicking around in it is pretty difficult.


I recommend using your preferred flavor of configuration management tool. It is tricky, especially when you want to provision multiple users in different Grafana organizations, data sources, and their dashboards, but it can be done (I prefer Puppet because of its flexible language, but Ansible should also work).


Shameless plug for uptimeFunk (https://uptimefunk.com), which I soft-launched some time ago. I wanted some uptime monitoring with a nice UI and a few advanced features that I didn't find anywhere:

- monitoring MongoDB/replica set status

- monitoring sql databases with basic sql queries

- monitoring host cpu, ram and disk usage

- monitoring docker containers

- and being able to monitor all of this through ssh tunnels because not all my services are on the internet


1. Looks nice

2. Where are some documentation pages showing how simple it is to set up, for example, monitoring a SQL database?

3. Expensive

4. No alerts?


We've been using Nagios and Munin for years; this stack is rock solid. We recently added ELK. This feels overkill, heavyweight and fragile.


This is for a homelab. I think overkill is the point.


You can configure some Nagios addons for collecting performance metrics, but it is better to have a single, efficient, granular-enough metrics collector. And time series monitoring helps a bit with being proactive about bad trends, or with giving good context for past events.


I went down the Grafana rabbit hole, and without a doubt, it's a fantastic tool. It can handle just about any kind of data you throw at it, and when it comes to visualizing time series data, it's second to none. That said, it's a slog to set up and configure, but once finished, I had a beautiful dashboard for my home media server, and life was good. Unfortunately, a few months later, I was forced to upgrade and lacked the time to reconfigure Grafana. So, as a stopgap, I installed Netdata... fast-forward two years, and today I still haven't reconfigured Grafana, nor do I plan to.

For my use case, a home media server, Netdata turned out to be way simpler to set up, and, most importantly, way less of a hassle/dink-around. It's a basic plug-and-play operation with auto-discovery. While the dashboard isn't nearly as beautiful or configurable, it gets the job done and provides everything I pretty much need or want. It offers a quick overview, historical metrics (over a year of data) to analyze trends or spot potential issues, and push/email notifications if something goes awry.

If you decide to go down this route, there are two major items:

1. You'll need to configure the dbengine[1] database to save and store historical metric data. However, I found the dbengine configuration documentation to be a bit confusing, so I'll spare you the trouble - just use this Jupyter Notebook[2]. If needed, adjust the input, run it, scroll down, and you'll see a summary of the number of days, the maximum dbengine size, and the yaml config, which you can copy, paste, and voila.

2. If you're hoarding data, you'll probably want to set up smartmontools/smartd[3] in a separate Docker container for better disk monitoring metrics. However, I think you can enable hddtemp[4] with Netdata through the config if you don't want or need the extra hassle. You can have Netdata query this smartd container, but with a handful of disks it ends up timing out frequently, so I found it's best to simply set up smartd/smartd.conf to log the smartd data independently. Then all you need to do is tell Netdata where to find the smartd_log[5], and Netdata handles the rest.

Boom, home media server metrics with historical data, done. It still takes a bit of time to set up, but way less than Grafana. Anywho, hopefully, this saves you from wasting as much time as I did. And if you're looking for a smartd reference, shoot me a reply, and I'll tidy up and share my Docker config/scripts and notes.

[1] https://learn.netdata.cloud/docs/typical-netdata-agent-confi...

[2] https://colab.research.google.com/github/andrewm4894/netdata...

[3] https://www.smartmontools.org/wiki

[4] https://github.com/vitlav/hddtemp

[5] https://learn.netdata.cloud/docs/data-collection/storage,-mo...


Is there a way to self-host the Netdata "cloud"?


Unfortunately, no. At the same time, you don't need their "cloud" to use or run an instance of netdata. From what I gather, it's more or less intended to help monitor multiple instances of netdata on different machines. When I first installed netdata, I initially thought it required the use of their cloud to store historical metrics, but this is not the case. You just need to configure the dbengine, as I mentioned in my post, and you'll be good to go without their cloud.


There's an Open Source version: https://github.com/netdata/netdata

I don't know if it has the same features or not, but it looks like you can set it up yourself.


Just push to github and people will contribute the rest for you. Easy!


With 40 containers I would go kubernetes and with Kube stack you basically have this up and running in 5 minutes.

Aligning metric endpoints for fine-tuning.

Add tracing to it in a few more clicks



