Like: The interesting tools you find when people "show" their tool, and then everyone mentions a bunch of other similar tools in the comments.
Dislike: Down voting. Down voting just makes for an echo chamber (as someone else already mentioned). It's a terrible, nasty way of punishing people whose posts don't fit the popular clique's viewpoints. It should be replaced with either a mandatory "this is why I down voted" comment, or flags that everyone can see, like "off topic", "insulting", "spam", or things like that. At the least, it should cost karma to down vote. Oh, and if you down vote, or flag, and it gets reversed, you should lose karma.
Not sure how I haven't run across it before, but this is the first time I've tried using Netdata. Looks like it is very good for metrics, at least in the 10 minutes I have spent installing it on my local desktop and poking around the UI there.
I'm not seeing anything in it for logs, though. I'm guessing it doesn't aggregate or do anything with logs? What do you use for log aggregation and analysis?
I'm very interested because I've been getting frustrated with the ELK Stack, and the Prometheus/Grafana/Loki stack has never worked for me. I'm really close to trying to reinvent the wheel...
If you want a system for logs that is easy to install, maintain, and use, then take a look at VictoriaLogs [1], which I'm working on. It is just a single, relatively small binary (around 10MB) without external dependencies. It supports both structured and unstructured logs. It provides an intuitive query language - LogsQL [2]. It integrates well with good old command-line tools (such as grep, head, jq, wc, sort, etc.) via unix pipes [3].
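For example, a rough sketch of that kind of pipeline (assuming a local instance on the default port and the /select/logsql/query HTTP endpoint; the query and field name here are only illustrative):

```sh
# Pull matching log entries from VictoriaLogs and summarize the most common
# messages with ordinary unix tools. Each result line is a JSON log entry.
curl -s http://localhost:9428/select/logsql/query -d 'query=error' |
  head -n 1000 |        # look at the first 1000 matching entries
  jq -r '._msg' |       # extract the message field from each JSON line
  sort | uniq -c | sort -rn
```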
Prometheus has become ubiquitous for a reason. Exporting metrics on a basic http endpoint for scraping is as simple as you can get.
Service discovery adds some complexity, but if you’re operating with any amount of scale that involves dynamically scaling machines then it’s also the simplest model available so far.
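To illustrate the model (a minimal sketch; the targets and file path are placeholders): a scrape config is just a list of endpoints, and service discovery slots in next to static targets:

```yaml
# prometheus.yml (sketch)
scrape_configs:
  - job_name: node
    static_configs:
      - targets: ["10.0.0.5:9100", "10.0.0.6:9100"]   # node_exporter endpoints
  - job_name: app
    # file-based service discovery: Prometheus re-reads these files as they
    # change, so targets can be added or removed without restarting Prometheus
    file_sd_configs:
      - files: ["/etc/prometheus/targets/app-*.json"]
```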
What about it doesn’t work for you?
Edit: I didn’t touch on logging because the post is about metrics. Personally I’ve enjoyed using Loki better than ELK/EFK, but it does have tradeoffs. I’d still be interested to hear why it doesn’t work, so I can keep that in mind when recommending solutions in the future.
Last time I tried Prometheus was years ago. So I don't know how much might have changed... I gave it a good month or two of effort trying to get the stack to do what I needed and never really succeeded.
Just my opinion, but I honestly don't think the scraping model makes much sense. It requires you expose extra ports and paths on your servers that the push model doesn't require. I'm not a fan of the extra effort required to keep those ports and paths secure.
Beyond that, PromQL is an extra learning curve that I didn't like. I still ran into disk space issues when I used a proper data backend (TimescaleDB). Configuring all the scrapers was overly complicated, and making sure to deploy all the collectors with the needed configuration was just as involved.
In comparison, deploying Filebeat and Metricbeat is super simple, just configure the yaml file via something like Ansible and you're done. Elastic Agent is annoying in that you can't do that when using Fleet, or at least I have yet to figure out how to automate it. But it's still way easier than the Prometheus stack.
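For reference, a minimal metricbeat.yml is roughly this (a sketch only; the host and credentials are placeholders you'd template out with Ansible):

```yaml
# metricbeat.yml (sketch)
metricbeat.modules:
  - module: system
    metricsets: ["cpu", "memory", "filesystem", "diskio"]
    period: 30s
output.elasticsearch:
  hosts: ["https://elastic.example.internal:9200"]   # placeholder host
  username: "metricbeat_writer"                      # placeholder credentials
  password: "${METRICBEAT_PASSWORD}"
```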
I've tried to get Loki to work 2 or 3 times. Never have really succeeded. I think I was able to browse a few log lines during one attempt, I don't think I even got that far in the other attempts... The impression I came away with was that it was designed to be run by people with lots of experience with it. Either that, or it just wasn't actually ready to be used by anyone not actively developing it.
So, yeah, while I figure a lot of people do well with the Prometheus/Grafana/Loki stack, it just isn't for me.
The most basic setup, and the one typically used until you need something more advanced, is using Prometheus for scraping and as the TSDB backend. If you ever decide to revisit prometheus, you’ll likely have better luck starting with this approach, rather than implementing your own scraping or involving TimescaleDB at all (at least until you have a working monitoring stack).
There used to be a connector called Promscale that was for sending metrics data from Prometheus to Timescale (using Prometheus’ remote_write) but it was deprecated earlier this year.
Also important to add: using prometheus as the tsdb is good for short term use (on the order of days to months). For longer retention you could offload it elsewhere, like another Prometheus-based backend or something else SQL-based, etc
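For example, shipping to a long-term store is just a remote_write block (a sketch; the URL is a placeholder), with local retention set separately via the --storage.tsdb.retention.time flag:

```yaml
# prometheus.yml (sketch)
remote_write:
  - url: "https://metrics-longterm.example.internal/api/v1/write"   # placeholder endpoint
```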
We also have some specific log collectors - I think here might be the best place to look around at the moment; it should take you to the logs part of the integrations section in our demo space (no login needed, sorry for the long horrible URL, we're adding this section to our docs soon but at the moment it only lives in the app).
Nice to see that the log analysis is being worked on.
I'll see if I can figure out the integrations you pointed out. They look more like they are aimed at monitoring the metrics of the tools, not using the tools to aggregate logs. Right?
The way most ops systems treat logs and metrics as completely separate areas has always struck me as odd. Both are related to each other, and having them in the same system should be default. That's why I've put as much effort into the ELK Stack as I have. They've seemed to be the only ones who have really grasped that idea. (Though it's been a year or two since I've really surveyed the space...)
One question, not log related: is it required to sign up for a cloud account to get multiple nodes displaying on the same screen? From the docs on streaming, I think you can configure nodes to send data to a parent node without a cloud account, but either I haven't configured it properly yet or something else is in the way, since the node I'm trying to set up as a parent isn't showing anything from the child node.
FYI, you need to add the api-key config section to the stream.conf file on the parent node in order to enable the api key and allow child nodes to send data to the parent node. I thought it went into the netdata.conf file... I also kinda wonder why it matters what file has what config since the different config sections all have section headings like `[stream]` or `[web]`.
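For anyone else who hits this, here's roughly what ended up working for me (a sketch; the UUID is a placeholder you'd generate yourself, e.g. with uuidgen, and the parent hostname is made up):

```
# parent: /etc/netdata/stream.conf
[11111111-2222-3333-4444-555555555555]
    enabled = yes

# child: /etc/netdata/stream.conf
[stream]
    enabled = yes
    destination = parent.example.lan:19999
    api key = 11111111-2222-3333-4444-555555555555
```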
So, the answer to my question is that you can get multiple nodes showing up without a cloud account. Just have to configure it correctly.
I have used https://github.com/openobserve/openobserve in several hobby projects and liked it. It's an all-in-one solution. It's likely less featureful than many others, but a single binary with everything in one place pulled me in, and it has worked for me so far.
I'm not sure if the version in use at $workplace is out of date or incorrectly configured, but it is a dreadful Prometheus client in that it doesn't use labels; it just shovels all the metadata into the metric name like a 1935-style Graphite install, making most of the typical Prometheus goodness impossible to use.
From my experience, there are no silver bullets. Let metrics software do metrics and log software do logs.
At the very least at the database level. Maybe we will get a visualisation engine that merges both nicely, but database-wise the types of data couldn't be more different.
Makes me wonder, at what point will people start self hosting again, instead of getting dragged around by large companies like Google and Microsoft? I manage my own email server for myself and family. That, plus Nextcloud, gives me everything I'd need from Google/etc. Plus I host a few extra sites and services that GSuite/365 can't replicate. I'm fairly confident I could scale to a few hundred users if I took it seriously. Even more if I had someone else as a backup. The reliance on vendors and SaaS products for everything kinda baffles me.
I'm sure there's an official place I could ask this, but since you are here, maybe you could get some more detailed examples in place for the autoinstall [0] docs?
Specifically I'm about to try using the apt section, and I'm not exactly sure how to translate the curtin docs to the autoinstall format... Especially when it comes to adding keys. And a comment about needing to use exact bytes for normal partitions in the storage section would have saved me a few hours the other day... :)
Oh, and thanks for linking to that Diátaxis framework. I might have use for that at work.
I currently use Vimbadmin to manage domains and addresses for my personal email server. So I'd need to be able to keep using it, or be able to replace it for both the new jmap/imap server and postfix. SQL support would let me continue using it.
That said, I think I'll find some time to give Stalwart JMAP a try. I've been curious about JMAP for a while, and there are a couple things about Dovecot I'm not too fond of...
I can see the appeal. I've been working on containerizing apps at work. There's a lot of complication there. But I also really really want to be able to run updates whenever I want. Not just during a maintenance window when it's ok for services to be offline. And containers are the most likely route to get me there.
At home, well, I enjoy my work, so I containerize all my home stuff as well. And I have a plan to switch my main linux box to ProxMox and then host both a Nomad cluster and a Rancher cluster.
If I weren't interested in the learning experience, I'd just stick with docker-compose based app deployment and Ansible for configuring the docker host vm.
Docker, especially with Portainer, is pretty serviceable for an at-home setup.
Once the line has been crossed with "I should be backing this up" and "I should have a dev copy vs the one I'm using", the existing setup can port quite well into something like ProxMox, keeping everything in a homelab running relatively like an appliance with a minimum of manual system administration, maintenance, and upgrading.
If Docker Swarm was a bit better, I'd have all my test instances at work running a Docker Swarm, Portainer, and Traefik stack. Unfortunately Swarm has some quirks that make running stateful apps a bit difficult.
For home use, I've been experimenting with Portainer. It seems to work well for apps I'm not developing and am just running.
So, for context, my experience is limited to trying to get a MariaDB Galera cluster running, specifically using the Bitnami image, so my issues might not apply to every single stateful app out there. I'm also running all of this on vSphere in our own data center, not in the cloud.
Swarm does not support dependencies between services. See [0]. It also does not support deploying replicas one at a time. See [1] where I'm asking for that support.
In the case of Galera, you need a master node to be up and running before you add any new nodes. I'm pretty sure that when you're initiating any kind of stateful clustered app, you'd want to start with one node at a time to be safe. You can't do that in Swarm using a replicated service. All replicas start at the same time.
Using a service per instance might work, but you need to be sure you have storage figured out so that when you update your stack to add a new service to the stack, the initial service will get the data it was initiated with. (Since when you restart a stack to add the new service, the old service will also get restarted. If I'm remembering what I found correctly.)
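To make the service-per-instance idea concrete, I mean something shaped like this (a sketch only; the image tag, data path, and node labels are my assumptions, and it still runs into the update problems described next):

```yaml
# docker-compose.yml (sketch): one service and one named volume per Galera node,
# each pinned to a specific swarm node so local storage stays consistent.
# Labels would be set beforehand, e.g.: docker node update --label-add galera=0 <node>
services:
  galera-0:
    image: bitnami/mariadb-galera:11.1      # hypothetical tag
    volumes:
      - galera0-data:/bitnami/mariadb       # assumed data path for the Bitnami image
    deploy:
      replicas: 1
      placement:
        constraints:
          - node.labels.galera == 0
  galera-1:
    image: bitnami/mariadb-galera:11.1
    volumes:
      - galera1-data:/bitnami/mariadb
    deploy:
      replicas: 1
      placement:
        constraints:
          - node.labels.galera == 1
volumes:
  galera0-data:
  galera1-data:
```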
Then there's updating services/replicas. You cannot have Swarm kill a service/replica until after the replacement is actually up and running. Which means you'll need to create a new volume every time you need to upgrade, otherwise you'll end up with two instances of your app using the same data.
To complicate things, as far as I can tell, Swarm doesn't yet support CSI plugins. So you're pretty much stuck with local or nfs storage. If you're using local storage when deploying new replicas/services, you better hope the first replica/service starts up on same node it was on before...
All that combined means I haven't figured out how I can run a Galera cluster on Swarm. Even if I use a service per instance, updates are going to fail unless I do some deep customization on the Galera image I'm using to make it use unique data directories per startup. Even if I succeed in that, I'll still have to figure out how to clean out old data... I mean, I could manually add a new volume and service, then manually remove the old volume and service for each instance of Galera I'm running. But at that point, why bother with containers?
Anyway, I'm pretty sure I've done my research and am correct on all of this, but I'd be happy to be proven wrong. Swarm/Portainer/Traefik is a really really nice stack...
If you are interested in making this work within these constraints, I am sure that there is a way to work around all these issues.
About [0]/[1]: I guess you are right in this not working out of the box, but it could possibly be worked around with a custom entrypoint that behaves differently depending on which slot the task is running in.
> (Since when you restart a stack to add the new service, the old service will also get restarted. If I'm remembering what I found correctly.)
Are you sure the Docker Image digest did not change? Have you tried pinning an actual Docker Image digest?
> Then there's updating services/replicas. You cannot have Swarm kill a service/replica until after the replacement is actually up and running. Which means you'll need to create a new volume every time you need to upgrade, otherwise you'll end up with two instances of your app using the same data.
Is this true even with "oder: stop-first"?
> To complicate things, as far as I can tell, Swarm doesn't yet support CSI plugins. So you're pretty much stuck with local or nfs storage. If you're using local storage when deploying new replicas/services, you better hope the first replica/service starts up on same node it was on before...
True, but there are still some volume plugins that work around that, and local storage should work if you use labels to pin the replicas to nodes.
Finally have time to look into your suggestions. Hopefully you check your comments every once in a while...
> Are you sure the Docker Image digest did not change? Have you tried pinning an actual Docker Image digest?
Mostly sure. Many of my tests only changed the docker-compose file, not the actual image. So even though GitLab was rebuilding the image, the image digest would not have changed. I'll try to find time to pin the digest just to double check, though.
> Is this true even with "oder: stop-first"?
Er, did you mean "order"? I only see `--update-order` as a flag on the `docker service update` command. I do not see it in the docker-compose specification. So far all my tests have been through Portainer's stack deployment feature. So all changes are in my docker-compose file.
Maybe it would just work if I stuck it in the `deploy.update_config` section? I'll try.
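Something like this, maybe (a sketch; I haven't verified yet that it actually changes the behaviour I'm seeing):

```yaml
services:
  galera-0:
    # ...
    deploy:
      update_config:
        order: stop-first   # stop the old task before starting its replacement
        parallelism: 1      # update one task at a time
        delay: 30s          # pause between task updates
```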
> True, but there are still some volume plugins that work around that, and local storage should work if you use labels to pin the replicas to nodes.
I have tried pinning specific services to specific nodes to make local storage work. And I've used labels to force only one replica per node when using replicas.
What volume plugins are you thinking of? I haven't found any that seem to be maintained outside of local storage and nfs. And maybe some that would work if I were in some cloud host...
Anyway, thanks for giving me a couple things to try. :)
Yes, we'd have more than one instance of the app running. Ideally 3 or 5 instances. And they'd be load balanced. So I could update the app one instance at a time with no downtime. At least that is the goal.
What options are there for managing persistent storage in Nomad? I've been through the very basic initial getting started tutorial, but haven't had a chance to dig any deeper. When I was trying kube, I found Longhorn to be the one thing I liked. Is there anything like that for Nomad?
I've been trying to get a MariaDB Galera cluster running on Docker Swarm for the past couple weeks and have been somewhat stymied by a few things. Something like Longhorn would help a lot.
Nomad supports CSI like Kubernetes. Unfortunately many CSI storage providers assume kubernetes or adhere to kubernetes behavior even when it deviates from the spec, so there are gotchas.
That being said, they're often glaring. So please report an issue to Nomad and the upstream if you hit one!
Nomad has 2 other storage features:
1. Host volumes, where nodes can advertise a volume by name and jobs can request being placed on nodes with that volume available (rough sketch at the end of this comment). I think a lot of folks run databases this way so they can statically define their database nodes and not worry about the numerous runtime points of failure CSI introduces.
2. Ephemeral disk stickiness and migration: during deployments an instance of a job’s intrinsic local storage can be set to either get reused on the same node (sticky) or migrated to wherever the new instance is placed.
Ephemeral disks are the oldest storage solution and are only best effort (unlike CSI, which may be able to reliably reattach an existing volume to a new node in the case of node failure). However, I've heard of a lot of people happily using them for already-distributed databases like Elasticsearch or Cassandra, where the database can tolerate Nomad's best effort failing.
(Sorry for lack of links as I’m on mobile. Googling the key phrases you care about should get you to our docs quickly and easily)
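Off the top of my head, the host volume wiring looks roughly like this (a sketch from memory, so double check names and paths against the docs; the path and image are placeholders):

```hcl
# Client agent config: advertise a directory on this node as a named host volume.
client {
  host_volume "mariadb" {
    path      = "/opt/mariadb/data"   # placeholder path on the node
    read_only = false
  }
}

# Job file: requesting the volume places the group on a node that advertises it.
group "db" {
  volume "mariadb" {
    type      = "host"
    source    = "mariadb"
    read_only = false
  }

  task "mariadb" {
    driver = "docker"
    config {
      image = "mariadb:11"            # placeholder image
    }
    volume_mount {
      volume      = "mariadb"
      destination = "/var/lib/mysql"
    }
  }
}
```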
I've been building a side project in Symfony for the past couple weeks, so this caught my eye.
Um, it might just be me, but on both Android and Linux Firefox, the job description page doesn't actually display the job description. Just the header and footer. I had to switch over to Chrome/Chromium.