
Monitoring your own infrastructure using Grafana, InfluxDB, and CollectD - crecker
https://serhack.me/articles/monitoring-infrastructure-grafana-influxdb-connectd/
======
avthar
Echoing the sentiment expressed by others here, for a scalable time-series
database that continues to invest in its community and plays well with others,
please check out TimescaleDB.

We (I work at TimescaleDB) recently announced that multi-node TimescaleDB will
be available for free, specifically as a way to keep investing in our
community: [https://blog.timescale.com/blog/multi-node-petabyte-scale-
ti...](https://blog.timescale.com/blog/multi-node-petabyte-scale-time-series-
database-postgresql-free-tsdb/)

Today TimescaleDB outperforms InfluxDB across almost all dimensions (credit
goes to our database team!), especially for high-cardinality workloads:
[https://blog.timescale.com/blog/timescaledb-vs-influxdb-
for-...](https://blog.timescale.com/blog/timescaledb-vs-influxdb-for-time-
series-data-timescale-influx-sql-nosql-36489299877/)

TimescaleDB also works with Grafana, Prometheus, Telegraf, Kafka, Apache
Spark, Tableau, Django, Rails, anything that speaks SQL...

~~~
aeyes
I don't consider TimescaleDB to be a serious contender as long as I need a
2000-line script to install functions and views to have something essential
for time-series data, like dimensions:

[https://github.com/timescale/timescale-
prometheus/blob/maste...](https://github.com/timescale/timescale-
prometheus/blob/master/pkg/pgmodel/migrations/sql/1_base_schema.up.sql)

[https://github.com/timescale/timescale-
prometheus/blob/maste...](https://github.com/timescale/timescale-
prometheus/blob/master/extension/sql/timescale-prometheus.sql)

~~~
cevian
(TimescaleDB engineer here). This is a really unfair critique.

The project you cite is super-optimized for the Prometheus use-case and data
model. TimescaleDB beats InfluxDB on performance even without these
optimizations. It's also not possible to optimize in this way in most other
time-series databases.

These scripts also work hard to give users a UI/UX experience that mimics
PromQL in a lot of ways. This isn't necessary for most projects and is a very
specialized use-case. That takes up a lot of the 2000 lines you are talking
about.

~~~
aeyes
> TimescaleDB beats InfluxDB on performance even without these optimizations

Would you mind sharing the schema used for this comparison? Maybe I missed it
in your documentation of use-cases. When implementing dynamic tags in my own
model, my tests showed that your approach is very necessary.

~~~
cevian
You can look at what we use in our benchmarking tool
[https://github.com/timescale/tsbs](https://github.com/timescale/tsbs)
(results described here [https://blog.timescale.com/blog/timescaledb-vs-
influxdb-for-...](https://blog.timescale.com/blog/timescaledb-vs-influxdb-for-
time-series-data-timescale-influx-sql-nosql-36489299877/)).

It's essentially a table with (time, value, tags_id), where the tags table is
(id, jsonb).

------
linsomniac
I've been doing monitoring of our ~120ish machines for 3-4 years now using
Influx+Telegraf+Grafana, and have been really happy with it. Prior to that we
were using collectd+graphite, and with 1-minute stats it was adding
double-digit percentage utilization on our infrastructure (I don't remember
exactly how much, but I want to say 30% CPU+disk).

InfluxDB has been a real workhorse. We suffered through some of their early
issues, but since then it's been extremely solid. It just runs, is
space-efficient, and very robust.

We almost went with Prometheus instead of Influx (as I said, early growing
pains), but I had just struggled through managing a central inventory and
hated it, so I really wanted a push rather than pull architecture. But from my
limited playing with it, Prometheus seemed solid.

~~~
rehevkor5
It's so much easier to write incorrect/misleading queries in influxql than in
promql. And you can't perform operations between two different series names in
influxdb, last I looked. That makes it impossible to do things like ratios or
percentages unless you have control over your metrics, and structure them the
way influx likes. Also, no support for calculating percentiles from Prometheus
histogram buckets.

~~~
mhall119
You can do these things (and much, much more) with the new Flux language

------
carlosdp
Gotta say though, having rolled grafana and prometheus and such on my own
plenty of times before, if you are a startup and can afford Datadog, use
Datadog.

~~~
dijit
I just priced out a Datadog deployment (based on your comment) and it would
cost me my yearly salary every month.

No thanks. :/

~~~
cs-szazz
Wait, you're saying it costs 12x as much as your salary? That doesn't seem
right... the company I'm at uses it pretty heavily and we're at about 1x of a
FTE salary (and it's still super worth it)

~~~
klohto
World’s salaries aren’t only US based :) Datadog for a smaller Western Europe
startup is going to cost more than their devs salaries.

~~~
user5994461
Western salaries are not that low, and the true cost to the employer is
typically double the perceived salary. That means we're talking tens of
thousands of euros, so thousands of hosts (list price is certainly negotiable
at this scale).

If they've got a thousand hosts, the cost of the infrastructure itself must
dwarf the salary of any developer by orders of magnitude; the salary of a
developer is simply irrelevant when it comes to acquiring software/hardware.

~~~
dijit
> true costs to the employer is typically double the perceived salary.

This is true, but my comment was an offhand way to say that my "salary" (as
in, the one on my contracts and the one I "see") is less than a month of
Datadog for our number of hosts.

As for the rest of your comment, I wish it were true.

Developer salaries outside of the capitals are quite low in Europe, and even
inside the capitals they only go to "near double".

So, instead of 12x it becomes 6x developer costs per annum, which is a fair
whack of money.

For me to justify spending "3-6" people's worth of money, it had better save
"3-6" people's worth of time.

~~~
user5994461
> For me to justify spending "3-6" people's worth of money, it had better save
> "3-6" people's worth of time.

Well, it does in my experience, especially if you have to handle 2000+ hosts.
That's some serious infra; it needs serious tooling.

May I ask which country it is?

~~~
aprdm
I think another thing to keep in mind is that hardware infrastructure size
doesn't necessarily equal profitability.

Some industries need a lot of hardware because they crunch a lot of data, but
they aren't software companies. Think computer graphics rendering.

Paying an FTE salary for software is crazy for them. I would love to see a
ratio of developers/infrastructure per industry/company.

------
pachico
I approached InfluxDB since it looked promising. It actually served its
purpose when things were simple, and Telegraf was indeed handy. Now that I
have more mature requirements I can't wait to move away from it. It freezes
frequently, its UI (Chronograf) is really rubbish, functions are very limited,
and managing continuous queries is tiresome.

I'm now having better results and a better experience storing data in
ClickHouse (yes, not a time-series DB).

From time to time I also follow what's coming in InfluxDB 2.0, but I must
confess that 16 betas in 8 months are not very promising.

It might just be me.

~~~
willvarfar
I have also had scalability and reliability issues with Influx. It's full of
silly limitations, like tagset cardinality and not being able to delete points
in a specific retention policy, etc. Am moving to a classic RDBMS and
TimescaleDB.

~~~
pachico
Indeed, cardinality limit is a very painful aspect which blocked us since day
one for certain metrics. I confess, at the time I didn't know any better. Now
I wouldn't recommend it under any circumstances.

------
overcast
Really, REALLY tried to love InfluxDB. But its system requirements,
performance, and features are poor compared to things like TimescaleDB.

~~~
julioo
What about storage? We are running InfluxDB and we are looking for an
alternative. But one area where Influx is good is storage.

~~~
k-rus
TimescaleDB provides community features such as compression, which saves a lot
of space, and continuous aggregates, which improve performance and save space
when used together with retention policies.

------
jakobdabo
I still can't find any alternative to the old, RRD-based Munin. It is so
simple. You want to add a new server to monitor? Just install the node part
there, enable any required additional plugins (just by creating a couple of
soft-links), add a one-line configuration to the main server with the new
node's IP address, and you are done.

Also, the aesthetics of the UX: you see all the graphs on one single page[1],
no additional clicks required. A quick glance with a slow scroll and you can
see if there were any unusual things during the last day/week.

[1] - publicly available example, found by googling -
[https://ansible.fr/munin/agate/agate/index.html](https://ansible.fr/munin/agate/agate/index.html)

~~~
mhall119
That's more steps than using InfluxDB and Telegraf.

Check out: [https://github.com/influxdata/community-
templates/tree/maste...](https://github.com/influxdata/community-
templates/tree/master/linux_system)

------
Papric0re
[Offtopic (a bit)] Lots of you are talking about metric monitoring. But do you
have recommendations when it comes to (basic) security monitoring? I would
usually go for the Elastic Stack for that purpose, especially because Kibana
offers lots of features for security monitoring. But I feel like these stacks
are so big and bloated. I basically need something to monitor network traffic
(flows and off-database retention of PCAPs) and save some security logs (I'm
not intending to alert based on logs, just retain them). But being able to
have a network overview and insight into current connections (including
history) is a very useful thing. Can anybody recommend something that's maybe
a bit lighter than an entire Elastic Stack?

~~~
floren
I think Gravwell ([https://gravwell.io](https://gravwell.io)) might be what
you're looking for--but I work for Gravwell so I may be biased! If I can be
forgiven a short sales pitch, we've built a format-agnostic storage & querying
system that easily handles raw packets, Netflow records (v5, v9, and IPFIX),
collectd data, Windows event logs, and more. You can see some screenshots at
[https://www.gravwell.io/technology](https://www.gravwell.io/technology)

We have a free tier which allows 2GB of data ingest per day (paid licenses are
unlimited) which should be more than enough for capturing logs and flows. The
resources needed to run Gravwell basically scale with how much data you put
into it, but it's a lot quicker to install and set up than something like
Elastic, in our opinion ([https://www.gravwell.io/blog/gravwell-installed-
in-2-minutes](https://www.gravwell.io/blog/gravwell-installed-in-2-minutes))

Edit: it's currently a bit roll-your-own, but we're really close to releasing
Gravwell 4.0 which enables pre-packaged "kits" containing dashboards, queries,
etc. for a variety of data types (Netflow, CoreDNS logs, and so on)

~~~
iudqnolq
When you say

> Gravwell is developed and maintained by engineers expert in security and
> obsessed with high performance. Therefore our codebase is 100% proprietary
> and does not rely on open source software. We love open source, but we love
> our customers and their peace of mind a lot more!

does that mean you've even rolled your own webserver? Programming language?

~~~
floren
That's... not good copy. I think it must have been written long ago. We use
open-source libraries (with compatible licenses, of course) and even maintain
our own set of open-source code
([https://github.com/gravwell](https://github.com/gravwell)). I'll talk to the
guys who maintain the website and get that fixed. Thanks for pointing it out!

Edit: We've had lots of people assume we use Elastic under the hood, so I
wonder if that was just a (poorly-worded) attempt to indicate that our _core
storage and querying_ code is custom rather than some existing open-source
solution.

~~~
Papric0re
Maybe you should just wipe that paragraph completely. I get that investors
like to see that you are using proprietary code, but I wouldn't expect you to
be faster because of that, especially when running against Elastic, which
currently has over 1,400 contributors. But you don't necessarily need to be.
You can win me over by being focused on the right thing and not bloating your
software. Lots of big projects start to lose focus and start doing everything,
and hence become worse at their main job.

Especially when it comes to security, I'd like to see the lowest complexity
possible. Harden your software instead of feature-fu __around. That would be a
good USP (I've got the feeling that no vendor has realized this so far - but
customers haven't either).

------
hnarn
Some people are probably going to throw some shade on me for saying this since
it's so out of fashion but in my mind, when it comes to some types of basic
monitoring (SNMP monitoring of switches/linux servers, disk space usage,
backups running and handling them when they don't) then Nagios does get the
job done. It's definitely olives and not candy[1] but it's stable, modular,
_relatively_ easy to configure (when you get it) and it just keeps chugging
away.

If anyone has any Nagios questions, I'd be happy to answer them. I'm a Nagios
masochist.

(Also, I can recommend Nagflux[2] for exporting performance data (metrics) to
InfluxDB, because no one should have to touch RRD files)

[1]: [https://lambdaisland.com/blog/2019-08-07-advice-to-
younger-s...](https://lambdaisland.com/blog/2019-08-07-advice-to-younger-self)

[2]:
[https://github.com/Griesbacher/nagflux](https://github.com/Griesbacher/nagflux)

~~~
BurritoAlPastor
Nagios is the Jenkins of monitoring. It's popular because you can get it
running in an afternoon, and it's easy to configure by hand.

It then rots within your infrastructure, because it resists being configured
any way _except_ by hand. I've built two systems for configuration-management
of Nagios (at different companies), and it's an unpleasant problem to solve.

Prometheus's metric format and query syntax are cool, but the real star of the
design is simply this: you don't have to restart it, or even change files on
your Prometheus server, when you add or remove servers from your environment.

~~~
znpy
I have to use an Icinga instance from time to time (Icinga is a Nagios fork).
I really can't see the value, beyond seeing if a service is up or down.

I'm surprised no one has mentioned Zabbix. Zabbix is way better. I haven't had
the chance to use Zabbix past 4.something, but it's worth it.

I've been using Prometheus/Grafana, and frankly the value I see is its
out-of-the-box adaptability at capturing a mutating data source (example:
metrics about ephemeral pods in a Kubernetes cluster).

~~~
hnarn
> I really can't see the value, beyond seeing if a service is up or down.

This is an extreme oversimplification. The value is not in "seeing" if
something is "up or down". The value is in the modularity of what a "service"
can mean in the first place (anything you can script -- and the ecosystem of
plugins is huge), the fact that you don't have to "see" it (because
notifications are extremely modular), the fact that escalations of issues can
happen automatically if they are not resolved, and the fact that event
handlers in many cases can help you resolve the issue automatically without
even having to raise an alert in the first place.

Nagios is a monitoring tool built with the UNIX philosophy in mind, and it's
ingenious in its simplicity: decide state based on script or binary exit
codes, relate dependencies between objects to avoid unnecessary
troubleshooting, notify if necessary (again, with scripts/binaries) and/or try
to resolve if configured. It hooks into a server frame of mind very well if
you're a sysadmin.

Sure, if your main use case is "mutating data sources" and collecting metrics,
no Nagios flavor will be for you, because that's not what Nagios is made to
do. There's a reason it's extremely popular in large enterprises: it was
created for them. No monitoring solution is for everyone and solves every
problem.

------
ddevault
I found the software in this stack to be very bloated and difficult to
maintain. Large, complicated software has a tendency to fall flat on its face
when something goes wrong, and this is a domain where reliability is
paramount. I wrote about a different approach (Prometheus + Alertmanager) for
sourcehut, if you're curious:

[https://sourcehut.org/blog/2020-07-03-how-we-monitor-our-
ser...](https://sourcehut.org/blog/2020-07-03-how-we-monitor-our-services/)

~~~
robotmay
I have a bit of a love/hate relationship with Prometheus. At home I really
like it; it was simple to set up for my needs and most of my configuration is
on my server which then scrapes other machines for the data. However I find it
quite frustrating at scale for work, both in its concepts (it's hard to
describe but it's sort of...backwards?) and in its query performance, although
that might be a side-effect of using it with Grafana and me attempting to
misuse it. By contrast I think the concepts of something like TimescaleDB are
easier to understand when it comes to scaling and optimising that service.

In my previous job I had a very clear use-case for not using Prometheus and
did for a while use InfluxDB (it involved devices sending data from behind
firewalls across many sites). I found it pretty expensive to scale and it fell
over when it ran out of storage, which feels like something that should have
been handled automatically considering it was a PaaS offering.

~~~
ddevault
One point of note for SourceHut's Prometheus use is that we generally don't
make dashboards. I don't really like Grafana. I will sometimes use gnuplot
with styx to plot graphs on an as-needed basis:

[https://github.com/go-pluto/styx](https://github.com/go-pluto/styx)

This is how I made the plots in that blog post.

~~~
dewey
Can't you generate the same kind of graphs you have there with the normal
Prometheus query explorer / web ui?

~~~
ddevault
On a basic level, yes, but I often just use it as a starting point for more
complex gnuplot graphs, or different kinds of visualizations - box plots,
histograms, etc.

------
aequitas
For those who still remember Graphite: the team over at Grafana Labs has been
maintaining Graphite-web and Carbon since 2017, and they are still in active
development, getting improvements and feature updates. It might not scale as
well as some of the other solutions, but for medium-size or homelab setups
it's still a nice solution if you don't like PromQL or InfluxQL.

[https://grafana.com/oss/graphite/](https://grafana.com/oss/graphite/)

~~~
lma21
For simple and fast monitoring solutions, I always opted for collectd +
graphite + grafana. In a containerized environment, it's so easy to deploy (0
configuration by default) and monitor a set of 50-100 nodes. Beyond that, disk
tuning and downsampling (pre-aggregation) rules become important.

------
wiradikusuma
There's also
[https://github.com/timberio/vector](https://github.com/timberio/vector)

~~~
victor106
+1 for Vector. We moved from Logstash to Vector and we couldn't be happier.
Logstash is awesome, but it's a memory hog.

With Vector and Toshi you can kinda (I am not sure Toshi is as mature as
Elastic) use them to replace Logstash and Elastic; the missing piece is Kibana.

~~~
sevagh
Have you looked into Grafana Loki for logs? If I had to redo one part of our
stack, that's what I'd choose.

------
x87678r
I love Prometheus. It's simple, and the built-in charts are enough without
having to use Grafana on top.

~~~
halfmatthalfcat
The question is how to do long-term storage, though. Something I've had a bit
of trouble reasoning about.

Right now all of my metrics are sitting in a PVC with a 30d retention period,
so we're probably fine, but for longer-term cold storage the options aren't
great unless you want to run a custom Postgres instance with the Timescale
plugin or something else more managed.
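The usual hook for this kind of setup is Prometheus's `remote_write`/`remote_read` configuration, which forwards samples to an external store. A sketch of the `prometheus.yml` fragment; the endpoint below is an assumption standing in for whatever adapter (e.g. the Timescale connector) you actually run:

```yaml
# prometheus.yml -- ship samples to long-term storage via remote write.
# The URL is hypothetical: point it at your adapter's write endpoint.
remote_write:
  - url: "http://localhost:9201/write"
remote_read:
  - url: "http://localhost:9201/read"
```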

~~~
x87678r
Do you really need long term? I hate throwing away data but realistically I
never really need old performance data. Some stats data is worth keeping but
you can extract a few important time series and store them elsewhere.

~~~
julioo
In any case, Prometheus throws away data if a scrape can't be done. As clearly
described on the website, Prometheus is not a metrics system. So Influx and
Prometheus are quite different.

~~~
julioo
Mistake in my previous message. I wanted to say that Prometheus is not a log
system. Metrics could be lost if there are scraping issues. This is OK in some
cases, but you have to know that you can lose data.

------
roland35
InfluxDB and Grafana worked great for us when I created a live monitoring
system for a fleet of prototype test robots. It was simple to set up new data
streams. We started with Graphite but switched to InfluxDB for its flexibility
(Grafana works with both!).

I would add to the guide that you need to be careful about formatting the
lines going into InfluxDB, because where you put the spaces and commas
determines what is indexed or not! Also, data types should be specific (i.e.
make sure you are setting integer vs float correctly).
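Since those spacing rules are easy to get wrong, here's a hedged sketch of building line protocol by hand (the helper name is made up, and real client libraries also escape commas, spaces, and quotes, which this skips): everything before the first space (measurement plus comma-separated tags) is indexed, the fields after it are not, and integers need a trailing `i`.

```python
def to_line(measurement, tags, fields, ts_ns):
    """Build one InfluxDB line protocol line (no escaping; sketch only)."""
    head = measurement
    if tags:  # tags are optional; each k=v pair becomes part of the index
        head += "," + ",".join(f"{k}={v}" for k, v in sorted(tags.items()))

    def fmt(v):
        if isinstance(v, bool):   # check bool before int: bool is an int subclass
            return "true" if v else "false"
        if isinstance(v, int):
            return f"{v}i"        # trailing "i" marks an integer field
        if isinstance(v, float):
            return repr(v)        # bare number is stored as a float
        return f'"{v}"'           # strings are quoted

    field_str = ",".join(f"{k}={fmt(v)}" for k, v in sorted(fields.items()))
    return f"{head} {field_str} {ts_ns}"

line = to_line("cpu", {"host": "web-1"}, {"usage": 12.5, "cores": 8},
               1_595_000_000_000_000_000)
print(line)  # cpu,host=web-1 cores=8i,usage=12.5 1595000000000000000
```

Note how moving a key from `tags` to `fields` silently changes whether it's indexed, which is exactly the trap described above.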

------
majkinetor
You can quickly do this on Windows:

    cinst influxdb1 /Service
    cinst grafana
    start $Env:ChocolateyInstall\lib\grafana\tools\grafana-*\bin\grafana-server.exe
    git clone https://github.com/majkinetor/psinflux
    import-module ./psinflux
    1..10 | % { $x = 10*$_ + (Get-Random 10); Send-Data "test1 value=$x"; sleep 1 }

------
rattray
I've only ever used third party monitoring tools, but hope to set up a startup
again soon and want to do OSS if I can.

Can anyone comment on Prometheus vs Timescale? What are the tradeoffs? Or
would I use Prometheus on top of Timescale?

~~~
k-rus
You can use Prometheus on top of TimescaleDB. Timescale builds a connector and
an entire workflow to run Prometheus on top of TimescaleDB, with Grafana
supported in a flexible way. Sorry for the promo :) Check for details in
[https://github.com/timescale/timescale-
prometheus](https://github.com/timescale/timescale-prometheus)

~~~
rattray
Thanks!

------
ecoqba11
We switched from InfluxDB to TimescaleDB for our IoT solutions. InfluxDB is
very difficult to work with for large datasets and enterprise/region
compliance. We ingest around 100MB of data per day and growing.

~~~
mhall119
That doesn't actually sound like a large dataset. Can you describe what kind
of problems you faced with InfluxDB?

------
second--shift
I've looked at these before, and I remember a few years ago when Grafana was
really starting to get big, but I guess I have a bona-fide question: Who
really needs this?

I manage a small homelab infra, but also an enterprise infra at work with
>1,000 endpoints to monitor, and I/we use simple shell scripts, text files,
and rsync/ssh. We monitor cpu load, network load, disk/io load, all the good
stuff basically. The monitor server is just a DO droplet and our collectors
require zero overhead.

The specs list and setup costs in time and complexity are steep with a Grafana
stack - is there any value besides just the visual? I know they have the
ability to do all manner of custom plugins, dashboards, etc, but if you just
care about the good stuff (uptime+performance), what does Grafana give you
that rsync'ing sar data can't?

PS: we have a graphical parser of the data written using python and
matplotlib. very lightweight, and we also get pretty graphs to print and give
to upstairs.

~~~
gjulianm
I'm not experienced with the CollectD stack, but I use Prometheus + Grafana to
monitor probes. My two cents:

- Fairly lightweight. Prometheus deals with quite a lot of series without
much memory or CPU usage.

- Integration with a lot of applications. Prometheus lets me monitor not only
the system, but other applications such as Elastic, Nginx, PostgreSQL, network
drivers... Sometimes I need an extra exporter, but they tend to be very light
on resources. Also, with mtail (which is again super lightweight) I can
convert logs to metrics with simple regexes.

- Number of metrics. For instance, several times I needed to diagnose an
outage and needed a metric that I hadn't thought about, and it turned out that
the exporter I was using did actually collect it; I just hadn't included it in
the dashboard. As an example, the default node exporter has very detailed I/O
metrics, systemd collectors, network metrics... They're quite useful.

- Metric correctness. Prometheus appears to be at least decent at dealing
with rate calculations and counter resets. Other monitoring systems are worse,
and it wasn't unusual to find a 20000% CPU usage alert due to a counter reset.

- Alerts. Prometheus can generate alerts with quite a lot of flexibility, and
the Alertmanager is a pretty nice router for those alerts (e.g., I can receive
all alerts in a separate mail, but critical alerts are also sent to a Slack
channel).

- Community support. It seems the community is adopting the Prometheus format
for exposing metrics, and there are packages for Python, Go, and probably more
languages. Also, the people who make the exporters tend to also make
dashboards, so you almost always have a starting point that you can fine-tune
later.

- Ease of setup. It's just YAML files. I have an Ansible role for automation,
but you can get by with just installing one or two packages on clients and
adding a line to a configuration file on the Prometheus master node.

- Ease of use. It's incredibly easy to make new graphs and dashboards with
Prometheus and Grafana, no matter if they're simple or complex.

For me, the main points that make me use Prometheus (or any other monitoring
setup above simple scripts) are alerting and the amount of metrics. If you
just need to monitor CPU load and simple stats, maybe Prometheus is too much,
but it's not that hard to set up anyway.
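For a sense of what that Prometheus exposition format looks like, here is a tiny hand-rolled sketch (the metric name is just an example; real services should use the official client libraries, which also handle escaping and content negotiation):

```python
def render_metric(name, help_text, mtype, samples):
    """Render one metric family in the Prometheus text exposition format.

    samples: list of (label_dict, value) pairs.
    """
    out = [f"# HELP {name} {help_text}", f"# TYPE {name} {mtype}"]
    for labels, value in samples:
        if labels:
            label_str = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
            out.append(f"{name}{{{label_str}}} {value}")
        else:
            out.append(f"{name} {value}")
    return "\n".join(out)

text = render_metric(
    "node_cpu_seconds_total",
    "Seconds the CPUs spent in each mode.",
    "counter",
    [({"cpu": "0", "mode": "idle"}, 123.45),
     ({"cpu": "0", "mode": "user"}, 6.78)],
)
print(text)
```

The format is plain text over HTTP, which is a big part of why so many exporters and client libraries exist.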

~~~
crecker
Author here. I'll probably write another tutorial focusing on Prometheus
instead of CollectD.

Thanks for the suggestion!

SerHack

~~~
rckoepke
It would be wonderful if you included limitations as well, to help people make
the right decisions for their tech stack. I've been playing around with
Prometheus lately for environmental monitoring, and long-term retention is
particularly important to me.

During proof-of-concept testing, some historical data on disk perhaps wasn't
lost per se, but definitely failed to load on restart. I haven't worked hard
to replicate this but there are some similar unsolved tickets out there.

Additional traps for new players include customizing
--storage.tsdb.retention.time and related parameters.

~~~
crecker
Thank you!

------
abhishekjha
This is exactly what I have been planning to do for monitoring my two
Raspberry Pis. I am still debating the metrics collector though. My workplace
uses monitord for AINT and Telegraf for Wavefront. I have no idea how well
collectord works.

------
zytek
VictoriaMetrics eats other TSDBs for breakfast.

PromQL support (with extensions) and clustered / HA mode. Great storage
efficiency. Plays well for monitoring multiple k8s clusters, works great with
Grafana, pretty easily deployed on k8s.

No affiliation, just a happy user.

~~~
mekster
I just don't get why VictoriaMetrics doesn't get more visibility.

Maybe they need a PR person.

~~~
valyala
Absolutely! We are working on this.

------
metalliqaz
I use Telegraf, InfluxDB, and Grafana for my home network, in addition to
monitoring my weather station data. It works incredibly well and is so simple
compared to the "old" suite of tools such as RRD and Cacti.

------
PanosJee
Too many parts. Netdata will make it a breeze.

~~~
linsomniac
Netdata is pretty slick.

------
viraptor
Nothing against choosing this set of apps really, but I'm curious why collectd
and not telegraf which does the same kind of metrics probes and is a part of
the TICK stack.

~~~
bbrks
As somebody new to both, are you able to elaborate on what telegraf/TICK
provides that the collection of programs in the post does not?

I see lots of posts complaining about the stack in the post, but all of the
alternatives posed don't really explain why.

~~~
viraptor
Telegraf and collectd do roughly the same thing. They run some plugins to get
data and push the results to a given metrics sink. I asked because TICK
(Telegraf, InfluxDB, Chronograf and Kapacitor) is a known solution and a
fairly standard way to add elements of monitoring to your system.

~~~
jimktrains2
> TICK (Telegraf, InfluxDB, Chronograf and Kapacitor) is a known solution and
> a fairly standard way to add elements of monitoring to your system.

Amusing that I've never heard of any of this but have heard and used collectd.

It's obviously nowhere near as common as a LAMP stack, or anywhere near common
at all, so asking "why this over something else" is answered by "someone made
it up, so it's better".

~~~
viraptor
Collectd is only one part here. Did you need a solution for inline
processing/aggregation and alerting? If not, you wouldn't run into TICK. It's
not common overall, because few environments need to go that far.

I don't get the comparison to LAMP popularity. Insects are more common than
cars too. They're different things ¯\\_(ツ)_/¯

~~~
jimktrains2
I mentioned LAMP as an example of an acronym and stack where it would be more
common to ask why it wasn't chosen. In retrospect that probably wasn't the
ideal way of getting to my point: TICK isn't a common stack, so asking "why
not TICK, because it's common" seems weird.

> If not, you wouldn't run into TICK. It's not common overall, because few
> environments need to go that far.

That's my point though. The post I replied to acted as though everyone knows
of and uses it, but provided no information on what makes it a better choice
for such use cases.

------
mickeyben
We're currently using InfluxDB, but maintenance is something we'd like to stop
doing on our monitoring stack.

Datadog is too expensive because of the number of hosts we have. So we're
thinking of eventually going to a hosted InfluxDB setup.

But we also want to revisit other hosted solutions. Does someone have
experience with using CloudWatch + Grafana? I used CloudWatch many years ago
and it was clearly subpar to something like Influx. Is it better nowadays?

------
tgtweak
Looking for a similar post on APM using OpenTracing!

------
Aaronstotle
I recognized the name almost immediately, cool to see a blog post from a
member of the monero community.

~~~
crecker
haha thanks!

------
aprdm
As mentioned by others, I have found a Prometheus + Grafana setup to be pretty
straightforward to install and maintain. It also aligns with the Cloud Native
Computing Foundation, which means that this stack easily integrates with a
bunch of the OSS stack used by cloud-native companies.

------
burtonator
I've done this in my company with about 100 servers using KairosDB + Grafana
and CollectD...

[https://kairosdb.github.io/](https://kairosdb.github.io/)

If I were to do it again I'd probably use Grafana and Prometheus

------
Kednicma
I'm curious why they chose CollectD and not Prometheus or another pull-based
("scraping") solution. Pulling is more compositional than pushing and requires
fewer machines to be touched when monitoring configuration is changed.
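As a toy illustration of that pull model (the metric line and endpoint are hypothetical, and this is stdlib Python rather than Prometheus itself): each node just exposes text over HTTP, and the monitoring server owns the target list, so adding or removing a node only changes the scraper's configuration.

```python
import http.server
import threading
import urllib.request

METRICS = b"node_load1 0.42\n"  # made-up sample in Prometheus-style text

class MetricsHandler(http.server.BaseHTTPRequestHandler):
    def do_GET(self):
        # Every node serves its current state on demand.
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.end_headers()
        self.wfile.write(METRICS)

    def log_message(self, *args):  # keep the demo quiet
        pass

# "Node" side: bind to an ephemeral port and serve in the background.
server = http.server.HTTPServer(("127.0.0.1", 0), MetricsHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()

# "Monitoring server" side: pull from the node on its own schedule.
url = f"http://127.0.0.1:{server.server_port}/metrics"
scraped = urllib.request.urlopen(url).read()
print(scraped.decode())
server.shutdown()
```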

~~~
viraptor
Probably due to the size of deployment. I'd go to Prometheus in a
larger/dynamic environment for flexibility, but I'm happy using telegraf on a
home network where I configure things by hand when I'm logged into each pet
machine.

------
mavam
Does anyone have experience with an InfluxDB-based netdata setup?

When we evaluated multiple options, we found that netdata does a lot of the
heavy lifting. What are the advantages of a CollectD-based setup?

~~~
zacksh
Well, that's the beauty of Netdata: no need to configure collectors. As for
InfluxDB integration, it's fairly easy to export from Netdata using what
InfluxDB calls the line protocol format (the "plaintext interface").

[https://learn.netdata.cloud/docs/agent/exporting](https://learn.netdata.cloud/docs/agent/exporting)

------
sahoo
How does this compare to Grafana, Prometheus, and node monitor?

------
DevKoala
I’d replace InfluxDB with Prometheus. InfluxDB has been a PITA for me.

------
9thbyte
This seems like a lot of work when you could use Zabbix.

¯\\_(ツ)_/¯

