Monitoring your own infrastructure using Grafana, InfluxDB, and CollectD (serhack.me)
385 points by serhack_ on July 21, 2020 | 292 comments

Echoing the sentiment expressed by others here, for a scalable time-series database that continues to invest in its community and plays well with others, please check out TimescaleDB.

We (I work at TimescaleDB) recently announced that multi-node TimescaleDB will be available for free, specifically as a way to keep investing in our community: https://blog.timescale.com/blog/multi-node-petabyte-scale-ti...

Today TimescaleDB outperforms InfluxDB across almost all dimensions (credit goes to our database team!), especially for high-cardinality workloads: https://blog.timescale.com/blog/timescaledb-vs-influxdb-for-...

TimescaleDB also works with Grafana, Prometheus, Telegraf, Kafka, Apache Spark, Tableau, Django, Rails, anything that speaks SQL...

Just wanted to say I am super impressed with the work TimescaleDB has been doing.

Previously at NGINX I was part of a team that built out a sharded timeseries database using Postgres 9.4. When I left it was ingesting ~2 TB worth of monitoring data a day (so not super large, but not trivial either).

Currently I have built out a data warehouse using Postgres 11 and Citus. Only reason I didn't use TimescaleDB was lack of multi-node support in October of last year.

I sort of view TimescaleDB as the next evolution of this style of Postgres scaling. I think in a year or so I will be very seriously looking at migrating to TimescaleDB, but for now Citus is adequate (with some rough edges) for our needs.

If you're looking at scaling monitoring timeseries data, you may also want to consider a more availability-leaning architecture (in the CAP theorem sense) with respect to replication (i.e. quorum write/read replication, strictly not a leader/follower or active/passive architecture). In that case you might also want to check out the Apache 2 licensed project M3 and M3DB at m3db.io.

I am biased obviously as a contributor. Having said that I think it's always worth understanding active/passive type replication and the implications and see how other solutions handle this scaling and reliability problem to better understand the underlying challenges that will be faced with instance upgrades, failover and failures in a cluster.

Neat! Hadn't heard of M3DB before, but cursory poke around the docs seems like it's a pretty solid solution/approach.

My current use case isn't monitoring, or even time series anymore, but will keep M3DB in mind next time I have to seriously push a time series/monitoring solution.

Did you try ClickHouse? [1]

We were successfully ingesting hundreds of billions of ad serving events per day to it. It is much faster at query speed than any Postgres-based database (for instance, it may scan tens of billions of rows per second on a single node). And it scales to many nodes.

While it is possible to store monitoring data in ClickHouse, it may be non-trivial to set up. So we decided to create VictoriaMetrics [2]. It is built on design ideas from ClickHouse, so it offers high performance in addition to ease of setup and operation. This is backed by publicly available case studies [3].

[1] https://clickhouse.tech/

[2] https://github.com/VictoriaMetrics/VictoriaMetrics/

[3] https://victoriametrics.github.io/CaseStudies.html

ClickHouse's initial release was circa 2016 IIRC. The work I was doing at NGINX predates ClickHouse's initial release by 1-2 years.

ClickHouse was certainly something we evaluated later on when we were looking at moving to a true columnar storage approach, but like most columnar systems there are trade-offs.

* Partial SQL support.

* No transactions (not ACID).

* Certain workloads are less efficient, like updates and deletes, or single-key lookups.

None of these are unique to ClickHouse; they are fairly well-known trade-offs most columnar stores make to improve write throughput and prioritize highly scalable sequential read performance. As I mentioned before, the amount of data we were ingesting never really reached the limits of even Postgres 9.4, so we didn't feel like we had to make those trade-offs...yet.

I would imagine that servicing ad events is several factors larger scale than we were dealing with.

I've been using InfluxDB, but I'm not satisfied with the limited InfluxQL or the over-complicated Flux query language. I love Postgres, so TimescaleDB looks awesome.

The main issue I've got is how to actually get data into TimescaleDB. We use telegraf right now, but the telegraf Postgres output pull request still hasn't been merged: https://github.com/influxdata/telegraf/pull/3428

Any progress on this?

There is a telegraf binary available here that connects to TimescaleDB: https://docs.timescale.com/latest/tutorials/telegraf-output-...

If you are looking to migrate data, then you might also want to explore this tool: https://www.outfluxdata.com/

I believe PromQL [1] and MetricsQL [2] are much better suited for typical queries over time series data than SQL, Flux or InfluxQL.

[1] https://medium.com/@valyala/promql-tutorial-for-beginners-9a...

[2] https://victoriametrics.github.io/MetricsQL.html

While influx is pretty bad overall it's super simple to deploy and configure unlike timescale.

It's the main reason we decided to use influx in our small team with simple enough timeseries needs

Curious what you found difficult to deploy/ configure? Is this in a self-managed context?

I was testing out both TimescaleDB/InfluxDB recently, including maybe using with Prometheus/Grafana. I was leaning towards Timescale, but InfluxDB was indeed a lot easier to quickly boot a "batteries included" setup and start working with live data.

I eventually spent a while reading about Timescale 1 vs 2 and testing the pg_prometheus[1] adapter. I started thinking through integrating its schema with our other needs, then realized it's "sunsetted", and then read about the new timescale-prometheus[2] adapter and its ongoing design doc[3], which has an updated schema that I'm less a fan of.

I finally wound up mostly settling on Timescale, although I've put the Prometheus extension question on hold; just pulling in metrics data and outputting with ChartJS and some basic queries got me a lot closer to done for now. Our use case may be a little odd regardless, but I think a timescale-prometheus extension with some ability to customize how the data is persisted would be quite useful.

[1] https://github.com/timescale/pg_prometheus

[2] https://github.com/timescale/timescale-prometheus

[3] https://docs.google.com/document/d/1e3mAN3eHUpQ2JHDvnmkmn_9r...

(TimescaleDB engineer) Really curious about finding out more about your reservations about the updated schema. All criticisms welcome.

Thanks, as I say our use case may be too odd to be worth supporting, but effectively we're trying to add a basic metrics (prom/ad-hoc) feature to an existing product (using Postgres) with an existing sort of opinionated "ORM"/toolkit for inserting/querying data.

Because of that and the small scale required, the choice of table-per-metric would be a tough fit, and I think a single table with JSONB and maybe some partial indexes is going to work a lot better for us. It would just be nice if we could somehow code in our schema mapping and use the supported extension, but I get that it may be too baked into the implementation.

Anyway, overall we're quite happy with TimescaleDB!

I see, that makes a whole lot of sense. This is an interesting use-case. It actually may be possible to use some of the aggregates provided by the extension even with a different schema. If you are interested in exploring further, my username on slack is `mat` and my email is the same @timescale.com

I found this to be a funny, subtle insult. :-)

"Why is it difficult, because you're self managed?"

I think it's an honest question. Maybe they want to find out if setting up needs a bit of work or simplifying.

No intention to insult - just trying to get more context :)

I know - I just read it as such and chuckled.

What's wrong with influx? I use it and like it, albeit for hobby-level projects.

Influx is so good for hobby-level projects. So good!!

For serious applications, it doesn't cut it. Trying to do more than a few TB a day is a waste of time outside of enterprise, which ain't cheap.

I plopped VictoriaMetrics in place of Influx for my cases and haven't even had a single hiccup.

As with any technology, the right tool for the job is highly dependent on what the job is.

InfluxDB may have fairly large RAM requirements when working with a high number of time series [1].

[1] https://medium.com/@valyala/insert-benchmarks-with-inch-infl...

If you want the best of both worlds, then try VictoriaMetrics. It is simple to deploy and operate, and it is more resource-efficient compared to InfluxDB. Moreover, it supports data ingestion over Influx line protocol [1].

[1] https://victoriametrics.github.io/#how-to-send-data-from-inf...
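For anyone who hasn't looked at Influx line protocol itself: each point is one text line of the form `measurement,tag=value field=value timestamp`. Below is a minimal sketch of a formatter, assuming the commonly documented escaping rules (commas, spaces and equals signs are backslash-escaped in keys and tag values, integers carry an `i` suffix, strings are double-quoted). The helper names `format_point`, `_escape` and `_field_value` are made up for illustration and are not part of any client library.

```python
def _escape(text: str, chars: str) -> str:
    """Backslash-escape every character from `chars` that occurs in `text`."""
    for ch in chars:
        text = text.replace(ch, "\\" + ch)
    return text


def _field_value(value) -> str:
    """Render a field value using line protocol type conventions."""
    if isinstance(value, bool):  # bool is a subclass of int, so check it first
        return "true" if value else "false"
    if isinstance(value, int):
        return f"{value}i"  # integers are marked with a trailing "i"
    if isinstance(value, float):
        return repr(value)
    escaped = str(value).replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'  # everything else is sent as a quoted string


def format_point(measurement: str, tags: dict, fields: dict, timestamp_ns: int) -> str:
    """Build one line: measurement,tag=value,... field=value,... timestamp."""
    if not fields:
        raise ValueError("line protocol requires at least one field")
    head = _escape(measurement, ", ")
    for key in sorted(tags):  # sorted tag keys keep the output deterministic
        head += "," + _escape(key, ",= ") + "=" + _escape(str(tags[key]), ",= ")
    body = ",".join(
        _escape(key, ",= ") + "=" + _field_value(value)
        for key, value in fields.items()
    )
    return f"{head} {body} {timestamp_ns}"


line = format_point(
    "cpu",
    {"host": "web 1", "dc": "eu"},
    {"usage": 0.5, "cores": 4},
    1595289600000000000,
)
print(line)  # cpu,dc=eu,host=web\ 1 usage=0.5,cores=4i 1595289600000000000
```

Lines like this would then be POSTed to whatever Influx-compatible write endpoint the server exposes; the exact URL depends on the deployment, so it is left out here.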

For small and medium needs, InfluxDB is a delight to use. On top of that, the data exploration tool within Chronograf is one-of-a-kind when it comes to exploring your metrics.

If you are looking for a vendor to host, manage, and provide attentive engineering support, check out my company: https://HostedMetrics.com

One of the biggest quirks that I had bumped up against with TimescaleDB is that it's backed by a relational database.

We are a company that ingests around 90M datapoints per minute across an engineering org of around 4,000 developers. How do we scale a timeseries solution that requires an upfront schema to be defined? What if a developer wants to add a new dimension to their metrics, would that require us to perform an online table migration? Does using JSONB as a field type allow for all the desirable properties that a first-class column would?

90M datapoints per minute means 90M/60=1.5M datapoints per second. Such amounts of data may be easily handled by specialized time series databases even in a single-node setup [1].

> What if a developer wants to add a new dimension to their metrics, would that require us to perform an online table migration?

Specialized time series databases usually don't require defining any schema upfront - just ingest metrics with new dimensions (labels) whenever you wish. I'm unsure whether this works with TimescaleDB.

[1] https://victoriametrics.github.io/CaseStudies.html

Can you refute the claim that TimescaleDB uses a huge amount of disk space, 50 times more than other time series databases?

I see no reason to use it over VictoriaMetrics.


Yes, I can refute that claim. TimescaleDB now provides built in compression.


You didn't refute it technically.

Are you saying the compression shrinks the data down to 2% on average?

If the compression only makes the data 10 times smaller (I think I'm being generous with that ratio), it's still 5 times larger than the others.

Completely anecdotal, but I went from 50B to 2B per point using the compression feature. Mind you this is for slow moving sensor data collected at regular 1 second intervals.
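To put the ratios from this sub-thread side by side, here is a small worked calculation. It only uses numbers quoted above (the "50 times larger" claim, the hypothetical 10x compression, and the anecdotal 50 bytes to 2 bytes per point); none of it is a measurement, and the variable names are just for illustration.

```python
claimed_overhead = 50.0        # "50 times larger" than other databases (the claim)
hypothetical_ratio = 10.0      # the "generous" 10x compression assumed in the reply
anecdotal_ratio = 50 / 2       # 50 bytes/point -> 2 bytes/point, from the anecdote


def remaining_overhead(overhead: float, compression_ratio: float) -> float:
    """How many times larger the data still is after compression."""
    return overhead / compression_ratio


print(remaining_overhead(claimed_overhead, hypothetical_ratio))  # 5.0
print(anecdotal_ratio)                                           # 25.0
print(remaining_overhead(claimed_overhead, anecdotal_ratio))     # 2.0

# Compression needed to fully cancel a 50x overhead: 50x, i.e. 1/50 = 2% of the original size.
print(1 / claimed_overhead)                                      # 0.02
```

So if both the 50x claim and the 25x anecdote held for the same workload, a 2x gap would remain. Whether either figure applies to a given dataset is a separate question.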

Prior to the compression feature, I had the same complaint. Timescale strongly advocated for using ZFS disk compression if compression was really required. Requiring ZFS disk compression wasn't feasible for me.

I don't consider TimescaleDB to be a serious contender as long as I need a 2000 line script to install functions and views to have something essential for time-series data like dimensions:



Because TimescaleDB is a relational database for general-purpose time-series data, it can be used in a wide variety of settings.

That configuration you cite isn’t for the core TimescaleDB time-series database or internals, but to add a specific configuration and setup optimized for Prometheus data, so that users don’t need to think about it and TimescaleDB can work with the specific Prometheus data model right out of the box.

Those “scripts” automatically setup a deployment that is optimized specifically to have TimescaleDB serve as a remote backend for Prometheus without making users think about anything. Some databases would need to write a lot of specific internal code for this; TimescaleDB can enable this flexibility through configuring proper schemas, indexes, and views instead (in addition to some specific optimizations we've done to support PromQL pushdown optimizations in-database written in Rust).

This setup also isn't something that users think about. Users install our Prometheus stack with a simple helm command.

$ helm install timescale/timescale-observability

The Timescale-Prometheus connector (written in Go) sets this all up automatically and transparently to users. All you are seeing is the result of the system being open-source rather than having various close-source configuration details :)

- https://github.com/timescale/timescale-observability

- https://github.com/timescale/timescale-prometheus

(TimescaleDB engineer here). This is a really unfair critique.

The project you cite is super-optimized for the Prometheus use-case and data model. TimescaleDB beats InfluxDB on performance even without these optimizations. It's also not possible to optimize in this way in most other time-series databases.

These scripts also work hard to give users a UI/UX experience that mimics PromQL in a lot of ways. This isn't necessary for most projects and is a very specialized use-case. That takes up a lot of the 2000 lines you are talking about.

> TimescaleDB beats InfluxDB on performance even without these optimization

Would you mind sharing the schema used for this comparison? Maybe I missed it in your documentation of use-cases. When implementing dynamic tags in my own model, my tests showed that your approach is very necessary.

You can look at what we use in our benchmarking tool https://github.com/timescale/tsbs (results described here https://blog.timescale.com/blog/timescaledb-vs-influxdb-for-...).

Pretty much it's a table with time, value, tags_id. Where the tags table is id, jsonb
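For readers who want to see what that shape looks like in practice, here is a minimal sketch of the two-table layout (a metrics table with time, value, tags_id, plus a tags table with id and a JSON blob). It uses SQLite from the Python standard library as a stand-in so it runs anywhere; in Postgres the `labels` column would be `jsonb` and `time` a `timestamptz`. The table names, column names and helpers (`tags_id`, `insert`, `query`) are assumptions for illustration and are not the actual TSBS schema.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tags (
        id     INTEGER PRIMARY KEY,
        labels TEXT NOT NULL UNIQUE      -- JSON text here; jsonb in Postgres
    );
    CREATE TABLE metrics (
        time    INTEGER NOT NULL,        -- Unix seconds here; timestamptz in Postgres
        value   REAL NOT NULL,
        tags_id INTEGER NOT NULL REFERENCES tags(id)
    );
    CREATE INDEX metrics_tags_time ON metrics (tags_id, time);
""")


def tags_id(labels: dict) -> int:
    """Return the id for a label set, creating the tags row on first use."""
    key = json.dumps(labels, sort_keys=True)  # canonical form so equal sets match
    conn.execute("INSERT OR IGNORE INTO tags (labels) VALUES (?)", (key,))
    return conn.execute("SELECT id FROM tags WHERE labels = ?", (key,)).fetchone()[0]


def insert(time: int, value: float, labels: dict) -> None:
    """Store one data point; the metrics row only carries the small integer id."""
    conn.execute("INSERT INTO metrics VALUES (?, ?, ?)", (time, value, tags_id(labels)))


def query(label: str, wanted: str) -> list:
    """Two-step lookup: resolve matching tag ids first, then scan metrics by id."""
    ids = [
        row_id
        for row_id, raw in conn.execute("SELECT id, labels FROM tags")
        if json.loads(raw).get(label) == wanted
    ]
    if not ids:
        return []
    marks = ",".join("?" * len(ids))
    return conn.execute(
        f"SELECT time, value FROM metrics WHERE tags_id IN ({marks}) ORDER BY time",
        ids,
    ).fetchall()


insert(1, 0.5, {"host": "a", "dc": "eu"})
insert(2, 0.7, {"host": "a", "dc": "eu"})
insert(1, 0.1, {"host": "b", "dc": "us"})
print(query("host", "a"))  # [(1, 0.5), (2, 0.7)]
```

The point of the split is that the (possibly large) label blob is stored once per series rather than once per data point, and the hot table is indexed on a plain integer.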

>The project you cite is super-optimized for the Prometheus use-case and data model.

That may be true, but: Instead of figuring out how to meet the user's needs you're going to say the user is wrong?

That's not what he said and is kinda rude to be honest.

What I took them to be saying is that they optimize for extensibility (not just one use-case) and that those 2000 lines do a lot of things.

Including replicating optimizations that are not easy to achieve out of the box in their and other solutions. But they mostly found a way ;)

It's about as rude as telling someone their criticism is misplaced in that way. You've added more words to why you believe that, but that doesn't change the original sentiment.

You can say you don't want to support that usecase but someone inquired about whether you would. You said no, they're wrong to compare the two. Sounds a lot like "you're holding it wrong."

(Not affiliated with TimescaleDB, just trying to understand your critique)

TimescaleDB is a Postgres extension, which means they have to play by PG's rules, which rather implies a complicated mesh of functions, views, and "internal-use-only" tables.

But once they're there, you can pretty much pretend they don't exist (until they break, of course, but this is true for everything in your software stack).

Is your complaint targeting the size of these scripts, or their contents? Because everything in there appears pretty reasonable to me.

My problem is what these scripts do. When working with timeseries you want to be able to filter on dimensions/tags/labels.

TimescaleDB doesn't support this out of the box, the reason being Postgres' limitations on multi column indexes when you have data types like jsonb. What they have to do to work around this is to have a mapping table from dimensions/tags/labels which are stored in a jsonb field to integers which they then use to look the data up in the metrics table.

I would expect that a timeseries database is optimized for this type of lookup without hacks involving two tables.

(TimescaleDB engineer) Actually, the reason we break out the Jsonb like this is because json blobs are big and take up a lot of space and so we normalize it out. This is the same thing that InfluxDB or Prometheus do under the hood. We also do this under the hood for Timescale-Prometheus.

If you have a data model that is slightly more fixed than Prometheus, you'd create a column per tag and have a foreign key to the labels table. That's a pretty simple schema which would avoid a whole lot of complexity in here. So if you are developing most apps, filtering on dimensions/tags/labels isn't a problem. Again, this is only an issue if you have very dynamic user-defined tags, as Prometheus does. In which case you can just use Timescale-Prometheus.

I'll echo the sibling comment by the TDB engineer, having read and contributed to similar systems myself.

Most of these custom databases (Influx) use techniques similar to TDB's under the hood, in order to persist the information in a read/write-optimized way; they just don't "expose" it as clearly to the end-user, because they aren't offering a relational database.

A critical difference between the two, is that custom databases don't benefit from the fine-tuned performance characteristics of Postgres. So even though they're doing the same thing, they'll generally be doing it slower.

Which kinda confirms the parent’s issues... InfluxDB is really easy to deploy and forget. With TimescaleDB you should be ready to know ins-and-outs of PG to secure and maintain correctly. Sure, for scaling and high loads TDB might be good but InfluxDB is easier and suitable for most loads and maintainability.

> With TimescaleDB you should be ready to know ins-and-outs of PG to secure and maintain correctly.

Gotcha, this makes sense. To this, I'd pose the question: is the same not true for Influx? (i.e. With IFDB, you should be ready to know its ins and outs to secure and maintain it correctly). I guess I think about choosing a database like buying a house. I want it to be as good in X years as it is today, maybe better.

From this perspective, PG has been around for a long time, and will continue to be around for a long time. When it comes time to make tweaks to your 5 year old database system, it will be easier to find engineers with PG experience than Influx experience. Not to mention all of the security flaws that have been uncovered and solved by the PG community, that will need to be retrodden by InfluxDB etc.

Anyways, it's just an opinion, and it's good to have a diversity of them. FWIW, I think your perspective is totally valid. It's always interesting to find points where reasonable people differ.

Conversely, because of its larger scope and longer history, there are a lot more ins and outs for PG.

Think of InfluxDB as "move-in ready" when buying a house. Sure, you won't know the electrical and plumbing as well as you would with a "fixer-upper" in the long run, but you'll have a place to sleep a lot faster and easier.

I think it's a good analogy; very reasonable people come to different conclusions on this problem IRL too.

I prefer older houses (at least pre-80s). They have problems that are very predictable and minor. If they had major problems, they would show them quite clearly in 2020.

My friends prefer newer properties, but this is an act of faith, that the builders today have built something that won't have severe structural/electrical/plumbing problems in 20 years.

Of course, if you only plan on living some place for 3-5 years, none of this matters; you want to live where you're comfortable.

There may be some extra install steps with Timescale, but that is balanced by having data in a platform accessible by a very well known API.

There are always tradeoffs, but in cases like mine, I find the basis of the platform - PostgreSQL - to provide a very useful set of features.


The linked scripts are not part of the base installation yet TimescaleDB claims, for example, Prometheus support.

If you design your own schema and want to filter on such a dynamic tag field you have to replicate what they have implemented in these scripts. If you look at the code you'll see that this isn't exactly easy and other products do this out of the box.

I use TimescaleDB and have suffered from this, otherwise I wouldn't be talking here. Another problem is that these scripts are not compatible with all versions of TimescaleDB so you have to be careful when updating either one.

I’m really sad to see this kind of tone more and more in HN comments...

I find it sadder that people talk such glaring nonsense and are taken seriously. Extremely sad.

IMO, the HN guidelines are pretty clear and reasonable regarding these situations.

> Be kind. Don't be snarky. Have curious conversation; don't cross-examine. Comments should get more thoughtful and substantive, not less, as a topic gets more divisive.

> Please respond to the strongest plausible interpretation of what someone says, not a weaker one that's easier to criticize. Assume good faith.

The original poster was making an assertion that TDB is too complex. As the guidelines suggest, it doesn't help the conversation to assume that they arrived upon this conclusion unreasonably, and that they're irrational or otherwise unwilling to have an open conversation about it.

We have to assume that they are open to a discussion, despite whatever we might interpret as evidence to the contrary.

This thread is about infra monitoring, where any of the potential performance differences probably don't matter at all. If it does, show us.

I don't work for any time series database provider.

Why shouldn't performance matter in this case? The more instances, the more timeseries and the more datapoints you have, the more you care about speed when you need to query it to visualize, for example, WoW changes in, say, memory usage.

Consider the case at hand, involving streaming charts and alerts. There will be zero perceptible difference in the streaming charts regardless of what database is used. Alerts won't trigger for whatever millisecond difference there may be, and I don't think that this matters to any developer or manager awaiting the 3am call.

Time series database performance is not only about the reads.

Obligatory heads-up: parts of TimescaleDB (including the multi-node feature) come with no right-to-repair. See https://news.ycombinator.com/item?id=23274509 for more details.

We are currently working on revising this to make the license more open. Stay tuned :-).

(If you want to provide any early feedback, please email me at ajay (at) timescale.com)

TimescaleDB and my team is using it, but one significant drawback compared to solutions like Prometheus are the limitations of continuous aggregations (basically no joins, no order by, no window functions). That’s a problem when you want to consolidate old data.

(TimescaleDB engineer) we hear you and are working on making continuous aggregations easier to use.

For now, the recommended approach is to create continuous aggregates on single tables and perform joins, ORDER BY, and window functions when querying the materialized aggregate rather than when materializing.

This often has the added benefit of making the materialization more general so that a wider range of queries can use it.

Thanks for the advice! Makes sense. We are doing something similar (aggregate on a single table and join later). But we're still looking for a solution to compute aggregated increments when there are counter resets.
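One way this could be approached, sketched in plain Python rather than SQL: keep a small per-bucket summary (first value, last value, reset-aware increase inside the bucket), then stitch the buckets together at query time. A drop in the counter is treated as a reset to zero, which is the usual convention for monotonic counters. The names `BucketSummary`, `summarize` and `total_increase` are hypothetical, and this is not something TimescaleDB provides under those names.

```python
from dataclasses import dataclass
from typing import List


def delta(prev: float, cur: float) -> float:
    """Increase between two consecutive samples; a drop means the counter reset to 0."""
    return cur - prev if cur >= prev else cur


@dataclass
class BucketSummary:
    first: float      # first sample in the bucket
    last: float       # last sample in the bucket
    increase: float   # reset-aware increase between samples inside the bucket


def summarize(samples: List[float]) -> BucketSummary:
    """What would be materialized per time bucket (samples must be non-empty)."""
    inner = sum(delta(a, b) for a, b in zip(samples, samples[1:]))
    return BucketSummary(samples[0], samples[-1], inner)


def total_increase(buckets: List[BucketSummary]) -> float:
    """Query-time step: add the increase across each bucket boundary."""
    inner = sum(b.increase for b in buckets)
    across = sum(delta(a.last, b.first) for a, b in zip(buckets, buckets[1:]))
    return inner + across


raw = [10, 15, 3, 7, 9, 2]           # resets at 15 -> 3 and 9 -> 2
buckets = [summarize(raw[:3]), summarize(raw[3:])]
print(total_increase(buckets))       # 16
print(summarize(raw).increase)       # 16, same answer from the raw samples
```

Because each summary is computed from a single bucket of a single table, it fits the "materialize simple, combine at query time" advice above; the cross-bucket step is the part that would otherwise need a window function.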

Meant "TimescaleDB is great and my team is using it"...

How does multi-node work with Postgres? Does TimescaleDB create its own Raft layer on top and just treat Postgres as dumb storage?

A TimescaleDB engineer here. The current implementation of database distribution in TimescaleDB is centralised: all traffic goes through an access node, which distributes the load across data nodes. The implementation uses 2PC (two-phase commit). PostgreSQL's ability to generate distributed query plans is utilised together with TimescaleDB optimisations. So PostgreSQL is not used as just dumb storage :)
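For readers unfamiliar with the pattern, here is a toy sketch of an access node that routes rows to data nodes and uses a generic two-phase commit. It is a textbook illustration under stated assumptions, not TimescaleDB's implementation: the classes `AccessNode` and `DataNode`, the CRC32-based routing and the in-memory "storage" are all made up for the example.

```python
import zlib


class DataNode:
    """Toy data node: stages rows in phase 1, makes them durable in phase 2."""

    def __init__(self, name: str, healthy: bool = True):
        self.name = name
        self.healthy = healthy
        self.staged = []   # rows prepared but not yet committed
        self.rows = []     # committed rows

    def prepare(self, rows) -> bool:
        if not self.healthy:
            return False   # vote "no"
        self.staged = list(rows)
        return True        # vote "yes"

    def commit(self) -> None:
        self.rows.extend(self.staged)
        self.staged = []

    def rollback(self) -> None:
        self.staged = []


class AccessNode:
    """Toy coordinator: every write goes through here and is split across nodes."""

    def __init__(self, nodes):
        self.nodes = nodes

    def insert(self, rows) -> bool:
        # Route each (series, time, value) row by a stable hash of its series name.
        batches = [[] for _ in self.nodes]
        for row in rows:
            index = zlib.crc32(row[0].encode()) % len(self.nodes)
            batches[index].append(row)

        # Phase 1: ask every node to prepare; abort everywhere on the first "no".
        prepared = []
        for node, batch in zip(self.nodes, batches):
            if not node.prepare(batch):
                for done in prepared:
                    done.rollback()
                return False
            prepared.append(node)

        # Phase 2: all nodes voted "yes", so commit on all of them.
        for node in prepared:
            node.commit()
        return True


nodes = [DataNode("dn1"), DataNode("dn2")]
ok = AccessNode(nodes).insert([("cpu", 1, 0.5), ("mem", 1, 0.2), ("disk", 1, 0.9)])
print(ok, sum(len(n.rows) for n in nodes))  # True 3
```

The all-or-nothing behaviour is the point: if any data node refuses in phase 1, nothing is committed anywhere.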

Won't that access node be a single point of failure then? Just trying to learn more.

The access node is replicated using streaming replication and is thus not a SPOF.

> high-cardinality workloads

Does this mean if you were using it with Prometheus you could get around issues with high cardinality labels?

We're using TimescaleDB + Grafana for visualising sensor data for a Health Tech product. No complaints so far.

After a few years in the industry of systems engineering and administration, I think that "no complaints so far" is one of the best compliments a piece of software can receive.

I've been doing monitoring of our ~120ish machines for 3-4 years now using Influx+Telegraf+Grafana, and have been really happy with it. Prior to that we were using collectd+graphite and with 1 minute stats it was adding some double-digits %age utilization on our infrastructure (I don't remember exactly how much, but I want to say 30% CPU+disk).

Influxdb has been a real workhorse. We suffered through some of their early issues, but since then it's been extremely solid. It just runs, is space efficient, and very robust.

We almost went with Prometheus instead of Influx (as I said, early growing pains), but I had just struggled through managing a central inventory and hated it, so I really wanted a push rather than pull architecture. But from my limited playing with it, Prometheus seemed solid.

It's so much easier to write incorrect/misleading queries in influxql than in promql. And you can't perform operations between two different series names in influxdb, last I looked. That makes it impossible to do things like ratios or percentages unless you have control over your metrics, and structure them the way influx likes. Also, no support for calculating percentiles from Prometheus histogram buckets.
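For context on the last point, here is roughly what "calculating percentiles from histogram buckets" involves. Prometheus-style histograms store cumulative counts per upper bound (`le`), and the quantile is estimated by linear interpolation inside the bucket where the target rank falls. This is a simplified sketch of that idea as I understand it, not the PromQL source; the function name and bucket layout are assumptions for illustration.

```python
import math
from typing import List, Tuple


def histogram_quantile(q: float, buckets: List[Tuple[float, float]]) -> float:
    """Estimate the q-quantile from cumulative buckets.

    `buckets` is a list of (upper_bound, cumulative_count) sorted by upper
    bound, with the last upper bound being +Inf.
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0   # assume observations start at 0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # Rank falls in the +Inf bucket: best guess is the last finite bound.
                return prev_bound
            if count == prev_count:
                return prev_bound
            # Linear interpolation within [prev_bound, bound].
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound


latency_buckets = [(0.1, 50), (0.5, 90), (1.0, 100), (math.inf, 100)]
print(histogram_quantile(0.95, latency_buckets))  # 0.75
print(histogram_quantile(0.5, latency_buckets))   # 0.1
```

The estimate is only as good as the bucket boundaries: within a bucket, observations are assumed to be spread evenly.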

You can do these things (and much, much more) with the new Flux language

I just wanted to echo this sentiment. Influx had its fair share of issues a few years ago (v0.8 migration, changing storage engines, tag cardinality issues), but the latest v1.x releases have been solid. I have been using the TIK stack (I use Grafana instead of Chronograf) for monitoring several dozen production-facing machines for 2 years now without a single issue, which I would very much count as a win.

I just hope they learned their lesson for the v2.0 release...

> I really wanted a push rather than pull architecture

Then try VictoriaMetrics - it supports both pull and push (including Influx line protocol) [1], it works out of the box, and it requires less CPU and RAM when working with a large number of time series (aka high cardinality) [2].

[1] https://victoriametrics.github.io/#how-to-import-time-series...

[2] https://medium.com/@valyala/insert-benchmarks-with-inch-infl...

Gotta say though, having rolled grafana and prometheus and such on my own plenty of times before, if you are a startup and can afford Datadog, use Datadog.

I just costed a datadog deployment (based on your comment) and it would cost me my yearly salary every month.

No thanks. :/

We're moving from Datadog to Prometheus for this reason

Wait, you're saying it costs 12x as much as your salary? That doesn't seem right... the company I'm at uses it pretty heavily and we're at about 1x of a FTE salary (and it's still super worth it)

World’s salaries aren’t only US based :) Datadog for a smaller Western Europe startup is going to cost more than their devs salaries.

Western salaries are not that low, and the true cost to the employer is typically double the perceived salary. That means we're talking tens of thousands of euros, so thousands of hosts (list price is certainly negotiable at this scale).

If they've got a thousand hosts, the cost of the infrastructure itself must dwarf the salary of any developer by orders of magnitude; the salary of a developer is simply irrelevant when it comes to acquiring software/hardware.

> true costs to the employer is typically double the perceived salary.

This is true, but my comment was an offhanded way to say that my "salary" (as in, the one on my contracts and the one I "see") is less than a month of Datadog for our number of hosts.

As for the rest of your comment, I wish it was true.

Developer salaries outside of the capitals are quite low in Europe, and even inside the capitals only go to "near double".

So, instead of 12x it becomes 6x developer costs per annum, which is a fair whack of money.

For me to justify spending "3-6" people's worth of money, it had better save "3-6" people's worth of time.

> For me to justify spending "3-6" peoples worth of money it had better save "3-6" peoples worth of time.

Well, it does in my experience, especially if you have to handle 2000+ hosts. That's some serious infra, and it needs serious tooling.

May I ask which country it is?

I think another thing to keep in mind is that hardware infrastructure size doesn't necessarily equate to profitability.

Some industries need a lot of hardware because they crunch a lot of data but they aren't software companies. Think Computer graphics rendering.

Paying an FTE salary for software is crazy for them. I would love to see a ratio of developers/infrastructure per industry/company.

Sweden! :D

You're right, I was narrow-minded in my thinking there, thanks for pointing it out :)

Can you say something vague about your deployment scale?

~2,500 compute instances.

Looking at the pricing page, the cost is $15/instance, so $37,500 a month.

Middle-of-the-road developer salary is like $35k in most of Europe, outside of the capitals.

Although it does say: "Volume discounts available (500+ hosts/mo). Contact us." at the bottom, so I guess 500 is a lot.

2500 instances could be millions per month in AWS costs. The smallest instances with some disks and bandwidth fees can push 100k a month.

Spending a fraction of that to monitor that sort of infrastructure is absolutely justified. I can tell you from experience that Datadog gives discounts even for 100+ hosts. I don't know what they can do for 2500, but if it were me I wouldn't accept anything less than 50% off.

Honestly, you need to forget about your salary; it's irrelevant when it comes to running a company. Imagine a driver in a shipping company deciding to deliver on a scooter rather than a truck because the truck is worth more than his yearly salary.

It could also be less than $100k a month (e.g. 2500 c5a.large with a 1-year reservation). At that point you'd wonder why your monitoring bill was 40% of your compute bill.

Also, of course their salary is relevant. The cost of an engineer's time is an important factor to consider when making build vs buy decisions. Usually it's one that argues in favor of "buy", but not always.

2500 c5a.large is $172k per month, or $102k with one year reservation full upfront.

Add 10% support fees, EBS storage for the OS, bandwidth fees and it's quite a bit more.
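Since the numbers are flying around a bit, here is the arithmetic from this sub-thread in one place. Every input is a figure quoted by a commenter above (list price per host, host count, salary, and the two compute estimates); none of it is independently verified pricing, and the 50% discount is the hypothetical one suggested earlier in the thread.

```python
hosts = 2500
list_price_per_host = 15          # USD per host per month, as quoted from the pricing page
salary_per_year = 35_000          # USD, the "middle-of-the-road" salary quoted above

monitoring_per_month = hosts * list_price_per_host        # 37500
monitoring_per_year = monitoring_per_month * 12           # 450000
salaries_per_year = monitoring_per_year / salary_per_year

compute_on_demand = 172_000       # USD/month for 2500 c5a.large, as quoted
compute_reserved = 102_000        # USD/month with a 1-year reservation, as quoted

print(monitoring_per_month)                                  # 37500
print(round(salaries_per_year, 1))                           # 12.9 salaries per year
print(round(monitoring_per_month / compute_on_demand, 3))    # 0.218 of on-demand compute
print(round(monitoring_per_month / compute_reserved, 3))     # 0.368 of reserved compute

# With the hypothetical 50% volume discount mentioned above:
discounted = monitoring_per_month * 0.5
print(round(discounted / compute_reserved, 3))               # 0.184
```

So at list price the monitoring bill is roughly 22% to 37% of the quoted compute bill, and about 18% of the reserved figure if a 50% discount were actually available.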

It's not millions, but it's many multiples of hundreds of thousands (and it's mainly GCP/bare metal).

I guess the point I am driving at here is that there's such a thing as "business critical costs" (IE: can we ship our product or not) which is the majority of infra costs we have today, and then there's "optimisation costs".

Usually when we discuss things like optimisation costs, it's along the lines of: "Will this product save us enough time to justify its expense?" Often, sadly, the answer is no.

Terraform Enterprise is an example of a time where we said: Yes. -- because the API allows us to deploy CI/CD jobs which provision little versions of our infrastructure, saving us many man-days of time in provisioning and testing every year.

As alluded to in the sibling thread, there's almost no way that we can save 3 or more people's worth of time every year. We're 3 people right now and we have metrics collection, log tracing and alerting already -- so it's a hard sell to the business types.

Great choice on GCP! It's probably half the price of AWS for the same thing.

Monitoring and logging are business critical. It's an integral part of infrastructure and it is very normal to spend 10% there. It's really not possible to operate stably and efficiently at a large scale like that without a trove of tooling.

Tools usually justify their costs by letting you optimize the infra and helping to prevent/fix outages, though not all companies care about stability or hardware costs.

And it's not a choice of free vs paid. Open source software costs a lot of money too: pairs of large instances to run it don't come cheap, and they're probably more than a salary too if the company wants to have any sort of redundancy or geographic distribution.

May I ask what do you have for logging? I guess you must be screaming in horror at the price of elasticsearch/kibana/splunk :D

The community versions of ELK, Zabbix, InfluxDB and grafana.

Zabbix is the weak link here for sure, but the monitoring is quite comprehensive.

Or at that point you can have a datacenter/colo and not spend anywhere near what you're thinking :)

As a counterpoint: I have been rolling my own Grafana/Prometheus for many years for my startup and it's been pretty trivial

Same here. DataDog would cost me at least 10x.

How are you running it? On AWS? What does your infra look like?

GCP with Docker Swarm, scheduled on the same node (n2-standard-4). Let me know if any other questions

Does "scheduled on the same node" imply you are running everything on one node?

Nope, just that Prometheus and Grafana are running together on the same node. My entire infrastructure is multi-node however.

How much data are you sending and ingesting?

I don't have any hard numbers but quite a bit. Everyday we process millions of background jobs, thousands of database queries per second, etc. and stats are collected and sent to Prometheus for all those.

This is good advice, however you will also want to make sure you have a plan to get off datadog when you grow. Datadog is one of the easiest to use and most comprehensive out of the box. But it gets really expensive as you begin to scale up and add servers, cloud accounts, services etc....

At a certain scale, rolling your own monitoring and alerting becomes cost effective again as Datadog begins to charge an arm and a leg. I've seen Datadog bills that could easily pay for 2 full time engineers.

> I've seen Datadog bills that could easily pay for 2 full time engineers.

Which means the companies are running thousands of hosts. So they definitely need both datadog and a full team of sysadmin/devops/SRE to handle that infrastructure.

On the contrary, the bigger you get, the more you need Datadog to scale. Once you grow big enough, you can make volume deals with them, you don't pay list price. I'd much rather pay them than have a single engineer have to spend any time managing infrastructure that isn't core to our product, especially since Datadog will always do it better.

I respectfully strongly disagree. At a certain scale when you can afford to have a full-time engineer working on it, open source tooling will give you a better, cheaper and more flexible solution since it's easy to customize. Datadog is good for the mainstream cases but it's not all that flexible for unusual or edge cases.

We migrated customers off DataDog for this exact reason.

Rolling a custom Prometheus / Grafana / Alertmanager setup is not hard at all, more powerful, and it's much easier to do it right from the start.

DataDog per-host pricing can be very expensive. Metrics are provided by many platforms. If you need logs too, you may look at Sumo Logic, which has much cheaper metrics in the typical use case.

Disclaimer: I work at Sumo Logic.

It's expensive for good reason, it's the best out there.

That's debatable. I prefer Grafana.

Like any SaaS dev tool, when at scale, you negotiate and pay a fraction of the list price.

It's meaningless to look at the price of Datadog@5 hosts -- at 500 or 5000, you're paying a completely detached number from the website list price, likely a small fraction.

Based on my experience at two companies with thousands of hosts on Datadog, this is not true. You'll only be able to negotiate a small discount.

You'll also experience their habit of launching new features, waiting a while for customers to adopt them, then starting to charge extra for them.

Don't get me wrong, Datadog has great products. But they're also great at extracting money from their customers.

Which blows away the reason a lot of people/teams/companies like SaaS. Because there is no negotiation, no sales requisitions, no long lead time while they come up with a quote. You see the price, you pay the price, you get the service, same as anyone else.

Every SaaS provider offers discounts based on length of commitment and volume of spend. Most will say look at AWS, but even there you have list prices and private/bulk pricing. Throw in EDPs, negotiated credits, savings plans, reservations, etc and you're nowhere near list prices.

This is not my experience at all. We've got it setup and monitoring all sorts of things. After the initial setup, the only reason we've needed to touch it is when we've introduced new things and wanted to update the config. In fact, data dog was far more of a pain, and far less useful than prometheus has been.

I'd be interested to know a bit of detail as we're looking into Grafana/Prometheus a bit.

Prometheus is a very needy child in terms of data volume and hardware resources. Running it is at least one engineer's full-time job. If you're a startup, you can outsource monitoring for a tiny fraction of the price, then move to Prometheus later if you are successful.

Please elaborate - how is it one engineer's full time job?

We run Prometheus in production and this hasn't been our experience at all.

A single machine can easily handle hundreds of thousands of time series, performance is good, and maintaining the alerting rules is a shared responsibility for the entire team (as it should be).

I think the parent's complaint is a function of how your engineering org "uses" prometheus.

If you use it as a store for all time series data generated by your business, and you want to have indefinite or very-long-term storage, managing prometheus does become a challenge. (hence m3, chronosphere, endless other companies and tech built to scale the backend of prometheus).

IMO, this is a misuse of the technology, but a lot of unicorn startups have invested a lot of engineering resources into using it this way. And a lot of new companies are using it this way; hence the "one engineer's FT job".

I'd agree; I'm at a large corp that has a need to store our data for a very long term. If we were using Prometheus as an ephemeral/short term TSDB to drive alerting only, it would be really easy.

For reference: A single machine can push 5k metrics so you're saying a single prometheus instance can easily serve 20s of hosts. lol

False in my experience. Full-time job? After the initial learning curve, a simple 2x redundant Prometheus poller setup can last for a long time. Ours lasted for 30,000,000 timeseries until encountering performance issues.

After that, we needed some more effort to scale out horizontally with Thanos, but again, once it's set up, it maintains itself.

My company's Prometheus setup was super easy, one $10/mo box. About 1 week of fiddling with all the exporters and configs but now it just runs and has for months.

You can be small with Prometheus and grow into needing an FTE for it - w/o having the migration hurdle of moving out-source to in-source

I approached InfluxDB since it looked promising. It did actually serve its purpose when it was simple and Telegraf was indeed handy. Now that I have more mature requirements I can't wait to move away from it. It freezes frequently, its UI, Chronograf, is really rubbish, functions are very limited and managing continuous queries is tiresome.

I'm now having better results and experience storing data in ClickHouse (yes, not a timeseries dB).

From time to time I also follow what's coming in InfluxDB 2.0 but I must confess that 16 betas in 8 months are not very promising.

It might just be me.

I have also had scalability and reliability issues with influx. And full of silly limitations like tagset cardinality and not being able to delete points in a specific retention policy etc. Am moving to classic rdbms and timescaledb.

Indeed, cardinality limit is a very painful aspect which blocked us since day one for certain metrics. I confess, at the time I didn't know any better. Now I wouldn't recommend it under any circumstances.

> From time to time I also follow what's coming in InfluxDB 2.0 but I must confess that 16 betas in 8 months are not very promising.

Don't read too much into that, it's more a result of wanting to get testable releases out early and temporarily redirecting engineering resources to take advantage of opportunities that arose in that timeframe than anything having to do with the code of 2.0 itself.

Our Cloud 2 SaaS offering is already running the 2.0 code in production (albeit with changes to support it being deployed as a massively multi-tenant service)

Fair enough. Good luck then with 2.0. I just wish the value proposition is strong enough to inspire users like me to try it again although much has to change for that to happen. I just wish I didn't have the feeling that the free version is just a bait for the SaaS. Anyhow, as said, good luck.

The OSS version is going to fill a vital role that SaaS just can't do, and we view it as a critical component in what InfluxDB provides. I don't think we're ever going to get away from needing on-prem and on-device deployments when dealing with time-series data, especially for emerging IoT/Edge use cases.

No, the company recognizes that the success of the SaaS is inevitably tied to the success of the OSS product. The only reason we were able to give the SaaS more focus recently is because we already had a working OSS product in the 1.x line that was meeting the needs of existing customers that we were and are continuing to invest in.

I must say that if scalability is not part of OSS then it inevitably smells fishy to me. I am willing to pay to avoid the pain of maintaining or designing clusters but not because it's the only way to scale up. There are too many options that allow scaling with the OSS version now, so it's hard to justify. It's still a no-go for me.

Again, just take it as one person's opinion. I can't even grasp how complicated and challenging it is to run what you offer. This is strictly a user's opinion.

Have you tried Victoria Metrics? It originally started as a ClickHouse fork for time series but is now a re-write keeping some of the same principles.

Yes, we use it as long term retention for Prometheus. I was also tempted to use it as stand-alone dB but never got to it. Do you use it as such?

Really, REALLY tried to love InfluxDB. But its system requirements, performance, and features are poor compared to things like TimescaleDB.

Influx is really shitting the bed with how they're handling the InfluxDB 2.0 release. The docs are a mess, and the migration tool seems like the result of a weekend hackathon. They're leaving a lot of customers with long term metrics in a tough spot.

If you're thinking about using Influx for long term data storage, look elsewhere. The company continuously burns customer goodwill by going against the grain, and bucking the ecosystem by trying to control the entire stack.

(I'm the VPE at Influxdata).

I appreciate this sentiment. We've been focused on building a SaaS version of Influxdata and are committed to a paired open source version of that. The open source version has been lagging as we work on the SaaS side.

However, we are committed to shipping a GA version of the OSS 2.0 stack around the end of Q3 that offers an in-place data migration capability from 1.x OSS.

We've spoken about this publicly in other forums. You can google "influxdays London talks" to hear Paul Dix (CTO/Founder) talk more about our OSS plans.

Classic "two masters" problem.

This was from the CTO last week: "This work won't be landing anywhere until sometime next year and it'll be landing in our Cloud 2 offering first." So the OSS is definitely a second-class citizen. And now that they've dropped all DevRel activity, don't expect much attention for OSS Developers and users.

Quite the contrary, we're finishing up packaging of the OSS version now to make it as easy as possible for you to get it and we've assembled a new team focused on getting OSS 2.0 released.

We continue to support all users, including OSS users, in our public Slack and Discourse, as well as Github. We have not "dropped all DevRel" activity.

¯\_(ツ)_/¯ Influx laid off pretty much the entire DevRel team in the last 6 months, but ok.

This is true, and has certainly been painful for us and puts more work on those of us who are still working on community and devrel, but we're still doing it.

In fact we've got one of our Developer Advocates doing a live broadcast right now over at https://www.influxdata.com/time-series-meetup/virtual-2020

Well, guess I know to never use Influx as a startup

Why is that? It takes minimal effort to get it up and running and you can either self-host or use the SaaS offering on any of the major clouds. There's even a free tier on the SaaS your startup can use that won't cost you a dime until your usage becomes significant.

Because the VPE sounds like the corporate sort of person who would not prioritize the things I'd want them to as a customer.

That kinda hurts.

There are 100k's of happy InfluxDB users -- untold millions of completely open source Telegraf deployments doing meaningful work for people -- integrated into our products and our competitors' products. We make the vast majority of our codebase public under liberal OSS licenses.

And we'll keep listening and learning how to do more.

... off to see if I can change my internal slack handle to @corporatetool. Unless there's a policy against that ;-)

Hey, yeah, sorry, I've been meaning all day to go back and fix that. Really wasn't right.

Just struck a nerve that reminded me of some leaders at a previous gig who made things a little less sensible.

I apologise, and you definitely don't deserve it :-(

I don't think anybody who knows Ryan would describe him as "the corporate sort of person" :)

He was giving you an honest assessment of what was going on, not sugar-coated or wrapped in corporate speak. The work he described on our SaaS offering was directly tied to what our current and potential customers wanted.

Your issue is that they are prioritizing paying customers over giving you a free product.

I don't think that's fair given they are still a startup.

Just to add to what Ryan already said, we don't plan on dropping support for InfluxDB 1.x anytime soon, and will continue to improve the migration path from 1.x to 2.0, so you won't have to upgrade until it's right for you.

Now of course our goal is to help everyone upgrade to 2.0 and beyond, but we know that we made a lot of changes and improvements in 2.0, so this isn't a minor upgrade. We will focus first on what we need to do to help 80% of our users upgrade, then the next 80%, and so on, until we've got you all covered.

Meanwhile new users can benefit from starting off on 2.0 as soon as it's available (or get the beta which is already out) and not have to wait.

(Source: I'm the Community Manager for InfluxData)

I'm in the same boat, I tried very hard. Went through a few version upgrades with data incompatibility and other issues. Used it for personal projects as well as for different projects at work. I truly regret ever recommending it to different teams at work.

I ended up moving away from it to TimescaleDB and I’m pretty happy so far.

I use InfluxDB on a number of big gov production services, and have never had a single issue with it. I use it for all metrics (applicative and infrastructure).

It's easy to deploy and it's cross-platform. Not sure what all the comments here are about - maybe really huge loads, which I don't have experience with. I usually use a separate InfluxDB per service.

We had major issues with scaling InfluxDB. We use ClickHouse (graphite table engine) now and it is more than an order of magnitude more resource efficient.

What about storage? We are running influxdb and we are looking for alternative. But a point where Influx is good is storage.

TimescaleDB provides community features such as compression, which saves a lot of space, and continuous aggregates, which improve performance and save space when used together with retention policies.

Tried VictoriaMetrics?

The author compares with other storage engines comprehensively from time to time.


You mean storage efficiency? Seems that unless the aggregation is happening before being stored, it would be unfair to compare influxdb with other databases that are tasked with storing per-record granularity.

InfluxDB is pretty good if you don't need to do any advanced querying like grouping by month or formulas. Its strong points are low diskspace footprint and very fast queries even over long periods of time.

Have you looked at Flux? You can do some really incredible things with it, both inside of queries and also in tasks/alerts. Check out https://www.influxdata.com/blog/anomaly-detection-with-media... for an example

Yes I have looked at it and it's unusable for my application because it is orders of magnitude slower than InfluxQL. Fairly logical if you think about how Flux works, it's very hard to optimize a query engine with a language where every step is supposed to be discrete.

There are actually multiple optimizations to Flux already underway. The language runtime can "push down" some of these operations to the storage layer where they can be performed more efficiently.

Is that really true? Do you have any reference for that, I’m curious how different it is. FWIW I use InfluxDB and haven’t really thought it was a huge resource hog.

InfluxDB may require large amounts of RAM when working with high-cardinality data. It works OK with up to a million time series. Then it starts eating RAM like crazy [1]. How many time series does your InfluxDB setup contain?

[1] https://medium.com/@valyala/insert-benchmarks-with-inch-infl... .
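
If you're not sure how many series you have, a back-of-the-envelope estimate is the product of the number of distinct values per tag. That is only an upper bound, since not every combination actually occurs. The tag names and counts below are invented for illustration, not taken from any real deployment.

```python
from math import prod


def series_cardinality(tag_values: dict[str, int]) -> int:
    """Upper bound on distinct series for one measurement.

    Each series is one unique combination of tag values, so the worst
    case is the product of the distinct-value counts of every tag.
    """
    return prod(tag_values.values())


# HYPOTHETICAL tag layouts, for illustration only.
modest = {"host": 100, "region": 5, "service": 20}
with_container_ids = {**modest, "container_id": 500}

print("modest layout:     ", series_cardinality(modest))
print("with container ids:", series_cardinality(with_container_ids))
```

Adding a single unbounded tag (container IDs, request IDs, user IDs) is usually what pushes a setup past the million-series range mentioned above.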

Do any of those matter for small-scale monitoring (say <= 10 hosts)? I've got influx sitting pretty much idle in that kind of environment and any data preprocessing / collation in grafana works just fine.

Arguably system requirements matter more for small scale deployments. I shouldn't need a server with multiple SSD volumes and 8GB+ ram just to monitor a couple raspberry pi's.

EDIT: okay, I get it. You don't need 8GB ram. I was just going by the hardware requirements in their docs: https://docs.influxdata.com/influxdb/v1.8/introduction/insta...

... and it doesn't need that. I'm running telegraf+influx+grafana along with other services on a small rpi without issues.

Running TICK stack here collecting metrics from around a dozen different hosts (more if you count network devices like switches) to a central server with a default retention policy on spinning rust. InfluxDB process is currently sitting at 600MB RSS.

You don't need 8GB ram to run a small scale deployment of influxdb.

I'm sure this is true, but the documentation does explicitly say "Each machine should have a minimum of 8GB RAM" with absolutely no caveats about scale, so I think people can be forgiven for taking a quick look and moving on because they don't want to provision that.

That's a valid point, I've brought it up internally to improve the docs to explain what type of load this requirement is expecting. If you'd like you can create an issue for this in Github (https://github.com/influxdata/docs.influxdata.com/issues/new) to be involved in the discussion and change.

While we're working on this page, also check out https://docs.influxdata.com/influxdb/v1.8/guides/hardware_si... which gives a better breakdown of resources requirements based on expected load

Sure, if you're really not doing anything with it, then any database would work fine for you.

I still can't find any alternative to the old, RRD-based Munin. It is so simple. You want to add a new server to monitor? Just install there the node part, enable any required additional plugins (just by creating a couple of soft-links), add one-line configuration to the main server with the new node's IP address, and you are done.

Also, the aesthetics of the UX, you see all the graphs in one single page[1], no additional clicks are required - a quick glance with a slow scroll and you can see if there were any unusual things during the last day/week.

[1] - publicly available example, found by googling - https://ansible.fr/munin/agate/agate/index.html

That's more steps than using InfluxDB and Telegraf.

Check out: https://github.com/influxdata/community-templates/tree/maste...

[Offtopic (a bit)] Lots of you are talking about metric monitoring. But do you have recommendations when it comes to (basic) security Monitoring? I would usually go for the Elastic-Stack for that purpose, especially because Kibana offers lots of features for security monitoring. But I feel like these stacks are so big and bloated. I basically need something to monitor network traffic (Flows and off-Database retention of PCAPs) and save some security logs (I'm not intending on alerting based on logs, just for retention). But being able to have a network overview, insight into current connections (including history) is a very useful thing. Can anybody recommend something, that's maybe a bit lighter than an entire Elastic-Stack?

I think Gravwell (https://gravwell.io) might be what you're looking for--but I work for Gravwell so I may be biased! If I can be forgiven a short sales pitch, we've built a format-agnostic storage & querying system that easily handles raw packets, Netflow records (v5, v9, and IPFIX), collectd data, Windows event logs, and more. You can see some screenshots at https://www.gravwell.io/technology

We have a free tier which allows 2GB of data ingest per day (paid licenses are unlimited) which should be more than enough for capturing logs and flows. The resources needed to run Gravwell basically scale with how much data you put into it, but it's a lot quicker to install and set up than something like Elastic, in our opinion (https://www.gravwell.io/blog/gravwell-installed-in-2-minutes)

Edit: it's currently a bit roll-your-own, but we're really close to releasing Gravwell 4.0 which enables pre-packaged "kits" containing dashboards, queries, etc. for a variety of data types (Netflow, CoreDNS logs, and so on)

When you say

> Gravwell is developed and maintained by engineers expert in security and obsessed with high performance. Therefore our codebase is 100% proprietary and does not rely on open source software. We love open source, but we love our customers and their peace of mind a lot more!

does that mean you've even rolled your own webserver? Programming language?

That's... not good copy. I think it must have been written long ago. We use open-source libraries (with compatible licenses, of course) and even maintain our own set of open-source code (https://github.com/gravwell). I'll talk to the guys who maintain the website and get that fixed. Thanks for pointing it out!

Edit: We've had lots of people assume we use Elastic under the hood, so I wonder if that was just a (poorly-worded) attempt to indicate that our core storage and querying code is custom rather than some existing open-source solution.

Maybe you should just wipe that paragraph completely. I get that investors like to see that you are using proprietary code, but I wouldn't expect you to be faster with that. Especially when running against Elastic, which has over 1,400 contributors currently. But you don't necessarily need to. You can get me with being focused on the right thing and not bloating your software. A lot of big projects start to lose focus and start doing everything, hence become worse at doing their main job.

Especially when it comes to security, I'd like to see the lowest complexity possible. Harden your software instead of feature-fu around. That would be a good USP (I've got the feeling that no vendor has realized this so far - but neither have customers).

I don't mind you giving a small sales pitch...and maybe your product is indeed what I'm searching for. But your pricing model is instantly putting me off. Same as with Splunk, you end up with not being able to predict your cost and paying way too much. Tell me when you fix that and I might be interested ;)

Edit: Sorry, I was misreading your comment. Premium is unlimited...I will look into it, thanks. :)

Yep, paying customers are licensed by the node rather than by the gigabyte (as Splunk does it), and you're really only limited by your hardware at that point. You might be surprised at how much you can accomplish on the free license, though--there are several small businesses using it to monitor their networks because 2GB/day will hold a pretty hefty amount of Netflow, collectd, Zeek, and syslog records.

(Disclaimer: CEO & founder of Tenzir)

We at Tenzir are developing VAST for this purpose: https://github.com/tenzir/vast. It's still very early stage, but if you're up for trying something new, a lean and modern C++ architecture, BSD-license open-source style, you may want to give it a spin. The docs are over at https://docs.tenzir.com/vast.

It supports full PCAP, NetFlow, and logs from major security tools. There are CLI and Python bindings. The Apache Arrow bridge offers a high-bandwidth output path into other downstream analytics tools.

Maybe Loki [1] meets your needs? It lacks the analytics abilities of Elastic (e.g. what's the average response time [2]) but is much simpler to set up and use for log aggregation, and has a pretty powerful query language for digging through and graphing log statistics (e.g. how many errors have been logged per hour). It's mainly being developed by Grafana Labs, so there's great integration in Grafana.

[1] https://grafana.com/oss/loki/

[2] I'd argue this sort of thing should be published as a metric anyways so you don't have to pull it out of the logs

Thanks for the tip. But I would still need some storage and a data shipper, right? Or is Loki also taking care of storage?

Promtail is the official log shipper for Loki, but you can also use others. See https://github.com/grafana/loki/blob/master/docs/sources/cli...

As for storage, the default is BoltDB for indexes and local file system for the data, but you can also use popular cloud solutions like DynamoDB, etc. AFAIK BoltDB is automatically installed when you install Loki.

The only possible pain point I see for you is that Loki is tailored for Kubernetes. It is totally possible to use it without running a K8s cluster, but you lose some features.

If you want to do logs, you can use graylog or kibana, both using elasticsearch for storage. This lets you find what was connecting where at some point in time (HTTP request logs and database connection logs).

If you want to graph connections from service to service in real time, I've actually never found anything that was capable of doing that, not even paid software.

Grafana together with Loki would be a good match for you

Some people are probably going to throw some shade on me for saying this since it's so out of fashion but in my mind, when it comes to some types of basic monitoring (SNMP monitoring of switches/linux servers, disk space usage, backups running and handling them when they don't) then Nagios does get the job done. It's definitely olives and not candy[1] but it's stable, modular, relatively easy to configure (when you get it) and it just keeps chugging away.

If anyone has any Nagios questions, I'd be happy to answer them. I'm a Nagios masochist.

(Also, I can recommend Nagflux[2] to export performance data, metrics, to InfluxDB because no one should have to touch RRD files)

[1]: https://lambdaisland.com/blog/2019-08-07-advice-to-younger-s...

[2]: https://github.com/Griesbacher/nagflux

Nagios is the Jenkins of monitoring. It's popular because you can get it running in an afternoon, and it's easy to configure by hand.

It then rots within your infrastructure, because it resists being configured any way _except_ by hand. I've built two systems for configuration-management of Nagios (at different companies), and it's an unpleasant problem to solve.

Prometheus's metric format and query syntax are cool, but the real star of the design is simply this: you don't have to restart it, or even change files on your Prometheus server, when you add or remove servers from your environment.
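
To make that concrete: Prometheus has several service discovery mechanisms (Consul, Kubernetes, EC2 and others) where the server picks up new targets without anyone touching it. The simplest one to sketch is file-based discovery, which admittedly does rewrite a small targets file, but needs no restart and no edit to the main config. The helper below is a hypothetical sketch of a script that regenerates such a file; the function name, hostnames and label values are my own assumptions, and only the JSON shape (a list of objects with `targets` and `labels`) follows the documented file_sd format.

```python
import json
import os
import tempfile


def write_targets(path: str, targets: list[str], labels: dict[str, str]) -> None:
    """Atomically write one file_sd target group to `path`.

    Writing to a temp file and renaming means a process watching the
    file never observes a half-written JSON document.
    """
    payload = [{"targets": sorted(targets), "labels": labels}]
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as handle:
        json.dump(payload, handle, indent=2)
    os.replace(tmp_path, path)


# Demo in a throwaway directory; hostnames and labels are HYPOTHETICAL.
with tempfile.TemporaryDirectory() as workdir:
    target_file = os.path.join(workdir, "node_targets.json")
    write_targets(target_file, ["web-2:9100", "web-1:9100"], {"env": "demo"})
    with open(target_file) as handle:
        print(handle.read())
```

On the Prometheus side this would be paired with a `file_sd_configs` entry in the scrape config pointing at the generated file. Anything that can produce that JSON (cron job, config management, a deploy hook) can add or remove hosts.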

I have to use an icinga instance from time to time (icinga is a nagios fork). I really can't see the value, beyond seeing if a service is up or down.

I'm surprised no one has named Zabbix. Zabbix is way better. I hadn't the chance to use Zabbix past 4.something but it's worth it.

I've been using Prometheus/Grafana and frankly the value I see is its out-of-the-box adaptability at capturing a mutating data source (example: metrics about ephemeral pods in a Kubernetes cluster).

> I really can't see the value, beyond seeing if a service is up or down.

This is an extreme oversimplification. The value is not in "seeing" if something is "up or down", the value is in the modularity of what a "service" can mean in the first place (anything you can script -- and the eco-system of plugins is huge), the fact that you don't have to "see" it (because notifications are extremely modular), the fact that escalations of issues can happen automatically if they are not resolved, and the fact that event-handlers in many cases can help you resolve the issue automatically without even having to raise an alert in the first place.

Nagios is a monitoring tool built with the UNIX philosophy in mind, and it's ingenious in its simplicity: decide state based on script or binary exit codes, relate dependencies between objects to avoid unnecessary troubleshooting, notify if necessary (again, with scripts/binaries) and/or try to resolve if configured. It hooks into a server frame of mind very well if you're a sysadmin.
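
For people who have never written one, a check plugin really is just a program that prints one line and exits with a status: 0 for OK, 1 for WARNING, 2 for CRITICAL, 3 for UNKNOWN. Below is a minimal sketch of a disk usage check. The thresholds, message wording and function names are my own choices for illustration, not anything shipped with Nagios.

```python
import shutil

# Conventional plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def evaluate(used_percent: float, warn: float = 80.0, crit: float = 90.0) -> tuple[int, str]:
    """Map a usage percentage to (exit code, status line)."""
    if used_percent >= crit:
        return CRITICAL, f"DISK CRITICAL - {used_percent:.1f}% used"
    if used_percent >= warn:
        return WARNING, f"DISK WARNING - {used_percent:.1f}% used"
    return OK, f"DISK OK - {used_percent:.1f}% used"


def main(path: str = "/") -> int:
    """Run the check against `path` and return the exit code to use."""
    try:
        usage = shutil.disk_usage(path)
    except OSError as exc:
        # The check itself failed, which is different from a bad result.
        print(f"DISK UNKNOWN - {exc}")
        return UNKNOWN
    code, message = evaluate(100.0 * usage.used / usage.total)
    print(message)
    return code


# In a real plugin file, finish with:
#     import sys
#     sys.exit(main())
```

Because the contract is only "exit code plus a line of text", the same script works as a cron job, from a shell, or under any of the Nagios forks, which is a big part of why the plugin ecosystem is so large.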

Sure, if your main use case is "mutating data sources" and collecting metrics, any Nagios flavor won't be for you, because it's not what Nagios is made to do. There's a reason it's extremely popular in large enterprises, because it was created for them. No monitoring solution is for everyone and solves every problem.

> you can get it running in an afternoon, and it's easy to configure by hand.

This reads like a joke. Nagios looks like it's from the stone age, with files in a cgi-bin folder and unnecessarily complicated installation and management, unless they made it any better at some point.

While many people conflate "Nagios" with the corporate offering from the company Nagios, I personally mean the core monitoring component. There's no web interface to it (many are available, they're all ugly, but they're also not strictly necessary).

Yeah, I was in no way insinuating that Nagios is superior in general, or even to Prometheus, just that it does the job well for some use cases. Monitoring is tricky and you definitely need a tool box because each problem has a different optimal solution.

Nagios and its forks for sure have a place in the monitoring ecosystem. They’re just not tools that tend to stick around once you’re big enough to have a dedicated DevOps or SRE team.

That depends completely on what type of business you're running. Anyone that has an environment that is unlikely to massively change (expansion excluded) within a few years benefits from the stability of Nagios. I know many large companies that use it, or derivatives of it, and even many state/military organizations.

We run this at work, and I have (to put it politely) severe reservations. What handles these functions for you:

- Realtime GUI which works with Windows 10 (we have a web site and Nagstamon)

- aggregation of alerts / alert roll-up

- sharing filters

- summary + description

- temporary downtime of alerting

- message rate suppression (to stop floods)

- filtering of columns/ordering etc.

- bulk actions for closing alerts

I've come from using:

- HPOV (great but can't handle bursts of alerts)

- email (everyone has to have filters, can't handle bursts of alerts that well (runs everyone out of email space!), prone to failure/delays due to email)

- home grown solution

I'm not sure what you mean by "this" but I suspect you mean Nagios XI or some other corporate offering. I was referring to the core monitoring component and its forks/derivatives.

Other than that I don't quite understand the point of your comment, you say you have "severe reservations" but many of the points you list are available even with Nagios core, and most of them are available in other Nagios variants.

Possibly stupid question:

Can you use Nagios to stream metrics exported from your applications binary in real time?

For example, can you use Nagios to record each HTTP request processed by your application webserver, tagged with HTTP method, code, latency etc?

Nagios is not so much a "recorder" as it is a "state inspector", basically plugins run on a schedule and inspect that things are up to spec. The situation you describe may be better suited for something like the ELK stack which can hook into your HTTPD logs.

Ever looked at Zabbix?

I haven't, unfortunately, but it looks promising from just looking into it briefly. Open source monitoring is always an area that needs more competition.

I find the problem with monitoring is not a lack of options, but an overwhelming abundance of them. It’s almost impossible to evaluate all of them realistically, so you just wind up using the most popular one.

Zabbix has been around for quite a long time. Easily 15 years now. I haven't looked at it since around 2013, but at the time it was placing quite some pressure on a mysql db backend. It looks like they've expanded out to support more than MySQL as the back-end these days.

Zabbix has been growing A LOT lately, and in a good way. It's nice to see this kind of project evolving in a good direction instead of stagnating and then dying.

It's great to hear. I liked Zabbix in general. At the time of its initial surge in popularity, Nagios and Cacti were the dominant forces in monitoring. Zabbix was a little quirky to get used to, but a breath of fresh air.

I noticed that Zabbix supports PostgreSQL and TimescaleDB as back-ends and just checked the list, which also contains Oracle and SQLite (DB2 support is experimental).

I found the software in this stack to be very bloated and difficult to maintain. Large, complicated software has a tendency to fall flat on its face when something goes wrong, and this is a domain where reliability is paramount. I wrote about a different approach (Prometheus + Alertmanager) for sourcehut, if you're curious:


I have a bit of a love/hate relationship with Prometheus. At home I really like it; it was simple to set up for my needs and most of my configuration is on my server which then scrapes other machines for the data. However I find it quite frustrating at scale for work, both in its concepts (it's hard to describe but it's sort of...backwards?) and in its query performance, although that might be a side-effect of using it with Grafana and me attempting to misuse it. By contrast I think the concepts of something like TimescaleDB are easier to understand when it comes to scaling and optimising that service.

In my previous job I had a very clear use-case for not using Prometheus and did for a while use InfluxDB (it involved devices sending data from behind firewalls across many sites). I found it pretty expensive to scale and it fell over when it ran out of storage, which feels like something that should have been handled automatically considering it was a PaaS offering.

One point of note for SourceHut's Prometheus use is that we generally don't make dashboards. I don't really like Grafana. I will sometimes use gnuplot with styx to plot graphs on an as-needed basis:


This is how I made the plots in that blog post.

I have a similar relationship to Grafana as I do for Prometheus; love it for my home and I've got some very useful graphs for my home network, but it's almost unusable for work due to its speed degradation the moment you start adding more graphs. Again it's probably due to my lack of knowledge around some of the Prometheus functions for reducing the amount of data returned, but it would be nice if it could handle some of that automatically rather than just grinding to a halt.

Can't you generate the same kind of graphs you have there with the normal Prometheus query explorer / web ui?

On a basic level, yes, but I often just use it as a starting point for more complex gnuplot graphs, or different kinds of visualizations - box plots, histograms, etc.

I guess the https://github.com/prometheus/pushgateway could help with that? As for the query performance there's a lot of things you can do with recording rules, that might help a lot with speeding up dashboards or queries.

Yeah the pushgateway was the alternative to using InfluxDB. In the end we actually used Datadog for it, despite the cost, as it was just easier to scale on it (we had hundreds of devices per site). The pushgateway route with Prometheus just ended up feeling like there were too many things relying on each other, i.e. Prometheus -> Push Gateway <- Multiple agents on each device, is inherently more complex than just connecting directly to a DB/service from the device.

Try VictoriaMetrics next time. It supports data push via multiple popular data ingestion protocols [1] and it provides Prometheus-compatible API for Grafana [2].

[1] https://victoriametrics.github.io/#how-to-import-time-series...

[2] https://victoriametrics.github.io/#grafana-setup

For those who still remember Graphite, the team over at Grafana labs have started maintaining Graphite-web and Carbon since 2017 and it is still in active development getting improvements and feature updates. It might not scale as well as any of the other solutions, but for medium size or homelab setups it's still a nice solution if you don't like PromQL or InfluxQL.


For simple and fast monitoring solutions, I always opted for collectd + Graphite + Grafana. In a containerized environment, it's so easy to deploy (0 configuration by default) and monitor a set of 50-100 nodes. Beyond that, disk tuning and downsampling (pre-aggregation) rules become important.
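
As a rough illustration of what downsampling (pre-aggregation) means here, the sketch below averages raw points into fixed-width time buckets. It is a generic example in plain Python, not Graphite's actual aggregation code; the function name and the choice of averaging as the aggregation method are assumptions.

```python
from collections import defaultdict


def downsample(points, bucket_seconds):
    """Average (timestamp, value) points into fixed-width time buckets.

    Returns a list of (bucket_start, mean_value) pairs sorted by time.
    """
    buckets = defaultdict(list)
    for timestamp, value in points:
        # Align each timestamp to the start of its bucket.
        bucket_start = timestamp - (timestamp % bucket_seconds)
        buckets[bucket_start].append(value)
    return [
        (start, sum(values) / len(values))
        for start, values in sorted(buckets.items())
    ]
```

For example, five 10-second samples collapsed to 60-second resolution become three points, which is the kind of reduction that keeps disk usage manageable as node counts grow.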

If you're still on graphite and need scaling, metrictank is something to consider.

+1 for Vector. We moved from Logstash to Vector and we couldn't be happier. Logstash is awesome but it's a memory hog.

With Vector and Toshi you can kind of replace Logstash and Elastic (I am not sure Toshi is as mature as Elastic); the missing piece is Kibana.

Have you looked into Grafana Loki for logs? If I had to redo one part of our stack, that's what I'd choose.

Does Vector work with TimescaleDB? it looks quite interesting

I love Prometheus. It's simple and the built-in charts are enough without having to use Grafana on top.

The question is how to do long term storage though. Something I've had a bit of trouble reasoning about.

Right now all of my metrics are sitting in a PVC with a 30d retention period, so we're probably fine but for longer term cold storage the options aren't great unless you want to run a custom Postgres instance with the Timescale plugin or something else more managed.

Do you really need long term? I hate throwing away data but realistically I never really need old performance data. Some stats data is worth keeping but you can extract a few important time series and store them elsewhere.

In any case, Prometheus is throwing away data if a scrape can't be done. As clearly described on the website, Prometheus is not a metrics system. So Influx and Prometheus are quite different.

Mistake in my previous message. I wanted to say that Prometheus is not a log system. Metrics could be lost when scraping fails. This is OK in some cases, but you have to know that you can lose data.

How about another Prometheus server in federation[0]?

It's as simple as another Prom instance pointing at the "live" Prom instance. You can filter only the metrics you want to keep for a long amount of time, and downsample if necessary (by just setting a higher scrape interval on the "long term" Prom).
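
For anyone wondering what the "long term" instance actually requests, here is a small sketch that builds the `/federate` URL with `match[]` selectors. In practice this lives in the long-term Prometheus's scrape configuration (a job with `metrics_path: /federate` and `match[]` params) rather than in Python; the helper name and the `live-prom:9090` host are hypothetical.

```python
from urllib.parse import urlencode


def federate_url(base_url, selectors):
    """Build a /federate URL that returns only series matching `selectors`."""
    # Each selector is sent as a repeated match[] query parameter.
    query = urlencode([("match[]", selector) for selector in selectors])
    return f"{base_url.rstrip('/')}/federate?{query}"


# Hypothetical example: keep node metrics and pre-aggregated "job:" series.
url = federate_url(
    "http://live-prom:9090",
    ['{job="node"}', '{__name__=~"job:.*"}'],
)
```

Restricting the selectors is what makes this usable as a filter: only the series you name get copied into the long-retention instance.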

Since this "long term" Prom isn't in the critical path you could skimp on processing resources and just give it a big disk as a cost optimization.

If you are on AWS (or equivalent) the storage there is pretty durable. On-prem you can run HA (two instances with identical config).

[0] https://prometheus.io/docs/prometheus/latest/federation/

I second the Thanos link, but if you want to only use Prometheus, I would look into federation:



You can set up a new "cold storage" Prometheus with a longer retention that scrapes select metrics from your regular 30d Prometheus to store for longer periods of time.

For long term retention, look at VictoriaMetrics

My vote is for VictoriaMetrics - the best option if you want to keep things simple and get great performance at the same time!

It's sad to see technically inferior products having more popularity.

VictoriaMetrics pretty much has everything you'd hope for as a Prometheus long term storage, like direct PromQL support, good performance and ease of installation.

I think that's what https://thanos.io/ is for.

Thanks for the link, I'll check it out.

Thanos or Cortex

I don't. It's quite difficult to do complex queries and its query language has a few gotchas. I think it's quite good and one of the best solutions today, but I look forward to something as simple and as fast, but with a proper query language.

PromQL from Prometheus has a steep learning curve because of a lack of good documentation. Once you understand the PromQL basics, it is much easier to write typical queries over time series data in PromQL than in any other query language (SQL, Flux, InfluxQL, etc.). I'd recommend reading the article about PromQL basics [1] and then feeling the power of PromQL.

[1] https://medium.com/@valyala/promql-tutorial-for-beginners-9a...

Yes, it would be nice to be able to run SQL or something more standard. I like that it's just simple: one application and it just works.

InfluxDB and Grafana worked great for us when I created a live monitoring system for a fleet of prototype test robots. It was simple to set up new data streams. We started with Graphite but switched to InfluxDB for its flexibility (Grafana works with both!)

I would add to the guide that you need to be careful about formatting the lines into InfluxDB because where you put the space and commas determines what is indexed or not! Also data types should be specific (i.e. make sure you are setting integer vs float correctly).
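
To illustrate the point about spaces, commas and types, here is a small sketch of a line-protocol formatter. Everything before the first unescaped space is the measurement plus tags (indexed); everything after it is fields (not indexed); integers need an `i` suffix or they are stored as floats. The function names are made up for this example, and it covers only the common cases.

```python
def _escape(text, chars):
    """Backslash-escape every character of `chars` that occurs in `text`."""
    for ch in chars:
        text = text.replace(ch, "\\" + ch)
    return text


def format_field(value):
    """Render a field value with an explicit type."""
    if isinstance(value, bool):  # check bool first: bool is a subclass of int
        return "true" if value else "false"
    if isinstance(value, int):
        return f"{value}i"  # the trailing i marks an integer field
    if isinstance(value, float):
        return repr(value)
    return '"' + _escape(str(value), '\\"') + '"'  # strings are double-quoted


def to_line(measurement, tags, fields, timestamp_ns=None):
    """Build one line: measurement,tag=value field=value [timestamp]."""
    if not fields:
        raise ValueError("at least one field is required")
    head = _escape(measurement, ", ")
    for key, value in sorted(tags.items()):
        head += "," + _escape(key, ",= ") + "=" + _escape(str(value), ",= ")
    body = ",".join(
        _escape(key, ",= ") + "=" + format_field(value)
        for key, value in fields.items()
    )
    line = head + " " + body
    if timestamp_ns is not None:
        line += f" {timestamp_ns}"
    return line
```

With this, `to_line("cpu", {"host": "robot 1"}, {"load": 0.5, "cores": 4})` yields `cpu,host=robot\ 1 load=0.5,cores=4i`: `host` is an indexed tag, while `load` (float) and `cores` (integer) are fields.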

You can quickly do this on Windows:

    cinst influxdb1 /Service
    cinst grafana
    start $Env:ChocolateyInstall\lib\grafana\tools\grafana-*\bin\grafana-server.exe
    git clone https://github.com/majkinetor/psinflux
    import-module ./psinflux
    1..10 | % { $x = 10*$_ + (Get-Random 10); Send-Data "test1 value=$x"; sleep 1 }

I've only ever used third party monitoring tools, but hope to set up a startup again soon and want to do OSS if I can.

Can anyone comment on Prometheus vs Timescale? What are the tradeoffs? Or would I use Prometheus on top of Timescale?

(Timescale-Prometheus team lead here) I'd suggest using Timescale-Prometheus to connect Prometheus with TimescaleDB.

Repo: https://github.com/timescale/timescale-prometheus Design Doc: https://tsdb.co/prom-design-doc

You can use Prometheus on top of TimescaleDB. Timescale builds a connector and an entire workflow to run Prometheus on top of TimescaleDB and support Grafana in a flexible way. Sorry for the promo :) Check https://github.com/timescale/timescale-prometheus for details.


Prometheus is a monitoring system (it collects metrics and evaluates alerts on them), while TimescaleDB is a database. While it is possible to use TimescaleDB as a remote storage for Prometheus, I'd recommend taking a look at other remote storage integrations as well [1]. Some of them natively support PromQL (like VictoriaMetrics, m3db, Cortex), others may be easier to setup and operate. For instance, VictoriaMetrics works out of the box without complex configuration. Disclaimer: I work for VictoriaMetrics :).

[1] https://prometheus.io/docs/operating/integrations/#remote-en...

We switched from InfluxDB to TimescaleDB for our IoT solutions. InfluxDB is very difficult to work with when it comes to large datasets and enterprise/region compliance. We ingest around 100MB of data per day and growing.

That doesn't actually sound like a large dataset. Can you describe what kind of problems you faced with InfluxDB?

100MB data per day looks like a tiny number. Our users ingest 100GB data per day at the ingestion rate of 1M data points per second on a single-node setup [1].

[1] https://victoriametrics.github.io/CaseStudies.html

I've looked at these before, and I remember a few years ago when Grafana was really starting to get big, but I guess I have a bona-fide question: Who really needs this?

I manage a small homelab infra, but also an enterprise infra at work with >1,000 endpoints to monitor, and I/we use simple shell scripts, text files, and rsync/ssh. We monitor cpu load, network load, disk/io load, all the good stuff basically. The monitor server is just a DO droplet and our collectors require zero overhead.

The specs list and setup costs in time and complexity are steep with a Grafana stack - is there any value besides just the visual? I know they have the ability to do all manner of custom plugins, dashboards, etc, but if you just care about the good stuff (uptime+performance), what does Grafana give you that rsync'ing sar data can't?

PS: we have a graphical parser of the data written using python and matplotlib. very lightweight, and we also get pretty graphs to print and give to upstairs.

What sar+rsync doesn't provide:

- app-specific metrics

- quick and easy way to build a number of graphs searching for correlation (how to slice the data to get results that explain issues)

- log/metrics correlation

- unified way to build alerts

- ad-hoc changes - while you're in the middle of an incident and want to get information that's just slightly different from existing, or filter out some values, or overlay a trend - how long would it take in your custom solution vs grafana?

And finally - Grafana exists. Why would I write a custom graph generator from a custom data store if I can set up a collector + Influx + Grafana in a fraction of that time and get back more?

I'm not experienced with the CollectD stack, but I use Prometheus + Grafana to monitor probes. My two cents:

- Fairly lightweight. Prometheus deals with quite a lot of series without much memory or CPU usage.

- Integration with a lot of applications. Prometheus lets me monitor not only the system, but other applications such as Elastic, Nginx, PostgreSQL, network drivers... Sometimes I need an extra exporter, but they tend to be very light on resources. Also, with mtail (which is again super lightweight) I can convert logs to metrics with simple regexes.

- Number of metrics. For instance, several times I needed to diagnose an outage and needed a metric that I hadn't thought about, and it turned out that the exporter I was using did actually store it; I just hadn't included it in the dashboard. As an example, the default node exporter has very detailed I/O metrics, systemd collectors, network metrics... They're quite useful.

- Metric correctness. Prometheus appears to be at least decent at dealing with rate calculations and counter resets. Other monitoring systems are worse and it wasn't weird to find a 20000% CPU usage alert due to a counter reset.

- Alerts. Prometheus can generate alerts with quite a lot of flexibility, and the AlertManager is a pretty nice router for those alerts (e.g., I can receive all alerts in a separate mail, but critical alerts are also sent in a Slack channel).

- Community support. It seems the community is adopting the Prometheus format for exposing metrics, and there are packages for Python, Go and probably more languages. Also, the people who make the exporters tend to also make dashboards, so you almost always have a starting point that you can fine-tune later.

- Ease of setup. It's just YAML files, I have an Ansible role for automation but you can go with just installing one or two packages in clients and adding a line to a configuration file in the Prometheus master node.

- Ease of use. It's incredibly easy to make new graphs and dashboards with Prometheus and Grafana, no matter if they're simple or complex.
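
On the "metric correctness" point in the list above, here is a simplified sketch of why counter resets matter. It is not Prometheus's actual implementation (which also extrapolates over the query window); it only shows the basic idea that a drop in a counter is treated as a restart from zero instead of a huge negative or wrapped-around jump. Both function names are made up for the example.

```python
def naive_increase(samples):
    """Last value minus first value: breaks when the counter restarts."""
    return samples[-1] - samples[0]


def counter_increase(samples):
    """Total increase of a counter, tolerating resets.

    When a sample is lower than the previous one, the process is assumed to
    have restarted from zero, so the new value itself is the increase.
    """
    total = 0.0
    for previous, current in zip(samples, samples[1:]):
        if current >= previous:
            total += current - previous
        else:
            total += current  # counter reset detected
    return total


# A counter that restarts after the third sample.
samples = [100, 150, 200, 10, 60]
```

Here `naive_increase(samples)` gives -40, the sort of value that ends up as a nonsensical alert, while `counter_increase(samples)` gives 160.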

For me, the main points that make me use Prometheus (or any other monitoring config above simple scripts) is alerting and the amount of metrics. If you just need to monitor CPU load and simple stats, maybe Prometheus is too much, but it's not that hard to set up anyways.

Author here. I'll probably write another tutorial focusing on Prometheus, instead of CollectD.

Thanks for suggestion


It would be wonderful if you included limitations as well, to help people make the right decisions for their tech stack. I've been playing around with Prometheus lately for environmental monitoring, and long-term retention is particularly important to me.

During proof-of-concept testing, some historical data on disk perhaps wasn't lost per se, but definitely failed to load on restart. I haven't worked hard to replicate this but there are some similar unsolved tickets out there.

Additional traps for new players include customizing --storage.tsdb.retention.time and related parameters.

Thank you!

The biggest advantage is the near-real-time aspect of this. During an outage, having live metrics is essential. Does your custom system allow you to see live metrics as they happen, or do you need to re-run your aggregation scripts every 5 minutes?

At work and in my personal projects I use a Prometheus + Grafana setup and it's very easy to set up. Not sure what you mean by the complexity and setup costs of a Grafana stack?

Alerting with the Prometheus AlertManager is also pretty straightforward, and I'm looking at dashboards every day to see if everything is running smoothly, or tracking down what's not working well if there are any issues. Grafana dashboards are always the second thing I look at after an alert fires somewhere, and they have been invaluable.

It sounds like you home rolled a version of Grafana, CollectD et al. which probably took more time and effort than just installing Grafana CollectD et al.

I think you are overestimating the time and complexity to install and set up prometheus + grafana on a box, node exporter on your hosts and copy/paste a grafana dashboard for node exporter (which is your use case).

It gets complex only when you start monitoring your apps (i.e. using a prometheus client library to generate and export custom app metrics) and create custom grafana dashboards for these metrics. Or if you need to monitor some niche technology without its own existing prometheus exporter. Then yes, you need to read the docs, think about what you need to monitor and how, write code...

I love this attitude (simplest thing that can possibly work) and am trying to write a book about how to run a whole syseng department in this way.

In my view Grafana shines in app data collection. And what we want a lot of is that (my simplest thing that can possibly work is just to have a web server accepting metrics à la Carbon/Graphite).

So one is likely to have Grafana already lying around when one does the infra monitoring - I guess your choice is now how to leverage your setup to do app monitoring?
