I'm super excited about Prometheus, and can't wait to get some time to see if I can make it work on my Raspberry Pi. That said, I'm also likely to eventually work on a graphite-web / graphite-api pluggable backend that uses Prometheus as the storage platform.
The more OSS metrics solutions, the better!
I have a large Graphite install of 20+ carbon nodes running on SSDs, with three additional graphite-web instances in front generating graphs, ingesting something like 1 million metrics/min.
Also, I didn't realize there were still Graphite maintainers (seriously, not trolling). There hasn't been a release of Graphite in well over a year, so I assumed it was dead by now. Any idea when we'll get a fresh release?
Note that 0.9.13 is almost ready to be cut.
Anything in the master branch is what will be in 0.10.0 when we're ready to cut that. I think we'll spend some more cycles in 0.10.x focusing on non-carbon / non-whisper / non-ceres backends that should allow much better scalability. Some of these include cassandra, riak, etc.
As for it timing out, it's a matter of general sysadmin spelunking to figure out what is wrong. It could be IO on your carbon caches, or CPU on your render servers (where it uses cairo). I'm a HUGE fan of Grafana for doing 100% of the dashboards and only using graphite-web to spit out JSON, or alternatively using graphite-api.
Take a look at the maxDataPoints argument, though, to see if that will help your graphs not time out.
CPU on the render servers is low. IO on the carbon caches is acceptable (10k IOPS on SSDs that support up to 30k or so). If the CPU Usage Type graph would render it would show very little IO Wait (~5%). Graphs if you're interested: http://i.imgur.com/dCrDynY.png
Anyway thanks for the response. I'll keep digging. Looking forward to that 0.9.13 release!
Also look at graphite-api, written by a very active graphite committer. It is api only (only json), but absolutely awesome stuff. Hook it up to grafana for a real winner.
Unfortunately, though, Prometheus lacks easy horizontal scaling just like Graphite. It actually sounds like Prometheus is worse, since it mentions manual sharding rather than the consistent hashing that Graphite does. This rules out Prometheus as an alternative to Graphite for me, even if it does render complex graphs better. I'm definitely keeping my eye on this one though.
From experience that much data on a page makes it quite difficult to comprehend, even for experts. I've seen hundreds of graphs on a single console, which was completely unusable. Another had ~15 graphs, but it took the (few) experts many minutes to interpret them as it was badly presented. A more aggregated form with fewer graphs tends to be easier to grok. See http://prometheus.io/docs/practices/consoles/ for suggestions on consoles that are easier to use.
> It sounds like Prometheus is worse actually since it mentions manual sharding rather than consistent hashing that Graphite does.
The manual sharding is vertical. That means that a single server would monitor the entirety of a subsystem (for some possibly very broad definition of subsystem). This has the benefit that all the time series are in the same Prometheus server, so you can use the query language to efficiently do arbitrary aggregation and other math to make the data easier to understand.
Lots of lines on a single graph help you notice imbalances you may not have noticed before. For example, if a small subset of your cluster has lower CPU usage, then you likely have a load-balancing problem or something else weird going on.
What happens when a single server can no longer hold the load of the subsystem? You have to shard that subsystem further by something random and arbitrary. It requires manual work to decide how to shard. Once you have too much data and too many servers that need monitoring, manual sharding becomes cumbersome. It's already cumbersome in Graphite since expanding a carbon-cache cluster requires moving data around since the hashing changes.
I think it's important to have structure in your consoles so you can follow a logical debugging path, such as starting at the entry point of your queries, checking each backend, finding the problematic one, going to that backend's console, and repeating until you find the culprit.
One approach, console-wise, is to put the less important bits in a table rather than a graph, and potentially have further consoles if there are subsystems complex and interesting enough to justify them.
I'm used to services that expose thousands of metrics (and there are many more time series when labels are taken into account). Having everything on consoles with such rich instrumentation simply isn't workable; you have to focus on what's most useful. At some point you're going to end up in the code, and from there see what metrics (and logs) that code exposes, ad-hoc graph them, and debug from there.
> Lots of lines on a single graph help you notice imbalances you may not have noticed before. For example, if a small subset of your cluster has lower CPU usage, then you likely have a load-balancing problem or something else weird going on.
Agreed, scatterplots are also pretty useful when analysing that sort of issue. Is it that the servers are efficient, or are they getting less load? A qps vs cpu scatterplot will tell you. To find such imbalances in the first place, taking a normalized standard deviation across all of your servers is handy - which is the sort of thing Prometheus is good at.
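As a sketch of that last idea, a query along these lines gives the relative spread (normalized standard deviation) of CPU usage per job; the metric is just an example, and any per-instance rate or gauge would work the same way:

```
  stddev(rate(process_cpu_seconds_total[5m])) by (job)
/ avg(rate(process_cpu_seconds_total[5m])) by (job)
```

A value well above zero for a job means its instances are doing noticeably different amounts of work, which is the imbalance signal you'd then drill into.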
> You have to shard that subsystem further by something random and arbitrary.
One approach would be to have multiple prometheus servers with the same list of targets, and configured to do a consistent partition between them. You'd then need to do an extra aggregation step and get the data from the "slave" prometheus servers up to a "master" prometheus via federation. This is only likely to be a problem when you hit thousands of a single type of server, so the extra complexity tends to be manageable all things considered.
They have code for an in-place method of appending chunks, but in my admittedly brief foray through the code I couldn't actually find any use of it.
Just kidding, this is looking really good, I hope to get some hands-on experience with it soon.
This does seem to have addressed at least a couple of the issues with that system, in that its config language is sane, and its scrape format is well-defined and typed.
(One example from 2 years ago: https://www.reddit.com/r/IAmA/comments/177267/we_are_the_goo...)
That strikes me as a bit paranoid, not letting the name of a monitoring system be revealed. Am I missing something?
I wouldn't be surprised if many such cases of non-disclosure are to keep hungry lawyers from trying to suck some blood out of rich companies. I guess the Star Trek franchise is owned by Paramount or something like that.
I wonder what InfluxDB means by "distributed", that is, if I could use it to implement a push (where distributed agents push to a centralized metric server) model.
I wouldn't totally rule out the pushgateway for this use case. If we decided to implement a metrics timeout in the pushgateway (https://github.com/prometheus/pushgateway/issues/19), this would also take care of stale metrics from targets that are down or decommissioned. The pushing clients would probably even want to set a client-side timestamp in that case, as they are expected to be pushing regularly enough for the timestamp to not become stale (currently Prometheus considers time series stale that are >5 minutes old by default; see also the "Improved staleness handling" item in http://prometheus.io/docs/introduction/roadmap/).
Yeah, I did that benchmark with 11x overhead for storing typical Prometheus metrics in InfluxDB in March of 2014. Not sure if anything has changed conceptually since then, but if anyone can point out any flaws in my reasoning, that'd be interesting:
We can't pull them, because hitting the load balancer would randomly choose only one instance.
Instances are scaled up based on load, so we can't specify the target instances in Prometheus because the set keeps changing.
We'd like to try this out, but any ideas what to do for the above?
We're working on service discovery support so that you can dynamically change what hosts/ports Prometheus scrapes. Currently you can use DNS for service discovery, or change the config file and restart prometheus.
Previously, you would have had to encode metadata in the series name. Otherwise if you used string columns you'd see a massive waste in disk space since they were repeated on every measurement.
For metrics I go Riemann->Graphite. Riemann comes with a graphite compatible server so I push straight to that for processing and alerting. I also send from Riemann to logstash and logstash to Riemann where it makes sense.
For my metrics dashboard I use Grafana which is really awesome. I make use of its templating pretty heavily as I run a lot of individual IIS clusters. I can create cluster overview and drill down dashboards and template it so I can just change the cluster number to see its stats. You can also link directly into a dashboard passing variables as a query string parameter. Pretty excellent.
One challenge with that model of metrics is that it assumes that the monitoring system doesn't have much ability to work with samples, so it makes up for this by calculating rates and quantiles in the client. This isn't as accurate as doing it in the server, and some instrumentation systems don't allow you to extract out the raw data to allow you to do it in the server.
For example say you've a 1-minute rate exported/scraped every minute. If you miss a push/scrape you lose all information about that minute. Similarly if you're delayed by a second, you'll miss a spike just after the previous push/scrape. If instead you expose a monotonically increasing counter and calculate a rate over that in the server, you don't lose that data.
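To sketch the difference (the metric name here is just an example): with the counter approach, the client exposes only a cumulative total, and the server derives the rate at query time over whatever window you like, so a missed scrape loses resolution rather than data:

```
# Client exposes only a monotonically increasing total, e.g.:
#   http_requests_total 12345
# The server computes the per-second rate over any window at query time:
rate(http_requests_total[1m])
```

If a scrape is missed, the next sample still carries the full count, so the rate over a wider window remains correct.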
My full thoughts are up at http://www.boxever.com/push-vs-pull-for-monitoring
The short version is that I consider pull slightly better, but not majorly so. Push is more difficult to work with as you scale up, but it is doable.
In the next version of InfluxDB (0.9.0), you can encode metadata as tags and it gets converted to a single id.
With either of those schemes you should see much better numbers on storage.
My understanding is that InfluxDB will work for your use case, as it's all push.
> I see the pushgateway, but it seems deliberately not a centralized storage.
Yeah, the primary use case for the pushgateway is service-level metrics for batch jobs.
May I ask why accessing your producers is a problem? I know for Storm I was considering writing a plugin that'd hook into its metrics system and push to the pushgateway, which Prometheus would then scrape. Not my preferred way of doing things, but some distributed processing systems need that approach when work assignment is dynamic and opaque across slaves.
A pull model, while technically distributed, is organizationally centralized: I have to get each node's owner to grant me direct access. Politically, and for security reasons, that's not going to happen.
http://www.boxever.com/push-vs-pull-for-monitoring looks at other bits of the push vs. pull question.
From my experience with distributed systems, push replication will get you in trouble very soon.
Edit: I was too quick to post this question. An obvious scenario is where the client is behind a firewall. Never mind me. I am an idiot.
> which organizes sample data in chunks of constant size (1024 bytes payload). These chunks are then stored on disk in one file per time series.
That is concerning, is this going to have the same problem with disk IO that graphite does? i.e. Every metric update requires a disk IO due to this one file per metric structure.
Chunks are only written to that one file per time series once they are complete. Depending on their compression behavior, they will contain at least 64 samples, but usually a couple of hundred. Even then, chunks are just queued for disk persistence. The storage layer operates completely from RAM and only has to swap chunks back in if they were previously evicted from memory.
Obviously, if you consistently create more sample data than your disk can write, the persist queue will back up at some point and Prometheus will throttle ingestion.
Combine that with the Cacti pull model, and I think a wait-and-see attitude is best here for now.
> file systems are way better in managing the data
Except they're not managing data, they're just separating tables, to extend the DB metaphor. And you still run the chance of running out of inodes on a "modern" file system like ext4.
After having briefly dug into the code, I'm particularly worried about the fact that instead of minimizing iops by only writing the relevant changes to the same file, Prometheus is constantly copying data from one file to the next, both to perform your checkpoints and to invalidate old data. That's a lot of iops for such basic (and frequently repeated) tasks.
Still in wait-and-see mode.
RRD Tool: expects samples to come in at regular intervals and old samples to be overwritten by new ones at predictable periods. It's great because you can derive the file position of a sample from its timestamp, but in Prometheus samples can have arbitrary timestamps and gaps between them, and time series can grow arbitrarily large (depending on the currently configured retention period), so our data format needs to support that.
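That fixed-interval addressing is easy to sketch; the constants here are illustrative, not RRDtool's actual on-disk layout:

```python
def sample_offset(timestamp, step, slots, header=64, sample_size=8):
    """Byte offset of the sample for `timestamp` in a fixed-step,
    circular file: new samples overwrite the oldest slot, so the
    file never grows and every write is a seek plus one small IO."""
    slot = (timestamp // step) % slots
    return header + slot * sample_size

# With a 60s step and 100 slots, consecutive periods land in adjacent
# slots, and the file wraps after slots * step seconds.
print(sample_offset(0, 60, 100))
print(sample_offset(60, 60, 100))
print(sample_offset(6000, 60, 100))  # wrapped back to the first slot
```

This is exactly what breaks down once samples can arrive at arbitrary times: without a fixed step there is no closed-form mapping from timestamp to offset.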
InnoDB: not sure how this works internally, but given it's usually used in MySQL, does it work well for time series data? I.e. millions of time series that each get frequent appends?
KairosDB: depends on Cassandra, AFAICS. One main design goal was to not depend on complex distributed storage for immediate fault detection, etc.
InfluxDB: looks great, but has an incompatible data model. See http://prometheus.io/docs/introduction/comparison/#prometheu...
I guess a central question that touches on your iops one is: you always end up having a two-dimensional layout on disk: timeseries X time. I don't really see a way to both store and retrieve data in such a way that you can arbitrarily select a time range and time series without incurring a lot of iops either on read or write.
It's a key/value store at its heart, with all the ACID magic and memory buffering built in.
Almost any KV store would perform relatively well at time series data simply by issuing updates to overwrite old data instead of constantly deleting old data (assuming the KV store is efficient in its updates).
Issuing updates instead of deletes is possible because you know the storage duration and interval, and can thus easily identify an index at which to store the data.
[time series fingerprint : time range] -> [chunk of ts/value samples]
At least this scheme performed way worse than what we currently have. You could say that file systems also come with pretty good memory buffering and can act as key-value stores (with the file name being the key and the contents the value), except that they also allow efficient appends to values.
> Issuing updates instead of deletes is possible because you know the storage duration and interval, and can thus easily identify an index at which to store the data.
Do you mean you would actually append/update an existing value in the KV-store (which most KV stores don't allow without reading/writing the whole key)?
EDIT: found https://github.com/influxdb/influxdb/pull/1059. Ok, does seem like tags mean key=value pairs.
Invalidation of old data is super easy with the chunked files as you can simply "truncate from the beginning", which is supported by various file systems (e.g. XFS). However, benchmarks so far showed that the relative amount of IO for purging old chunks is so small compared to overall IO that we didn't bother to implement it yet. Could be done if it turns out to be critical.
> Prometheus collects metrics from monitored targets by scraping metrics HTTP endpoints on these targets.
I wonder if we'll see some plugins that allow data collection via snmp or nagios monitoring scripts or so. That would make it much easier to switch large existing monitoring systems over to prometheus.
Just last night I wrote https://github.com/prometheus/collectd_exporter, which you could do SNMP with. I do plan on writing a purpose-designed SNMP exporter in the next few months to monitor my home network, if someone else doesn't get there first.
Not that I'd do a better job, but every time I further configure our monitoring system, I get that feeling that we're missing something as an industry. It's a space with lots of tools that feel too big or too small; only graphite feels like it's doing one job fairly well.
Alerting is the worst of it. Nagios and all the other alerting solutions I've played with feel just a bit off. They're either doing too much or carve out a role at boundaries that aren't quite right. This results in other systems wanting to do alerting, making it tough to compare tools.
As an example, Prometheus has an alert manager under development: https://github.com/prometheus/alertmanager. Why isn't doing a great job at graphing enough of a goal? Is it a problem with the alerting tools, or is it a problem with boundaries between alerting, graphing, and notifications?
I think you've hit the nail on the head: every monitoring system has to do a bit of everything (alerting, machine monitoring, graphing) rather than focusing on just doing one thing and doing it well.
Prometheus comes with a powerful data model and query language, few existing systems support the same (e.g. Atlas and Riemann have the notion of a query language and labels) so we have to do a bit of everything to produce a coherent system.
I think a separate common alert manager would be good if we could combine efforts. Currently, in practice, that role isn't really being filled: you rely on de-duplication in a tool such as PagerDuty, and support for silencing alerts isn't unified.
It also now supports Logstash and Graphite as backends as well. The Graphite support is thanks to work at Vimeo.
Another nice thing about Bosun is you can test your alerts against time series history to see when they would have triggered so you can largely tune them before you commit them to production.
Interesting work on Bosun, by the way! It seems like there is quite some overlap with Prometheus, but I have yet to study it in depth. Is my impression correct that OpenTSDB is a requirement, or is there a local storage component? I guess you could run OpenTSDB colocated on a single node...
I need to take a closer look at stealing ideas from your tool :-) We are both leveraging Go templates (Bosun uses it for Alert notifications, but I've thought about using it to create dashboards as well).
client app -> Prometheus server -> alerting rules -> alert handling
This allows e.g. initial dimensional labels from the clients to cleanly propagate all the way into alerts, and allows you to silence, aggregate, or route alerts by any label combination.
If OpenTSDB-style dimensional metrics would have been more of a standard already, maybe that would be different.
For example, in response to the latest security hullabaloo, we recently added a Nagios check for libc being too old, and for running processes that are linked to a libc older than the one on disk. This isn't a time series check; it's just an alert condition. Polling for that alert condition isn't what Prometheus is going to be good at, so cramming it into that alert pipeline is awkward. If we used something like Prometheus, we'd have rules in two places already.
Polling prometheus for saved queries and simply alerting on a threshold might result in a simpler system. The rules that could be expressed in an alert pipeline could be expressed as different queries, and all rules about thresholds, alerting or notifications could be done further down the pipeline. There's still some logic about what you're measuring in Prometheus, but not about what thresholds are alerts, or when, or for how long, or who cares. Threshold rules could all live in one place.
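A minimal sketch of that separation, assuming a hypothetical HTTP endpoint that returns the current value of a saved query as JSON (the URL and response shape here are illustrative, not Prometheus's actual API):

```python
import json
import urllib.request

# The TSDB only evaluates saved queries; this thin external poller is
# where all threshold/alerting rules live, in one place.
RULES = [
    # (query URL, threshold): alert when the value exceeds the threshold
    ("http://tsdb.example/query?q=error_rate", 0.05),
]

def breached(value, threshold):
    """Pure threshold rule, kept separate from measurement."""
    return value > threshold

def poll(rules):
    alerts = []
    for url, threshold in rules:
        with urllib.request.urlopen(url) as resp:
            value = float(json.load(resp)["value"])
        if breached(value, threshold):
            alerts.append(url)
    return alerts
```

The point isn't the code so much as the boundary: the measuring system knows nothing about who cares or when, and all of that lives downstream.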
This isn't a knock on Prometheus, which of course supports a workflow along those lines, and seems interesting. Nor is it a knock on the prometheus alert system. I just wonder why software in the monitoring/notification/alerting space always ends up muddy. I believe the overlapping boundaries between projects are a symptom, not a cause. I keep asking myself if there's some fundamental complexity that I'm not yet appreciating.
I think you're conflating instrumentation and alerting.
Many systems only offer alert conditions on very low-level data that's tightly bound to the instrumentation (e.g. a single check on a single host). Prometheus is more powerful, as the instrumentation and alerting are separate, with dimensions adding further power.
Prometheus has many instrumentation options. For your example I'd suggest a cronjob that outputs a file to the textfile node_exporter module (echo old_libc_process $COUNT > /configured/path/oldlibc.prom).
You can then set up an alert for when the value is greater than 0 on any host (ALERT OldLibcAlert IF old_libc_process > 0 WITH ...). The advantage of having all the dimensions is that you can analyse that over time, and graph how many hosts had a problem to trend how well you're dealing with the libc issue.
A big win for having dimensions for alerting all the way up the stack is in silencing, notification and throttling of related alerts. You can slice and dice and have things as granular as makes sense for your use case.
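Fleshed out slightly, that cronjob might look like the sketch below; the output directory and the detection logic are stand-ins, not real paths or checks:

```shell
#!/bin/sh
# Sketch of a cron job feeding the node_exporter textfile module.
# OUTDIR stands in for whatever directory the module is configured to
# read *.prom files from; the detection step is left as a stub.
OUTDIR="${OUTDIR:-$(mktemp -d)}"
COUNT=0   # stand-in: a real check would count processes on an old libc
TMP="$(mktemp)"
echo "old_libc_process $COUNT" > "$TMP"
mv "$TMP" "$OUTDIR/oldlibc.prom"   # atomic rename: scrapes never see a partial file
cat "$OUTDIR/oldlibc.prom"
```

Writing to a temp file and renaming matters here: the exporter can scrape at any moment, and rename is atomic where a plain redirect into the final path is not.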
> Polling prometheus for saved queries and simply alerting on a threshold might result in a simpler system. The rules that could be expressed in an alert pipeline could be expressed as different queries, and all rules about thresholds, alerting or notifications could be done further down the pipeline.
There is an alert bridge to nagios if you want to send your alerts out that way. You'll lose some of the rich data that the dimensions provide though.
A bit, because I'm hesitant to cram into a time series database data that I consider boolean. I'd move past this for consistency if the checks were easy to create, though.
> Prometheus has many instrumentation options. For your example I'd suggest a cronjob that outputs a file to the textfile node_exporter module (echo old_libc_process $COUNT > /configured/path/oldlibc.prom).
Interesting. That architecture or configuration of this exporter isn't documented yet, but if it just magically sent numbers upstream from text files that I keep up to date, that might be worth a migration by itself. Nagios SSH checks are ridiculous to manage.
Thanks for indulging my conversation in any case. I'll put this on my radar and watch where it goes.
A boolean can be treated as a 1 or a 0. Sometimes you can even go richer than that, such as with a count of affected processes from your example - which you could convert back to a 1/0 if you wanted in the expression language.
> That architecture or configuration of this exporter isn't documented yet
The node_exporter as a whole still needs docs. That module is pretty new, and it's not exactly obvious what it does or why it exists.
Labels/dimensions are supported too, it's the same text format that Prometheus uses elsewhere ( http://prometheus.io/docs/instrumenting/exposition_formats/)
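For example, a *.prom file (or any scrape target) in that text format might contain something like this; the metric name and label are illustrative:

```
# HELP old_libc_process Processes linked against an outdated libc.
# TYPE old_libc_process gauge
old_libc_process{check="running_procs"} 3
```

The HELP/TYPE lines are optional but let Prometheus and its tooling know how to treat the metric.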
> if just magically sent numbers upstream from text files that I keep up to date
Pretty much, you'll also get the rest of the module of the node exporter (cpu, ram, disk, network) and associated consoles which http://www.boxever.com/monitoring-your-machines-with-prometh... describes how to setup.
I could see Riemann being used as an alert manager on top of Prometheus, handling all the logic around de-duping of alerts and notification. Prometheus's own alert manager is considered experimental.
I've been experimenting with metrics collection using heka (node) -> amqp -> heka (aggregator) -> influxdb -> grafana. It works extremely well and scales nicely but requires writing lua code for anomaly detection and alerts – good or bad depending on your preference.
I highly recommend considering Heka for shipping logs to both ElasticSearch and InfluxDB if you need more scale and flexibility than Prometheus currently provides.
From experience of similar systems at massive scale, I expect no scaling problems with pulling in and of itself. Indeed, there are some tactical operational options you get with pull that you don't have with push. See http://www.boxever.com/push-vs-pull-for-monitoring for my general thoughts on the issue.
InfluxDB seems best suited for event logging rather than systems monitoring. See also http://prometheus.io/docs/introduction/comparison/#prometheu...
Agreed that InfluxDB is suited for event logging out of the box, but the March 2014 comparison of Influx is outdated IMO.
I'm using Heka to send numeric time series data to Influx and full logs to ElasticSearch. It's possible to send full logs to non-clustered Influx in 0.8, but it's useful to split out concerns to different backends.
I also like that Influx 0.9 dropped LevelDB support for BoltDB. There will be more opportunity for performance enhancements.
However, if the data model didn't change fundamentally (the fundamental InfluxDB record being a row containing full key/value metadata vs. Prometheus only appending a single timestamp/value sample pair for an existing time series whose metadata is only stored and indexed once), I wouldn't expect the outcome to be qualitatively different except that the exact storage blowup factor will vary.
Interesting to hear that InfluxDB is using BoltDB now. I benchmarked BoltDB against LevelDB and other local key-value stores around a year ago, and for a use case of inserting millions of small keys, it took 10 minutes as opposed to LevelDB taking a couple of seconds (probably due to write-ahead-log etc.). So BoltDB was a definite "no" for storing the Prometheus indexes. Also it seems that the single file in which BoltDB stores its database never shrinks again when removing data from it (even if you delete all the keys). That would also be bad for the Prometheus time series indexing case.
Basically, when the new version comes out, all new comparisons will need to be done because it's changing drastically.
I think we expected that, feel free to add comments on the doc for things that are different now.
Yes, instrument everything. See http://prometheus.io/docs/practices/instrumentation/#how-to-...
> I worry that I would have lots of metrics to backup my wrong conclusions.
This is not so much a problem with time series as a question of epistemology. Well-chosen consoles will help your initial analysis, and after that it's down to correct application of the scientific method.
> I also worry that so much irrelevant data would drown out the relevant stuff
I've seen many attempts by smart people to try and do automatic correlation of time series to aid debugging. It's never gotten out of the toy stage, as there is too much noise. You need to understand your metrics in order to use them.
Some kind of SQL backend is a dependency for now, however.
I currently use a combination of sensu/graphite/grafana, which allows a lot of flexibility (albeit with some initial wrangling during setup).
Of course a piecemeal solution is more flexible, but as you said, configuration can be a beast, so many people prefer monolithic systems.
The tech does look awesome though!
Currently you can manually vertically shard, and in future we may have support for some horizontal sharding for when the targets of a given job are too many to be handled by a single server. You should only hit this when you get into thousands of targets.
Our roadmap includes hierarchical federation to support this use case.
> You mention it's compatible with TSDB, why did you choose to implement your own backend, or is this a fork of TSDB?
Prometheus isn't based on OpenTSDB, though it has the same data model. We've a comparison in the docs. The core difference is that OpenTSDB is only a database, it doesn't offer a query language, graphing, client libraries and integration with other systems.
We plan to offer OpenTSDB as a long-term storage backend for Prometheus.
After playing around with Prometheus for a day or so, I’m convinced I need to switch to Prometheus :). The query language is so much better than what InfluxDB and others provide.
Thanks, that's awesome to hear! Feel free to also join us on #prometheus on freenode or our mailing list: https://groups.google.com/forum/#!forum/prometheus-developer...
I suppose that there's a simple service that we need to deploy on each server?
Any tips on this use case?
> For machine monitoring Prometheus offers the Node exporter
Is it possible for the frontend to utilize data from the cron-invoked sar/sadc that already covers much of this data?
As an aside, if you have machine-level cronjobs you want to expose metrics from you can use the textfile module of the node_exporter, which reads in data from *.prom files in the same format as accepted by the Pushgateway.
If you want to help out, it's up at https://github.com/brian-brazil/client_python