Where is the PMC support? At Facebook a few days ago, they said their number one issue was memory bandwidth. Try analyzing that without PMCs. You can't. And that's their number one issue. And it shouldn't be a surprise that you need PMC access to have a decent Linux monitoring/analysis tool. If that's a surprise to you, you're creating a tool without actual performance expertise.
It should front BPF tracing as well... Maybe it will in the future, and I can check again then.
Second, it's great that you want a car instead of a faster horse, but I want 10 horses. Why?
- 1 horse trained for 1 job is easier & more reliable than 1 horse/car trained for 10
- 10 horses more flexible than 1 horse or 1 car
- if a horse isn't doing its job, shoot it and replace it
- don't need to go to horse school to learn how to use or maintain one
- don't need to hire an engineer to teach horse new trick
- most problems can be solved with a horse
If instead of focusing on the transportation we were focusing on the road, we could build a foundation for horses and cars to coexist. I'd love to see more protocols and specifications for each kind of monitor (other than syslog, RRD & SNMP) so they could simply be made according to spec and we could mix and match as we wished.
For this tool I wonder what the observer effect is.
- PMCs: Vector's backend is pcp, which has a Linux perf_events PMDA for PMC support. There are no PMCs enabled by default in Vector, since we're using it in an environment without PMC access. But I've been working heavily in this area. More on that later.
- Flame graph support: we already have it and use it. Ongoing work includes new flame graph types, and a rewritten flame graph implementation in d3. Need to get it all published.
- Heat map support: just solved an issue with them; again, an area where we have ad hoc tools that work and bring value, but haven't wrapped it all in Vector yet.
- BPF support: I've been prototyping many new metrics that we want in Vector, https://github.com/iovisor/bcc#tools. There'll be increased demand to get Vector accessing these in a couple of months or so, when we have newer kernels in production that have BPF.
Other than that it looks AWESOME.
Fiji Islanders: 0.8% vs Arab: 10%
So poor design is racist :)
EDIT: The article says "Arabs (Druzes)". So, if they're specifying the Druze ethnicity, does that have any implications for Arabs in general?
Notable features that I would need all relate to multi-server usage:
- central config across hosts
- alerting when values go over or under thresholds
- a mode for automatically selecting and viewing the machines which are working hardest, or not working
- a mode for viewing of a few stats across all machines
- a mode for slide-show viewing of a few stats across all machines
This seems super useful for zooming in on some performance characteristics of a single server to debug issues!
For historical "monitoring", alerts, I still need my NewRelic / ELK kind of tools.
With netdata, there is no need to centralize anything for performance monitoring: you view everything directly at its source. Still, with netdata you can build dashboards with charts from any number of servers, and these charts will be connected to each other just like the ones that come from a single server.
So it seems like this is on purpose. And if I understand correctly, I can just build a custom HTML dashboard that connects to multiple machines?
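If that reading is right, the data-fetching side of such a dashboard could be sketched roughly like this (assuming netdata's default port 19999 and its `/api/v1/data` endpoint; the host names are hypothetical):

```python
# Sketch: pull the same chart from several netdata hosts, the way a
# custom multi-machine dashboard would. Assumes netdata's default port
# (19999) and its /api/v1/data REST endpoint; hosts are hypothetical.
from urllib.parse import urlencode

def chart_url(host, chart, after=-60, port=19999):
    """URL for the last `after` seconds of `chart` on `host`."""
    query = urlencode({"chart": chart, "after": after, "format": "json"})
    return f"http://{host}:{port}/api/v1/data?{query}"

hosts = ["web1.example.com", "web2.example.com"]
for h in hosts:
    print(chart_url(h, "system.cpu"))
```

A dashboard would then fetch each URL (e.g. with `urllib.request`) and render the series side by side.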
Case in point: https://github.com/Shopify/dashing
Here's the presumed knowledge:
It's plain to see why outfits like Splunk can get away with charging as much as they do - visualizing metrics in that app is as simple as installing a deb package, logging in, and pointing it at your data source.
Setting up Dashing by hand is comparatively ...difficult.
There's a much more approachable all-JS Riemann clone called Godot that I've successfully used on multiple projects, but it still requires some work to make the frontend look good.
> I believe the future of data collectors is node.js
In any event, we're talking about a telemetry collector, which is generally a trivial piece of code that simply doesn't need whatever benefits JS purports to provide.
Why don't you like it? What's wrong with it? What would be better?
That's not what I see.
c.f. the disaster with npm a few days ago.
These problems are particularly pernicious for large scale apps - which this isn't, currently, but probably aspires to be one day.
Any strongly typed language would be better (python, go, ruby, etc.).
I'd prefer something like Lua (and C, of course), or even Python (which is installed on most systems anyway, but still too heavy).
Consider an application that spikes (close to 100% on a core) for 2-3s on some web requests -- let's assume this is normal (nothing can be done about it). Now, let's say the average user of the system is idle for 2 minutes per web request. So, users won't see performance degradation unless $(active-users) > $(cores) during the same 2-3 second window.
For most monitoring systems, CPU is reported as an average over a minute, and, even if it's pinned only 2-3s per 60s, that's only 5% usage. Presume a 2 CPU system with 5 users, who all happen to be in a conference call... and hitting the system at exactly the same time (but are otherwise mostly idle). The CPU graph might show 10-15% usage (no flag). Yet, those 5 users will report significant application performance issues (one of the users will have to wait 6-9s).
What I'd like to monitor, as a system administrator, is the 95th-percentile utilization of the CPUs -- that is, over the minute, throw away the bottom 94% (mostly idle cycles) and report the CPU utilization of the next-highest percentile. This should show me those pesky CPU spikes. Does anything do that?
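The idea can be sketched in a few lines (synthetic per-second samples; a real collector would read /proc/stat once per second). Note that a 3 s spike is only 5% of a 60 s window, so the 95th percentile sits exactly on the boundary -- you need p96 or higher to reliably catch it:

```python
# Sketch: average vs. high-percentile CPU utilization over a 60 s window.
# Samples are hypothetical per-second utilization percentages.
import math

def percentile_utilization(samples, pct):
    """Nearest-rank percentile of utilization samples."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))  # 1-indexed rank
    return ordered[rank - 1]

# 60 one-second samples: a 3 s spike at 100%, otherwise idle.
samples = [100.0] * 3 + [0.0] * 57

print(sum(samples) / len(samples))           # 5.0  -- the average hides it
print(percentile_utilization(samples, 50))   # 0.0
print(percentile_utilization(samples, 99))   # 100.0 -- the spike shows up
```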
I think what you want for your application, if I understand any of it correctly, is the time spent by a runnable process waiting to get scheduled on a CPU. That would indicate contention. If a process runs immediately when it becomes runnable, then there's no contention, no matter what the windowed utilization looks like.
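On Linux, that wait is exposed directly: /proc/&lt;pid&gt;/schedstat holds three fields -- time on CPU (ns), time spent runnable but waiting on a runqueue (ns), and timeslices run. A minimal sketch of reading it (the sample line below is hypothetical):

```python
# Sketch: per-task scheduler delay from /proc/<pid>/schedstat, whose
# three fields are: time on CPU (ns), runqueue wait time (ns),
# timeslices run. The second field is the contention signal.

def parse_schedstat(text):
    """Return (cpu_ns, wait_ns, timeslices) from a schedstat line."""
    on_cpu, run_delay, slices = text.split()
    return int(on_cpu), int(run_delay), int(slices)

# Hypothetical contents of /proc/1234/schedstat:
sample = "123456789 4567890 321\n"
cpu_ns, wait_ns, _ = parse_schedstat(sample)
ratio = wait_ns / (cpu_ns + wait_ns)
print(f"waited {wait_ns} ns ({ratio:.1%} of runnable time)")
```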
There's a reason ISPs bill at the 95th percentile -- it's directly correlated with the cost of over-subscribed resources. If you're running a VM cluster, you've got a similar issue, only with CPUs (and memory) as well as bandwidth. Most operational divisions don't have the luxury of fixing a vendor application. Instead, they have limited variables to play with: CPU cores and memory being the coarse levers. While one could argue this is an "application" issue, and you'd be correct, it's irrelevant to my question.
I'm asking what tools are available for system administrators to diagnose and address CPU starvation (under spiked usage). Current tools and techniques I'm aware of don't seem to measure this.
That being said, historic CPU utilization is a useful metric for capacity planning.
I think your concern about measuring CPU utilization is real, though. You can use frequent sampling and present samples on a heat map to deal with this problem. There are some examples here:
On my RPi B, the daemon eats 4% average on all four cores, with almost all the time spent in the kernel. I assume polling the various entries under /proc/ is costly.
But I wish it were a Riemann/Graphite/whatever dashboard instead of reimplementing its own data collection system.
There is a need for great dashboards, but I don't feel any need for yet another format of data collection plugins.
But also weird. The fact that the collectors, the storage, and the UI all run on each box makes this more like a small-scale replacement for top and assorted command-line tools such as iostat than for a scalable, distributed monitoring system. Lack of central collection means you cannot get a cluster-wide view of a given metric, nor can you easily build alerting on top of it.
I'm also disappointed that it reimplements a lot of collectors that already exist in mature projects like Collectd and Diamond (and, more recently, Influx's Telegraf). I understand that fewer external dependencies can be useful, but still, does every monitoring tool really need to write its own collector for reading CPU usage? You'd think there would be some standardization by now.
For comparison, we use Prometheus + Grafana + a lot of custom metrics collectors. Grafana is less than stellar, though. I'd love to have this UI on top of Prometheus.
Nice open source solution (everything open source, not "open core") with the possibility of paying a reasonable (my words) amount for support.
Source: Use it at work, unsupported (because we are small and it just works without support at our current scale). Otherwise unaffiliated.
After some time using Graphite + StatsD and friends, I came to really appreciate the benefits of widely adopted open source components and the flexibility they give over all-in-one solutions such as this. On the other hand, solutions like this are much easier to configure, especially the first time, when you are not yet familiar with the tools.
I have less of a realtime system review need than a post-mortem need. Today, I'll use kSar to do that, but this tool looks much more capable.
It's too bad that it doesn't provide an init script or other startup feature. The installer, while it doesn't seem to follow typical distribution patterns, is otherwise fairly complete.
Init scripts are in the system directory.
I have already decoupled the code for handling the chart comments, but it needs some more work to remove it from the dashboard HTML and put it in separate JSON files.
Edit: the code that adds an entry to the dictionary releases its lock between the membership check and the insert, so you can wind up with duplicate NV pairs.
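The shape of that race (hypothetical code, not the project's actual source -- netdata is written in C, but the check-then-act pattern is the same in any language) and the fix of holding the lock across both steps:

```python
# Sketch of the check-then-act race: if the lock is released between the
# membership check and the insert, two threads can both pass the check
# and both append, yielding duplicate name/value pairs.
import threading

class NVStore:
    def __init__(self):
        self.pairs = []
        self.lock = threading.Lock()

    def add_racy(self, name, value):
        # BUG pattern: lock dropped between check and insert.
        with self.lock:
            present = any(n == name for n, _ in self.pairs)
        # <-- another thread may insert `name` right here
        if not present:
            with self.lock:
                self.pairs.append((name, value))

    def add_safe(self, name, value):
        # Fix: hold the lock across both the check and the insert.
        with self.lock:
            if not any(n == name for n, _ in self.pairs):
                self.pairs.append((name, value))
```

With `add_safe`, concurrent inserts of the same name can never produce a duplicate, because no other thread can interleave between the check and the append.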
Why does "found a bug" equate to an immediate "ugh"? Is the expectation that projects submitted here are perfect?
It's also true that pretty much every piece of monitoring software I encounter makes me sad when I look under the hood. And the proprietary ones that I've seen under the hood of . . . hoo boy.
Also, it would be nice if this could be run over multiple machines and show combined results.
Further, it appears that this tool shows information that other tools currently do not. It would also be nice if this tool allowed scripting and/or offered a CLI.
Regarding scripting, please open a github issue to discuss it. I really like the idea.
For folks wanting multiple servers, the wiki does mention that, I believe: https://github.com/firehol/netdata/wiki#how-it-works
Also, running plugins can increase CPU usage.
Also looks like you can turn off plugins you don't want/need: https://github.com/firehol/netdata/wiki/Configuration