
Netdata – Linux performance monitoring, done right - cujanovic
https://github.com/firehol/netdata
======
brendangregg
Looks like another faster horse. A pretty GUI on /proc is not the most burning
issue to solve in Linux performance monitoring. I wish anyone making these
tools would spend 30 minutes watching my Monitorama talk about instance
monitoring requirements at Netflix:
[http://www.brendangregg.com/blog/2015-06-23/netflix-instance...](http://www.brendangregg.com/blog/2015-06-23/netflix-instance-analysis-requirements.html). I still hate gauges.

Where is the PMC support? At Facebook a few days ago, they said their number
one issue was memory bandwidth. Try analyzing that without PMCs. You can't.
And that's their number one issue. And it shouldn't be a surprise that you
need PMC access to have a decent Linux monitoring/analysis tool. If that's a
surprise to you, you're creating a tool without actual performance expertise.

It should front BPF tracing as well... Maybe it will in the future, and I can
check again.

~~~
zokier
Just curious; how closely are you involved in Vector development? Do you think
Vector tackles the issues you mention?

~~~
brendangregg
Yes, I'm involved, and no, I don't think the current public release of Vector
tackles many of the issues yet. But we've been working on them and will have
it released when we can. Items include:

\- PMCs: Vector's backend is pcp, which has a Linux perf_events pmda for PMC
support. There are no PMC counters by default in Vector since we're using it in
an environment without PMC access. But I've been working heavily in this area.
More on that later.

\- Flame graph support: we already have it and use it. Ongoing work includes
new flame graph types, and a rewritten flame graph implementation in d3. Need
to get it all published.

\- Heat map support: just solved an issue with them; again, an area where we
have ad hoc tools that work and bring value, but haven't wrapped it all in
Vector yet.

\- BPF support: I've been prototyping many new metrics that we want in Vector,
[https://github.com/iovisor/bcc#tools](https://github.com/iovisor/bcc#tools).
There'll be increased demand to get Vector accessing these in a couple of
months or so, when we have newer kernels in production that have BPF.

------
sputr
Don't use the red-green combination in charts, as it makes them really hard to
read for those of us with a degree of red-green color blindness (the most
common type among the ~5% of the male and ~1% of the female population that
has it).

Other than that it looks AWESOME.

~~~
woodman
The female rate is closer to 0.5%; the male rate varies wildly based on race
[0]:

Fiji Islanders: 0.8% vs Arab: 10%

So poor design is racist :)

[0]
[https://en.wikipedia.org/wiki/Color_blindness#Frequency_of_r...](https://en.wikipedia.org/wiki/Color_blindness#Frequency_of_red-green_color_blindness_in_males_of_various_populations)

~~~
zymhan
Wait, I'm more likely to be color blind because of my ethnicity? Well that's
interesting...

EDIT: The article says "Arabs (Druzes)". So, if they're specifying the Druze
ethnicity, does that have any implications for Arabs in general?

~~~
mjevans
They probably don't know. It's very likely that a subsample of that ethnicity
had good medical data, so conclusions can be drawn only for that subsample.
It would, however, indicate an area for further study, should reliable sample
sizes of other 'near' populations become available.

------
dsr_
Well, it's pretty. It's probably great if you have one to five machines you
care about, or you really want a pretty dashboard.

Notable features that I would need all relate to multi-server usage:

\- central config across hosts

\- alerting when values go over or under thresholds

\- a mode for automatically selecting and viewing the machines which are
working hardest, or not working

\- a mode for viewing of a few stats across all machines

\- a mode for slide-show viewing of a few stats across all machines

~~~
izacus
Yes, for those of us who have one to five machines this is awesome, because
most other monitoring solutions are really annoying to deploy, since they
presume more than five machines :)

~~~
lazylizard
How about amon.cx or mmonit? Or, if there's no need to self-host, newrelic?

------
sagichmal
[https://github.com/firehol/netdata/wiki/Installation#nodejs](https://github.com/firehol/netdata/wiki/Installation#nodejs)

> I believe the future of data collectors is node.js

:(

~~~
illumin8
I'm sure most of us can assume reasons why collecting data with node.js might
be "wrong," but it would be more helpful to the conversation if you spelled
out specific reasons why node.js is not optimal for this use case, instead of
just commenting with a single emoticon.

~~~
otterley
Because it's just not necessary, and meanwhile most experienced SREs and
performance engineers are imperative programmers. We don't see the benefits of
using JavaScript on our servers, and that's rather putting it mildly.

Also, JavaScript doesn't have a 64-bit integer type, which is absolutely
necessary to properly support large counters. There are workarounds, but the
fact that you need one is ridiculous.
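For context: JavaScript's only numeric type at the time was the IEEE-754 double, which represents integers exactly only up to 2^53. Python floats are the same 64-bit doubles, so the counter problem can be sketched there (a minimal illustration, not netdata code):

```python
# An IEEE-754 double (JavaScript's sole number type then, and Python's
# float) has a 53-bit significand: integers above 2**53 are no longer
# exactly representable.
counter = 2 ** 53                             # last exactly representable integer
assert float(counter + 1) == float(counter)   # the increment is silently lost

# Below the limit, every integer is still exact.
assert float(counter - 1) != float(counter - 2)

# A 64-bit counter can exceed this limit by a factor of 2**11, so silent
# precision loss is a real concern for large monotonic counters.
assert 2 ** 64 // 2 ** 53 == 2 ** 11
```

So a 64-bit byte or packet counter read into a plain JS number can stop incrementing visibly long before it wraps.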

Many of us believe that the only reason one would want to use JavaScript is
because you're doing browser programming and therefore you have no other
choice.

~~~
NietTim
None of what you just said was an actual argument against node.js, though. All
you said was "Well, we don't like it."

Why don't you like it? What's wrong with it? What would be better?

~~~
crdoconnor
It's a weakly typed language, and it exhibits all kinds of bizarre,
not-very-well-thought-out behaviors.

cf. the disaster with npm a few days ago.

These problems are particularly pernicious for large scale apps - which this
isn't, currently, but probably aspires to be one day.

Any strongly typed language would be better (python, go, ruby, etc.).

~~~
geofft
Can you explain how the npm incident is related to typing? I am an advocate of
strong typing and love seeing it be the answer to all sorts of problems (e.g.
concurrency), but I'm not understanding the connection here.

~~~
crdoconnor
That wasn't related to js's weak typing and I didn't say that it was, but it's
indicative of the types of other problems that will occur.

------
clarkevans
Is there such a thing as 95% threshold CPU monitoring?

Consider an application that spikes (close to 100% on a core) for 2-3s on some
web requests -- let's assume this is normal (nothing can be done about it).
Now, let's say the average user of the system is idle for 2 minutes per web
request. So, users won't see performance degradation unless $(active-users) >
$(cores) during a 2-3 second window.

For most monitoring systems, CPU is reported as an average over a minute, and,
even if it's pinned only 2-3s per 60s, that's only 5% usage. Presume a 2 CPU
system with 5 users, who all happen to be in a conference call... and hitting
the system at exactly the same time (but are otherwise mostly idle). The CPU
graph might show 10-15% usage (no flag). Yet, those 5 users will report
significant application performance issues (one of the users will have to wait
6-9s).

What I'd like to monitor, as a system administrator, is the 95th-percentile
utilization of the CPUs -- that is, over the minute, throw away the bottom 94%
of samples (mostly idle cycles) and report to me the CPU utilization of the
next highest percentile. This should show me those pesky CPU spikes. Does
anything do that?
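Nothing stops you from computing this yourself from per-second samples. A minimal sketch of the idea (the nearest-rank method and the sample values are illustrative, not tied to any particular tool):

```python
def percentile(samples, p):
    """Nearest-rank percentile: the value at or below which p% of samples fall."""
    ordered = sorted(samples)
    rank = max(1, -(-len(ordered) * p // 100))  # ceil(n * p / 100)
    return ordered[rank - 1]

# One minute of per-second CPU-utilization samples: mostly idle, with a
# 4-second spike toward 100% -- the pattern described above.
minute = [5.0] * 56 + [97.0, 98.0, 99.0, 100.0]

mean = sum(minute) / len(minute)  # ~11.2% -- the spike vanishes
p95 = percentile(minute, 95)      # 97.0  -- the spike is visible
assert mean < 15.0
assert p95 >= 97.0
```

Reporting the 95th-percentile sample per minute, rather than the mean, would flag exactly the spiky minutes that averaging smooths away.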

~~~
barrkel
You're more interested in response time than CPU usage; the customer is having
a bad time whether the slow responses are due to CPU or IO or bad weather.

~~~
clarkevans
That's another (important) measure, but not the one I'm interested in. In the
very short run, I can add CPUs if there is hardware contention. Response time
is an indicator of, but doesn't prove, CPU starvation. Request response time
analysis is a longer-term quality-of-service measure, used to identify and fix
the application (if that is even feasible).

...

There's a reason why ISPs bill at the 95th percentile -- it's directly
correlated with the cost of over-subscribed resources. If you're running a VM
cluster, you've got a similar issue, only with CPUs (and memory) rather than
bandwidth. Most operational divisions don't have the luxury of fixing a vendor
application. Instead, they have limited variables to play with, CPU cores and
memory being the coarse levers. One could argue this is an "application"
issue, and you'd be correct, but it's irrelevant to my question.

I'm asking what tools are available for system administrators to diagnose and
address CPU starvation (under spiked usage). Current tools and techniques I'm
aware of don't seem to measure this.

~~~
wang_li
The only metric that matters is the measure of the business task at hand. You
might dig into CPU utilization after you've identified a problem with your
application, but trying to identify an application problem by measuring CPU is
like trying to determine where your shipment is by looking at engine RPM of
the freight vehicle.

That being said, historic CPU utilization is a useful metric for capacity
planning.

------
oxplot
Gave it a try. Definitely not useful for running the daemon and viewing the UI
on the same machine. Chrome, at least, eats 50% of one of the cores to show
the realtime data.

On my RPi B, the daemon eats 4% average on all four cores, with almost all the
time spent in the kernel. I assume polling the various entries under /proc/ is
costly.

------
Wilya
The dashboard is gorgeous, one of the prettiest I've ever seen.

But I wish it were a Riemann/Graphite/whatever dashboard instead of
reimplementing its own data collection system.

There is a need for great dashboards, but I don't feel any need for yet
another format of data collection plugins.

------
lobster_johnson
Interesting! Really gorgeously rendered dashboards.

But also weird. The fact that the collectors, the storage _and_ the UI all run
on each box makes this more like a small-scale replacement for top and
assorted command-line tools such as iostat than a scalable, distributed
monitoring system. The lack of central collection means you cannot get a
cluster-wide view of a given metric, nor can you easily build alerting into
this.

I'm also disappointed that it reimplements a lot of collectors that already
exist in mature projects like Collectd and Diamond (and, more recently,
Influx's Telegraf). I understand that fewer external dependencies can be
useful, but still, does every monitoring tool really need to write its own
collector for reading CPU usage? You'd think there would be some
standardization by now.

For comparison, we use Prometheus + Grafana + a lot of custom metrics
collectors. Grafana is less than stellar, though. I'd love to have this UI on
top of Prometheus.

------
sleepyhead
Can we please stop with the "done right"?

------
glittershark
Having a custom plugin architecture for this is a total dealbreaker. We
already have statsd, why not just use that?

~~~
ktsaou
Well, I believe performance monitoring should be realtime, and I optimized
netdata for this. It's a console killer: a tool you can actually use instead
of the console tools. It is not (yet) a replacement for any other solution.

------
thesorrow
Monitoring without alerting is kinda useless. How can I aggregate multiple
servers?

~~~
thrownaway2424
Monitoring without alerting is not useless. It's just instrumentation.

------
gedrap
At the moment (like, literally now, just took a break and saw this), I am
configuring graphite + collectd + grafana (and probably cabot on top for
alerts), using ansible to set up collectd and sync the configuration across
the nodes.

After some time using graphite + statsd and friends, I came to really
appreciate the benefits of widely adopted open source components and the
flexibility they give over all-in-one solutions such as this. On the other
hand, solutions like this are much easier to configure, especially the first
time, when you are not familiar with the tools yet.

------
wyldfire
It's great that they've got all that explanatory prose for the metrics. That
would help when reviewing data with other team members who aren't familiar
with the context of each of these.

I have less of a realtime system review need than a post-mortem need. Today,
I'll use kSar to do that, but this tool looks much more capable.

It's too bad that it doesn't provide an init script or other startup feature.
The installer, while it doesn't seem to follow typical distribution patterns,
is otherwise fairly complete.

~~~
ktsaou
Nice that you like it.

Init scripts are in the system directory.

I have already decoupled the code for handling the chart comments, but it
needs some more work to remove it from the dashboard html and put it in
separate json files.

------
guiye
Very nice look and feel! But it's doing HTTP polling each second; maybe using
websockets or SSE would perform better. Great work!

------
kabdib
Did some spot checking. Found a race condition in the dictionary code in less
than five minutes of poking around. Ugh.

Edit: code to add an entry to the dictionary releases its lock, whereupon you
can wind up with duplicate NV pairs.
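The general shape of that bug, sketched in Python rather than netdata's actual C (the names and structure here are illustrative): a lookup and an insert done under separate lock acquisitions let two threads both miss and both insert; the fix is to hold the lock across the whole check-and-insert.

```python
import threading

class NVStore:
    """Toy name/value store where duplicate pairs are possible."""

    def __init__(self):
        self._lock = threading.Lock()
        self._pairs = []  # list of (name, value) tuples

    def add_racy(self, name, value):
        # Broken shape: the lock is released between the lookup and the
        # insert.  Two threads can both find the name missing, then both
        # append -- leaving duplicate NV pairs.
        with self._lock:
            present = any(n == name for n, _ in self._pairs)
        if not present:
            with self._lock:
                self._pairs.append((name, value))

    def add_safe(self, name, value):
        # Fixed shape: one critical section covers both the check and the
        # insert, so no second adder can slip in between them.
        with self._lock:
            if not any(n == name for n, _ in self._pairs):
                self._pairs.append((name, value))
```

The same rule applies in C: the mutex (or the write side of an rwlock) has to stay held from the failed lookup through the insert.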

~~~
haswell
This would be a more constructive comment if you provided more details or
perhaps submitted an issue in GitHub and shared it here.

Why does "found a bug" equate to an immediate "ugh"? Is the expectation that
projects submitted here are perfect?

~~~
kabdib
I think it was more that the OP led with his or her chin. "Done right?
Okay..."

It's also true that pretty much every piece of monitoring software I encounter
makes me sad when I look under the hood. And the proprietary ones that I've
seen under the hood of . . . hoo boy.

------
amelius
It would be nice if it could show the processes that were running at the time
of a peak in the graph.

Also, it would be nice if this could be run over multiple machines and show
combined results.

Further, it appears that this tool shows information that other tools
currently do not. Perhaps it would be nice if this tool allowed scripting
and/or had a CLI.

~~~
ktsaou
You can see application resource usage in the applications section. This
groups the whole process tree and reports cpu, memory, swap, disk, etc. usage
per process group. The grouping is controlled by /etc/netdata/apps.conf.

Dashboards can already have charts from multiple servers, but the UI for
configuring this is missing. If you build a custom dashboard yourself, it can
be done (just a div per chart -- no javascript needed on your side).

Regarding scripting, please open a github issue to discuss it. I really like
the idea.

------
vbtechguy
netdata is perfect for single server monitoring it's perfectly suited to
integration into my Centmin Mod LEMP stack installer
[https://community.centminmod.com/threads/addons-netdata-
sh-n...](https://community.centminmod.com/threads/addons-netdata-sh-new-
system-monitor-addon.7022/).

For folks wanting multiple servers, the wiki does mention that, I believe, at
[https://github.com/firehol/netdata/wiki#how-it-works](https://github.com/firehol/netdata/wiki#how-it-works)

------
ausjke
Impressive. The dashboard could be a bit more condensed, though; putting all
the details on one page is a little overwhelming. Maybe have some tabs (cpu,
memory, disk, network, etc.)?

------
rodionos
nmon gives you console beauty without external dependencies. You can watch it
in console mode and cron schedule it in batch mode for long-term data
collection.

------
notinventedhear
This looks really useful, although it doesn't seem to have a dashboard for
showing the aggregated results from multiple running daemons.

~~~
ktsaou
You can build this using HTML. Check
[https://github.com/firehol/netdata/wiki/Custom-Dashboards](https://github.com/firehol/netdata/wiki/Custom-Dashboards) --
no javascript necessary on your part, just a div per chart (each coming from a
different server).

------
brndn
Would implementing something like this on a server have any noticeable
performance impact?

~~~
dsr_
There are performance notes in the wiki. As I understand it, CPU is unlikely
to be impacted much, but you can increase the history setting beyond the
default and eat all your RAM.

Also, running plugins can increase CPU usage.
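The RAM cost of the history setting is easy to estimate. A back-of-envelope sketch, assuming roughly 4 bytes per collected sample per dimension (an assumption based on netdata's fixed-size round-robin storage; the dimension count below is a made-up figure):

```python
# Back-of-envelope RAM estimate for netdata's in-memory round-robin history.
# Assumes ~4 bytes per sample per dimension; the dimension count is
# hypothetical, for illustration only.
bytes_per_sample = 4
dimensions = 2000          # total metric dimensions across all charts
history = 3600             # samples kept per dimension (1 hour at 1s updates)

ram_mib = history * dimensions * bytes_per_sample / (1024 * 1024)
assert round(ram_mib) == 27    # ~27 MiB for an hour of per-second data
```

The point is that memory scales linearly with both the history length and the number of dimensions, so doubling the retention doubles the RAM.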

------
romanovcode
Pretty cool, does it also auto-update itself? I also think it's a bit
cluttered.

~~~
aroch
I'm guessing it doesn't update its own packages (that'd be a little odd), but
it looks to be using nodejs stuff in the background, so apt-get keeps that up
to date.

Also looks like you can turn off plugins you don't want/need:
[https://github.com/firehol/netdata/wiki/Configuration](https://github.com/firehol/netdata/wiki/Configuration)

------
igama
Looks pretty cool, going to test it soon.

------
jjuhl
I'd recommend people also check out SysOrb:
[http://sysorb.com/](http://sysorb.com/)

------
crudbug
+1, great work. Would love to see a React port.

~~~
pmlnr
It says fast, so no React, please.

~~~
crudbug
That is debatable, and rather opinionated.

~~~
pmlnr
Opinionated by not being on the newest hardware ;)

~~~
binaryblitz
Ok Tim Cook.

