Netdata – Linux performance monitoring, done right (github.com/firehol)
477 points by cujanovic on March 30, 2016 | hide | past | favorite | 87 comments

Looks like another faster horse. A pretty GUI on /proc is not the most burning issue to solve in Linux performance monitoring. I wish anyone making these tools would spend 30 minutes watching my Monitorama talk about instance monitoring requirements at Netflix: http://www.brendangregg.com/blog/2015-06-23/netflix-instance... . I still hate gauges.

Where is the PMC support? At Facebook a few days ago, they said their number one issue was memory bandwidth. Try analyzing that without PMCs. You can't. And that's their number one issue. And it shouldn't be a surprise that you need PMC access to have a decent Linux monitoring/analysis tool. If that's a surprise to you, you're creating a tool without actual performance expertise.

It should front BPF tracing as well. Maybe it will in the future, and I can check back then.

First off, I think this is actually designed mainly for embedded applications. It's the only reason I can think of that they'd make a single self-contained monitor that only works for one host.

Second, it's great that you want a car instead of a faster horse, but I want 10 horses. Why?

  - 1 horse trained for 1 job easier & more reliable than 1 horse/car trained for 10
  - 10 horses more flexible than 1 horse or 1 car
  - if a horse isn't doing its job, shoot it and replace it
  - don't need to go to horse school to learn how to use or maintain one
  - don't need to hire an engineer to teach horse new trick
  - most problems can be solved with a horse
I get that with some applications, having a couple extra minutes saved by your monitoring system can mean millions of dollars saved. For those cases, there will always be custom-tailored solutions because there is value there. For 99% of the rest of the time, you get more value from simple tools that can be combined to do their jobs well.

If instead of focusing on the transportation we were focusing on the road, we could build a foundation for horses and cars to coexist. I'd love to see more protocols and specifications for each kind of monitor (other than syslog, RRD & SNMP) so they could simply be made according to spec and we could mix and match as we wished.

They are more than welcome to use FreeBSD's PMC support to roll their own implementation. But I predict they won't until, after another decade of beating their heads against broken tools and non-engineered solutions, light through yonder window finally breaks.

Unrelated: I'm in the middle of your Performance book; I wish I had studied it several years ago.

For this tool, I wonder what the observer effect is.

Just curious: how closely are you involved in Vector development? Do you think Vector tackles the issues you mention?

Yes, I'm involved, and no, I don't think the current public release of Vector tackles many of the issues yet. But we've been working on them and will have it released when we can. Items include:

- PMCs: Vector's backend is PCP, which has a Linux perf_events PMDA for PMC support. There are no PMCs enabled by default in Vector, since we're using it in an environment without PMC access, but I've been working heavily in this area. More on that later.

- Flame graph support: we already have it and use it. Ongoing work includes new flame graph types, and a rewritten flame graph implementation in d3. Need to get it all published.

- Heat map support: just solved an issue with them; again, an area where we have ad hoc tools that work and bring value, but haven't wrapped it all in Vector yet.

- BPF support: I've been prototyping many new metrics that we want in Vector, https://github.com/iovisor/bcc#tools. There'll be increased demand to get Vector accessing these in a couple of months or so, when we have newer kernels in production that have BPF.

Yep, if you aren't looking at eBPF now, you probably aren't doing it right. And BTW, thanks for that, Brendan.

Don't use the red-green combination in charts: it's really hard to read for those of us with a degree of red-green color blindness, which is the most common type among the ~5% of men and ~1% of women who have color blindness.

Other than that it looks AWESOME.

The female rate is closer to 0.5%, the male rate varies wildly based on race [0]:

Fiji Islanders: 0.8% vs Arab: 10%

So poor design is racist :)

[0] https://en.wikipedia.org/wiki/Color_blindness#Frequency_of_r...

Wait, I'm more likely to be color blind because of my ethnicity? Well that's interesting...

EDIT: The article says "Arabs (Druzes)". So, if they're specifying the Druze ethnicity, does that have any implications for Arabs in general?

They probably don't know. It's very likely that a subsample of that ethnicity had good medical data, so conclusions can be drawn reliably only for that subsample. It would, however, indicate an area for study should reliable sample sizes of other 'near' populations become available.

I would say rather, provide an option to change colors for those with various types of color blindness. That way it's accessible but still intuitive to those who can discern and are familiar with typical meanings of red and green.

Well, it's pretty. It's probably great if you have one to five machines you care about, or you really want a pretty dashboard.

Notable features that I would need all relate to multi-server usage:

- central config across hosts

- alerting when values go over or under thresholds

- a mode for automatically selecting and viewing the machines which are working hardest, or not working

- a mode for viewing of a few stats across all machines

- a mode for slide-show viewing of a few stats across all machines

Yes, for those of us who do have one to five machines this is awesome, because most other monitoring solutions are really annoying to deploy, since they presume more than five machines :)

How about amon.cx or mmonit? Or, if there's no need to self-host, New Relic?

Maybe "monitoring" is a poor word choice when there are no alerting/notification capabilities or statistics rollup. But there are different tools out there to do just that.

This seems super useful for zooming in on some performance characteristics of a single server to debug an issue!

Agreed, this is a nicer way to see what's happening when something happens... if you only have one server, haha.

For historical "monitoring", alerts, I still need my NewRelic / ELK kind of tools.

So, from "Introducing-netdata" on their wiki, there's this excerpt:

With netdata, there is no need to centralize anything for performance monitoring. You view everything directly from their source. Still, with netdata you can build dashboards with charts from any number of servers. And these charts will be connected to each other much like the ones that come from the same server.

So seems like this is on purpose. And if I understand this correctly, I can just build custom HTML dashboard to connect to multiple machines?

check out http://riemann.io/. You just have to build your own dashboard.

Your casual tone makes it sound so easy. If only that were true!

I believe he may have meant "glue one together from parts found on Github".

Case in point: https://github.com/Shopify/dashing

Not to knock this project (I use it and love it), but the learning curve involved here is not trivial. Using it basically requires that you're an accomplished frontend developer, and all this just to display some metrics on a page.

Here's the presumed knowledge:

    * Ruby
    * Sinatra
    * CoffeeScript
    * JavaScript
    * HTML
    * CSS
    * SCSS
    * Rufus
    * Sprockets
Nine different languages and libraries! Jesus H. Christ!

It's plain to see why outfits like Splunk can get away with charging as much as they do - visualizing metrics in that app is as simple as installing a deb package, logging in, and pointing it at your data source.

Setting up Dashing by hand is comparatively ...difficult.

This is why we pay companies like Datadog and SignalFX to do it right.

...or merrily trudge along with Zabbix, or curse under the cruel rule of Nagios...

Wavefront does not do a lot of marketing right now, but they have a great product.

You forgot Clojure. Riemann's monitoring config is written in Clojure.

There's a much more approachable all-JS Riemann clone called Godot that I've successfully used on multiple projects, but it still requires some work to make the frontend look good.

I saw the list and thought "hey I can manage!" ... Then Clojure came along...


> I believe the future of data collectors is node.js


I'm sure most of us can assume reasons why collecting data with node.js might be "wrong," but it would be more helpful to the conversation if you would spell out specific reasons why using node.js for this use case is not optimal, instead of just commenting with a single emoji.

Because it's just not necessary, and meanwhile most experienced SREs and performance engineers are imperative programmers. We don't see the benefits of using JavaScript on our servers, and that's rather putting it mildly.

Also JavaScript doesn't have a 64-bit integer type, which is absolutely necessary to properly support large counters. There are workarounds but the fact that you need one is ridiculous.
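To put a rough number on that: back-of-the-envelope arithmetic for when a plain byte counter outgrows JavaScript's exact-integer range (the 10 Gbit/s line rate is an assumed round number for illustration):

```python
# IEEE-754 doubles (JavaScript's only number type at the time) represent
# integers exactly only up to 2**53. How long until a saturated 10 Gbit/s
# interface byte counter passes that point?
BYTES_PER_SEC = 10e9 / 8                # 10 Gbit/s in bytes per second
limit = 2 ** 53                         # 9,007,199,254,740,992
seconds = limit / BYTES_PER_SEC
print(f"{seconds / 86400:.0f} days")    # roughly 83 days until increments can be silently lost
```

Past that point, adding 1 to the counter can round back to the same value, so the "workaround" (splitting into two 32-bit halves or using a string-based big-number library) becomes mandatory.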

Many of us believe that the only reason one would want to use JavaScript is because you're doing browser programming and therefore you have no other choice.

I've seen plenty of server-based applications written in node.js. Just because JavaScript used to be pigeonholed into the front end doesn't mean it can't perform well in server applications.

A good example of a node.js app that is purely server side is Hubot, a chat bot created at Github and widely used in Slack, HipChat, IRC, etc. I'm sure there are thousands of others out there, and I don't believe being written in JavaScript puts them at any fundamental disadvantage compared to server applications written in Python, Java, C++, or any other language.

"If you have a hammer, everything starts to look like your thumb."

JavaScript is reasonably suited to servers written around event loops like Hubot. But as a general purpose programming language I think it's reasonable to argue that it's pretty bad compared to alternatives like Erlang and Go in the space it's used in. Also debugging is a pain in the ass because stack traces aren't useful at all. (That's not limited to JS but it does make my life harder.)

In any event, we're talking about a telemetry collector, which is generally a trivial piece of code that simply doesn't need whatever benefits JS purports to provide.

For systems programming though? Running on a server is not the only similarity you should be looking at.

None of what you just said was an actual argument against node.js, though. All you said was "Well, we don't like it."

Why don't you like it? What's wrong with it? What would be better?

It's not about Node per se; it's about JavaScript. System telemetry collectors don't benefit much from languages built around event loops.

Many of us already know C, Bourne Shell and Python and probably another scripting language and all of those can get telemetry data off systems and into an event bus or metrics aggregator quickly enough. Adding JavaScript adds complexity without giving us significant new functionality.

Now the data collector (gathering metrics from many senders for aggregation and storage) is a different story. I've seen them written in JavaScript but the 64-bit counter difficulties would rule it out for me were I to implement one again.
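As an illustration of how small a telemetry sender can be: a Python sketch that pushes one gauge over UDP using the statsd line protocol (the host, port, and metric name here are illustrative assumptions, not anything from this thread):

```python
import socket

def format_gauge(name, value):
    # statsd line protocol: "<metric>:<value>|g" marks a gauge
    return f"{name}:{value}|g".encode()

def send_gauge(name, value, host="127.0.0.1", port=8125):
    # UDP is fire-and-forget: no connection setup, negligible overhead,
    # and a down aggregator can't block the sender
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.sendto(format_gauge(name, value), (host, port))
    sock.close()

send_gauge("host.cpu.util", 12.5)   # hypothetical metric name
```

A cron job or shell loop around something this size covers most single-host collection needs, which is the commenter's point.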

> All you said was "Well, we don't like it"

That's not what I see.

It's a weakly typed language, and it exhibits all kinds of bizarre, not-very-well-thought-out behaviors.

Cf. the disaster with npm a few days ago.

These problems are particularly pernicious for large scale apps - which this isn't, currently, but probably aspires to be one day.

Any strongly typed language would be better (python, go, ruby, etc.).

Can you explain how the npm incident is related to typing? I am an advocate of strong typing and love seeing it be the answer to all sorts of problems (e.g. concurrency), but I'm not understanding the connection here.

That wasn't related to JS's weak typing, and I didn't say that it was, but it's indicative of the kinds of other problems that will occur.

Not parent, but IMHO requiring nodejs is heavy for something as simple as data collection. I wouldn't mind if plugins (sufficiently complex ones) can be written in nodejs, but requiring nodejs is too much.

I'd prefer something like Lua (and C, of course), or even Python (which is installed on most systems anyway, but still too heavy).

Is there such a thing as 95th-percentile threshold CPU monitoring?

Consider an application that spikes (close to 100% on a core) for 2-3s on some web requests; let's assume this is normal (nothing can be done about it). Now, let's say the average user of the system is idle for 2 minutes per web request. So, users won't see performance degradation unless $(active-users) > $(cores) during a 2-3 second window.

For most monitoring systems, CPU is reported as an average over a minute, and, even if it's pinned only 2-3s per 60s, that's only 5% usage. Presume a 2 CPU system with 5 users, who all happen to be in a conference call... and hitting the system at exactly the same time (but are otherwise mostly idle). The CPU graph might show 10-15% usage (no flag). Yet, those 5 users will report significant application performance issues (one of the users will have to wait 6-9s).

What I'd like to monitor, as a system administrator, is the 95th-percentile utilization of the CPUs: over the minute, throw away the bottom 94% of samples (mostly idle cycles) and report to me the CPU utilization of the next-highest percentile. That should show me those pesky CPU spikes. Does anything do that?
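One way to approximate this is to sample /proc/stat much faster than once a minute and report a high percentile of the per-interval utilization. A rough Python sketch (the field offsets are the standard Linux /proc/stat layout; the interval and duration are arbitrary choices):

```python
import time

def cpu_times():
    # First line of /proc/stat: "cpu user nice system idle iowait irq softirq ..."
    with open("/proc/stat") as f:
        fields = [int(x) for x in f.readline().split()[1:]]
    idle = fields[3] + fields[4]          # idle + iowait
    return sum(fields), idle

def percentile(samples, p):
    s = sorted(samples)
    return s[min(len(s) - 1, int(len(s) * p / 100))]

def sample_p95(duration=60, interval=0.1):
    # Sample 10x/second so 2-3s spikes dominate the top percentiles
    samples, (prev_total, prev_idle) = [], cpu_times()
    for _ in range(int(duration / interval)):
        time.sleep(interval)
        total, idle = cpu_times()
        busy = (total - prev_total) - (idle - prev_idle)
        samples.append(100.0 * busy / max(1, total - prev_total))
        prev_total, prev_idle = total, idle
    return percentile(samples, 95)
```

With 100ms samples, a 2-3s full-core spike occupies the top few percent of the minute's samples, so p95 flags it even though the one-minute average stays near 5%.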

I don't really understand your question. Fundamentally a process is either on the CPU, or not. There's nothing between 0 and 100% CPU usage. CPU utilization only makes sense with a windowing function, for example in 90% of samples over the last second, some process was on the CPU.

I think what you want for your application, if I understand any of it correctly, is the time spent by a runnable process waiting to get scheduled on a CPU. That would indicate contention. If a process runs immediately when it becomes runnable, then there's no contention, no matter what the windowed utilization looks like.
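On Linux, that run-queue wait time is exposed per process in /proc/&lt;pid&gt;/schedstat (when the kernel has scheduler statistics enabled). A sketch of reading it, with the field layout per the kernel's scheduler-stats documentation:

```python
def parse_schedstat(text):
    # /proc/<pid>/schedstat: <ns on CPU> <ns waiting to run> <timeslices>
    on_cpu_ns, run_delay_ns, timeslices = (int(x) for x in text.split())
    return {"on_cpu_ns": on_cpu_ns,
            "run_delay_ns": run_delay_ns,
            "timeslices": timeslices}

def sched_wait_ns(pid):
    # Cumulative nanoseconds this process spent runnable but not running;
    # a growing value under load indicates CPU contention
    with open(f"/proc/{pid}/schedstat") as f:
        return parse_schedstat(f.read())["run_delay_ns"]
```

Sampling run_delay_ns for the request-handling processes over time would directly show the contention the parent comment describes, independent of averaged utilization.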

You're more interested in response time than CPU usage; the customer is having a bad time whether the slow responses are due to CPU or IO or bad weather.

That's another (important) measure, but not the one I'm interested in. In the very short run, I can add CPUs if there is hardware contention. The response time is an indicator of, but doesn't prove, CPU starvation. Request response-time analysis is a longer-term quality-of-service measure for identifying and fixing the application (if that is even feasible).


There's a reason ISPs bill at the 95th percentile: it's directly correlated with the cost of over-subscribed resources. If you're running a VM cluster, you've got a similar issue, only with CPUs (and memory) as well as bandwidth. Most operational divisions don't have the luxury of fixing a vendor application. Instead, they have limited variables to play with, CPU cores and memory being the coarse levers. While one could argue this is an "application" issue, and you'd be correct, it's irrelevant to my question.

I'm asking what tools are available for system administrators to diagnose and address CPU starvation (under spiked usage). Current tools and techniques I'm aware of don't seem to measure this.

The only metric that matters is the measure of the business task at hand. You might dig into CPU utilization after you've identified a problem with your application, but trying to identify an application problem by measuring CPU is like trying to determine where your shipment is by looking at engine RPM of the freight vehicle.

That being said, historic CPU utilization is a useful metric for capacity planning.

Once you've identified long response times, it would also be possible to observe that the threads handling requests spent a good percentage of time waiting for CPU, which would point to CPU saturation as the problem. I'm not sure how you do this on GNU/Linux, but you can assess this on illumos with ptime(1) or prstat(1M).

I think your concern about measuring CPU utilization is real, though. You can use frequent sampling and present samples on a heat map to deal with this problem. There are some examples here: http://www.brendangregg.com/HeatMaps/utilization.html

Don't you overestimate an evening with strace and tcpdump a little bit as a "quality of service measure"? In my experience the reasons for high request response times are very easy to find, but not necessarily easy to fix. So at best it takes 30 minutes to find the root of the problem and 30 minutes to come up with a fix, at worst it takes 30 minutes to find and half a man-year to fix an entire stack.

I am not aware of such a tool. It is a good idea to report p50, p90, p99, and p99.9 for CPU utilization, but I also think it is good to report the average as well. Application performance issues should be monitored from the app's point of view (response latency); if you have that, you can drill down and pinpoint CPU issues. Generally speaking, it is better to monitor metrics from the upper layers than to just look at OS graphs. Most monitoring systems cover these in one dashboard so you can easily track down issues.

Gave it a try. Definitely not useful for running the daemon and viewing the UI on the same machine: Chrome alone eats 50% of one of the cores to show the realtime data.

On my RPi B, the daemon eats 4% average on all four cores, with almost all the time spent in the kernel. I assume polling the various entries under /proc/ is costly.

The dashboard is gorgeous, one of the prettiest I've ever seen.

But I wish it were a Riemann/Graphite/whatever dashboard instead of reimplementing its own data collection system.

There is a need for great dashboards, but I don't feel any need for yet another format of data collection plugins.

Interesting! Really gorgeously rendered dashboards.

But also weird. The fact that the collectors, the storage, and the UI all run on each box makes this more of a small-scale replacement for top and assorted command-line tools such as iostat than a scalable, distributed monitoring system. The lack of central collection means you cannot get a cluster-wide view of a given metric, nor can you easily build alerting on top of this.

I'm also disappointed that it reimplements a lot of collectors that already exist in mature projects like Collectd and Diamond (and, more recently, Influx's Telegraf). I understand that fewer external dependencies can be useful, but still, does every monitoring tool really need to write its own collector for reading CPU usage? You'd think there would be some standardization by now.

For comparison, we use Prometheus + Grafana + a lot of custom metrics collectors. Grafana is less than stellar, though. I'd love to have this UI on top of Prometheus.

Can we please stop with the "done right"?

Having a custom plugin architecture for this is a total dealbreaker. We already have statsd, why not just use that?

Well, I believe performance monitoring should be realtime, and I optimized netdata for this. A console killer: a tool you can actually use instead of the console tools. It is not (yet) a replacement for any other solution.

Monitoring without alerting is kinda useless. How can I aggregate multiple servers?

You use zabbix :-)

Nice open source solution (everything open source, not "open core") with possibility to pay a reasonable (my words) amount for support.

Source: Use it at work, unsupported (because we are small and it just works without support at our current scale). Otherwise unaffiliated.

Monitoring without alerting is not useless. It's just instrumentation.

Not at all - historical data can be extremely useful.

At the moment (like, literally now, just took a break and saw this), I am configuring graphite + collectd + grafana (and probably cabot on top for alerts), using ansible to set up collectd and sync the configuration across the nodes.

After some time of using graphite + statsd and friends, I came to really appreciate the benefits of using widely adopted open source components and the flexibility it gives over all-in-one solutions such as this. On the other hand, solutions like this are much easier to configure, especially the first time when you are not familiar with the tools yet.

It's great that they've got all that explanatory prose for the metrics. That would help when reviewing data with other team members who aren't familiar with the context of each of these.

I have less of a realtime system review need than a post-mortem need. Today, I'll use kSar to do that, but this tool looks much more capable.

It's too bad that it doesn't provide an init script or other startup feature. The installer, while it doesn't seem to follow typical distribution patterns, is otherwise fairly complete.

Glad you like it.

Init scripts are in the system directory.

I have already decoupled the code for handling the chart comments, but it needs some more work to remove it from the dashboard html and put it in separate json files.

Very nice look and feel, but it's doing HTTP polling every second; maybe using WebSockets or SSE could perform better. Great work!

Did some spot checking. Found a race condition in the dictionary code in less than five minutes of poking around. Ugh.

Edit: code to add an entry to the dictionary releases its lock, whereupon you can wind up with duplicate NV pairs.
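A Python stand-in for the pattern being described (the actual code is C; the names here are illustrative, not from the project): the lookup and the insert are guarded by separate lock acquisitions, so two threads can both miss the key in the gap and both insert it.

```python
import threading

lock = threading.Lock()
entries = []  # (name, value) pairs, mirroring the dictionary's NV pairs

def add_racy(name, value):
    with lock:
        exists = any(n == name for n, _ in entries)
    # lock released here: another thread can pass the same check and
    # insert `name` before we reacquire, yielding duplicate pairs
    with lock:
        if not exists:
            entries.append((name, value))
```

The fix is the usual one: hold a single lock (or a write lock) across both the lookup and the insert so check-then-act is atomic.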

This would be a more constructive comment if you provided more details or perhaps submitted an issue in GitHub and shared it here.

Why does "found a bug" equate to an immediate "ugh"? Is the expectation that projects submitted here are perfect?

I think it was more that the OP led with his or her chin. "Done right? Okay..."

It's also true that pretty much every piece of monitoring software I encounter makes me sad when I look under the hood. And the proprietary ones that I've seen under the hood of . . . hoo boy.

Hi, if you looked at dictionary.c, this is not used yet. But anyway, can you help me track it down?

It's open source on Github: make a pull request.

It would be nice if it could show the processes that were running at the time of a peak in the graph.

Also, it would be nice if this could be run over multiple machines and show combined results.

Further, it appears that this tool shows information that other tools currently do not show. Perhaps nice if this tool allowed scripting and/or a CLI.

You can see application resource usage in the applications section. This groups the whole process tree and reports CPU, memory, swap, disk, etc. usage per process group. The grouping is controlled by /etc/netdata/apps.conf.

Dashboards can already have charts from multiple servers, but the UI for configuring this is missing. If you build a custom dashboard yourself, it can be done (just a div per chart; no JavaScript on your side).

Regarding scripting, please open a github issue to discuss it. I really like the idea.

netdata is perfect for single-server monitoring; it's perfectly suited to integration into my Centmin Mod LEMP stack installer https://community.centminmod.com/threads/addons-netdata-sh-n....

For folks wanting multiple servers, the wiki does mention that, I believe, at https://github.com/firehol/netdata/wiki#how-it-works

Impressive. The dashboard could be a bit condensed, though; putting all the details on one page is a little overwhelming. Maybe add some tabs (CPU, memory, disk, network, etc.)?

nmon gives you console beauty without external dependencies. You can watch it in console mode and cron schedule it in batch mode for long-term data collection.

This looks really useful, although it doesn't seem to have a dashboard for showing the aggregated results from multiple running daemons.

You can build this using HTML. Check https://github.com/firehol/netdata/wiki/Custom-Dashboards No JavaScript necessary on your part; just a div per chart (each coming from a different server).

Would implementing something like this on a server have any noticeable performance impact?

There are performance notes in the wiki. As I understand it, CPU is unlikely to be impacted much, but you can increase the history setting away from the default and eat all your RAM.

Also, running plugins can increase CPU usage.
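For reference, the knob in question looks roughly like this in netdata's config (the path and the bytes-per-sample figure are my recollection of the wiki; treat both as assumptions):

```ini
# /etc/netdata/netdata.conf (assumed default location)
[global]
    # seconds of per-second history kept in RAM; memory cost is roughly
    # history x dimensions x 4 bytes, so large values eat RAM quickly
    history = 3600
```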

Pretty cool, does it also auto-update itself? I also think it's a bit cluttered.

I'm guessing it doesn't update its own packages (that'd be a little odd), but it looks to be using node.js stuff in the background, so apt-get keeps that up to date.

Also looks like you can turn off plugins you don't want/need: https://github.com/firehol/netdata/wiki/Configuration

Looks pretty cool, going to test it soon.

I'd recommend people to also check out SysOrb : http://sysorb.com/

+1, great work. Would love to see a React port.

It says fast, so no React, please.

But .. but.. React is faster than JavaScript, isn't it? /s

That is debatable, and mostly a matter of opinion.

Opinionated by not being on the newest hardware ;)

Ok Tim Cook.
