
Introducing Vector: Netflix's On-Host Performance Monitoring Tool - r4um
http://techblog.netflix.com/2015/04/introducing-vector-netflixs-on-host.html
======
philsnow
I got down to

    
    
         you should be able to install PCP from binary packages made available by the PCP development team on:
    
        ftp.pcp.io
    

and that threw up a red flag in my brain. Then I noticed that
techblog.netflix.com doesn't redirect to https (and indeed can't serve https)
(I use [https://www.eff.org/HTTPS-EVERYWHERE](https://www.eff.org/HTTPS-EVERYWHERE)
and did not get content over https).

The directions _I_ saw for building from source looked pretty innocuous, but
you might see a different set of directions if you're being MITMed. Observe an
appropriate amount of caution.

~~~
mspier
Happy to discuss your concerns. PCP 3.10 should be available on Ubuntu's
official repo pretty soon too.

~~~
harshreality
His concerns seem plain to me. Unauthenticated channels for software
distribution or software installation instructions are bad.

The techblog isn't using SSL, and the git pull url for PCP is using the git
protocol which is also unauthenticated, rather than the authenticated https
transport (ssh is only an option when user accounts make sense).

Someone's at a conference and follows the link over public wifi. They get the
same page but with "here's how to get PCP: ftp evil.io or git clone
git://git.evil.io/pcp". Even if the webpage were ssl-enabled so that an
attacker can't rewrite the pcp.io links, an attacker or evil network operator
could MITM git.pcp.io or ftp.pcp.io. (FTP?!)

Being in Ubuntu's repo doesn't make it safe if Ubuntu's maintainers have no
(semi-)trustworthy way of getting the code.
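
The transport distinction above can be sketched as a quick URL check. This is only a minimal illustration: the function name `classify_transport` and the two scheme sets are assumptions made for the example, and an `https` or `ssh` URL only helps if the certificate or host key is actually verified by the client.

```python
from urllib.parse import urlparse

# Assumption for illustration: schemes with no server authentication.
UNAUTHENTICATED_SCHEMES = {"http", "ftp", "git"}
# Assumption: schemes that authenticate the server when verification is on.
AUTHENTICATED_SCHEMES = {"https", "ssh"}


def classify_transport(url: str) -> str:
    """Classify a download URL by whether its scheme authenticates the server."""
    scheme = urlparse(url).scheme.lower()
    if scheme in AUTHENTICATED_SCHEMES:
        return "authenticated"
    if scheme in UNAUTHENTICATED_SCHEMES:
        return "unauthenticated"
    return "unknown"


if __name__ == "__main__":
    # The first two URLs appear in this thread; the third is a placeholder.
    for url in (
        "git://git.evil.io/pcp",
        "ftp://ftp.pcp.io/projects/pcp/download/MD5SUM",
        "https://example.org/pcp.git",
    ):
        print(url, "->", classify_transport(url))
```

A check like this only flags the obvious cases; it says nothing about whether the host behind an authenticated URL is itself trustworthy.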

~~~
justizin
Ubuntu's maintainers can check the MD5SUM file on ftp.pcp.io:

    
    
      ftp://ftp.pcp.io/projects/pcp/download/MD5SUM
    

The project seems to be hosted by Red Hat these days.

~~~
willglynn
FTP is just as unauthenticated as everything else above, so having MD5SUMs
available over FTP doesn't really change the situation.
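
To make the point concrete: a checksum only proves the download matches whatever digest you compared it against, so the digest has to arrive over a channel the attacker cannot rewrite. Here is a rough sketch under that assumption. The helper names are hypothetical, the "tarball" is stand-in bytes, and SHA-256 is swapped in for MD5 purely as an illustration.

```python
import hashlib
import hmac


def sha256_hex(data: bytes) -> str:
    """Hex SHA-256 digest of the given bytes."""
    return hashlib.sha256(data).hexdigest()


def verify_digest(data: bytes, trusted_hex_digest: str) -> bool:
    """Compare the data's digest against one obtained over a trusted channel.

    If trusted_hex_digest came over the same unauthenticated channel as
    the data, a MITM can replace both and this check proves nothing.
    """
    return hmac.compare_digest(sha256_hex(data), trusted_hex_digest.lower())


if __name__ == "__main__":
    tarball = b"pretend tarball contents"  # stand-in for a real download
    trusted = sha256_hex(tarball)  # stands in for a digest fetched over HTTPS
    print(verify_digest(tarball, trusted))
    print(verify_digest(tarball + b" tampered", trusted))
```

The comparison itself is the easy part; the hard part is the provenance of `trusted`, which is exactly what an FTP-hosted MD5SUM file fails to provide.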

------
CWuestefeld
It's enormously frustrating to me that the Windows platform is so far behind.

I've been researching for a rebuild of our ecommerce site with the idea of a
modern microservice architecture. Team skillset and some other considerations
dictate a .Net environment.

Netflix and other companies have created a really rich platform, between
logging and monitoring technologies like this; message queueing; deployment;
and so on. In the Windows environment there's precious little to compare to.

One bright spot is the news story also on HN today [1] about MS announcing
Docker-related Hyper-V technology - but in the next version of Windows server.

It might be that to satisfy those .Net compatibility wishes, we just go to
Mono and do everything else in Linux.

[1]
[https://news.ycombinator.com/item?id=9342369](https://news.ycombinator.com/item?id=9342369)

~~~
wantab
The web was built, and runs on, *nix/BSD. Windows is an outlier. Windows can't
even get the slashes going the right way. There's a reason 80% of internet
traffic does not use Windows.

~~~
CWuestefeld
I'm not sure how that was supposed to be helpful.

I also don't think it's strictly true. For sure the underlying networking came
from Unix. But as for the web itself, once we got to real dynamic content, it
seems to me that Microsoft were the ones that got things moving.

While the Unix world was mired in the awful world of CGI, Microsoft gave us
high-performance ISAPI, and then Cold Fusion (also on the Windows platform)
and Microsoft's ASP made programming a little more sane. While the Unix world
tried to deal with JSP (which IMHO wasn't a very good solution), the Microsoft
platform seems to have been the innovators for several years, until Ruby on
Rails and then node.js and stuff started coming out.

Today, the Apache server powers far more sites than any other, it's true. But
IIS shares the 2nd-place spot [1]. When you say "There's a reason 80% of
internet traffic does not use Windows", that's pretty much true for the
servers, but far off the mark when counting clients. And the reason for that
is that Microsoft's strength hasn't historically been radical advances, but in
figuring out ways to take the bleeding edge tech that doesn't really work
quite right yet, and packaging it into commodity software that may not be as
sexy as envisioned by those with the original ideas, but actually useful to
the average guy.

[1] [http://www.zdnet.com/article/web-servers-microsoft-iis-and-n...](http://www.zdnet.com/article/web-servers-microsoft-iis-and-nginx-battle-for-second-place/)

EDIT - missing word "world" in 3rd para

~~~
wantab
Your examples presume there was something better than CGI at the time and the
other products are better than something else. For one, I wouldn't be caught
dead using any of the products you mentioned.

You claim IIS by using a ZDNet article from two years ago but the reality is
IIS is number three behind Nginx and Apache. Still, being a distant #2 is
nothing to brag about.

Now you're trying to claim clients are what powers the web but that's not the
topic. What an amateur uses does not define what the professional uses. And to
claim not wanting to be on the bleeding edge of things is no excuse for
falling behind. Firefox and Chrome knocked IE off its perch years ago by being
on the bleeding edge.

------
fapjacks
I just tried installing this on a test VPS. PCP went well, but whenever I
tried to input a "hostname" in Vector's UI to get stats, it just kept telling
me it couldn't connect and to check hostname, regardless of what I put in the
box. PCP was available and of course ports open and whatnot. I'm not sure what
the problem was, but that was a bit frustrating. It looks like a slick product
otherwise, especially for a first version! Thanks for releasing projects like
this! I'll be trying again in the future, for sure.

~~~
mspier
Can you open an issue on GitHub and post more details? Any errors on the
JavaScript console?

------
samstave
After using Stackdriver for the last 18 months, I would never go back to
rolling my own monitoring infra if I could avoid it. I had nagios, cacti,
munin, graphite all running and had two ops guys spending pretty much 80% of
their time managing it.

Stackdriver with pagerduty and I have 250,000 custom metrics being published
and hundreds and hundreds of graphs on dozens of dashboards.

Although, I am looking at SignalFX to give an even better version of this, but
I manage nearly 1,000 machines with only a staff of four.

~~~
makeitsuckless
Never heard of it, so checked out Stackdriver. Seemed very interesting,
especially since it's geared at AWS (which we use).

Until I noticed the big red flag on the front page _" Stackdriver is now part
of Google."_

Surely Google is going to kill this product and use the knowledge to offer
something similar for Google's cloud services.

~~~
samstave
We were concerned when Google bought them and we had many meetings with them
about this. I do not believe that will happen, but if that is still an issue
for you, use SignalFX.

------
rubiquity
Am I wrong in saying this is a web interface that wraps PCP? So you can't
really compare it to inspeqtor or collectd since those actually do the metrics
collection.

~~~
mspier
True. But that's the first release of Vector. We expect to make our custom PCP
agents public soon.

~~~
rubiquity
Sorry, I should clarify: I was making that statement for my understanding, not
to degrade the project!

------
debaserab2
What does this have that CloudWatch enhanced metrics doesn't? From the
screenshots, the metrics look pretty similar. Not a slight at all against this
project (it looks awesome), I'm just curious if your infrastructure is already
AWS based what would cause you to choose a non-CloudWatch option.

~~~
toomuchtodo
CloudWatch charges you per instance if you want 1 minute metrics instead of
the standard (free) 5 minute metrics. Any tool that collects the data for you
gets you out of that $3.50/instance/month detailed monitoring charge [1].

[1]
[http://aws.amazon.com/cloudwatch/pricing/](http://aws.amazon.com/cloudwatch/pricing/)
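
For a sense of scale, here is the arithmetic behind that charge as a small sketch. The $3.50 figure is the one quoted above (it may not match current pricing), and the fleet sizes and function name are hypothetical.

```python
from decimal import Decimal

# Per-instance monthly charge as quoted in the comment above (USD).
DETAILED_MONITORING_PER_INSTANCE = Decimal("3.50")


def monthly_detailed_monitoring_cost(instance_count: int) -> Decimal:
    """Monthly cost of 1-minute metrics for a fleet of the given size."""
    if instance_count < 0:
        raise ValueError("instance_count must be non-negative")
    return DETAILED_MONITORING_PER_INSTANCE * instance_count


if __name__ == "__main__":
    # Hypothetical fleet sizes, chosen only to show how the charge scales.
    for n in (10, 100, 1000):
        print(f"{n} instances: ${monthly_detailed_monitoring_cost(n)}/month")
```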

------
Thaxll
Not sure what the difference is between that and collectd + Graphite /
InfluxDB.

~~~
nathan_scott
There are many, many differences - there are some books about PCP that might
help clarify PCP design points, see:
[http://pcp.io/documentation.html](http://pcp.io/documentation.html)

------
capkutay
The charts look like they were built with nvd3. Can anyone confirm/deny?

~~~
mspier
That's right. Any suggestion of better reusable charts?

~~~
forrestthewoods
Honestly? Just write your own SVG code. If you inspect the elements the output
data is super simple and easy to understand. NVD3 just wraps D3.js which just
wraps utilities that output relatively basic data. Well, d3js is a data-
binding system that's way too god damned complicated if all you want to do is
make some simple charts which is how it's used 99% of the time.

I've spent more time trying to manipulate chart libraries into doing almost
the same thing but just different enough to cause pain and suffering. Output
your own path data and it's a million times easier.

For reference here's what I made:
[http://forrestthewoods.com/unbalanced-design-of-super-smash-...](http://forrestthewoods.com/unbalanced-design-of-super-smash-brothers-part-3/)

------
tdicola
Wow this looks great, thanks for releasing it! Nice to see even 'simple' stuff
like this that can help people who aren't running at Netflix scale is still
released.

------
vizzah
Could anyone who tried this monitoring tool compare it to Munin?

------
victorhooi
This is a slight tangent...but does anybody know what UI toolkit Netflix is
using for this?

Or if it's in-house, any info on whether they have, or might release it?

I see bootstrap-submenu.css mentioned, but not Bootstrap itself:

[https://github.com/Netflix/vector/tree/master/app/css](https://github.com/Netflix/vector/tree/master/app/css)

~~~
mspier
It's bootstrap (see bower.json dependencies), but with our own layer on top of
that.

------
xbryanx
I'm curious how this compares to Zabbix's agent and server. Does PCP give you
finer grained details, or is it possibly more lightweight?

~~~
nathan_scott
PCP is much finer-grained than Zabbix in terms of the metrics it makes
available (esp. from the Linux kernel); not sure on Zabbix costs but PCP is
quite light on all resources (mem, cpu, net) and very robust.

I've worked on production systems where everything else was failing (hardware,
kernel, applications) but PCP kept chugging along, recording and telling the
sad story to anyone that would listen.

------
strunz
Anyone care to compare this to something like collectd?
[https://collectd.org/](https://collectd.org/)

~~~
mspier
Goal is a bit different. Vector doesn't collect and persist metrics. We needed
something that had as little overhead as possible so it could be deployed to
all our hosts and simplify the process of analyzing those metrics.

~~~
toomuchtodo
If it's not collecting and persisting metrics, is it more of a glorified htop?

~~~
davidu
Wait, really?

Why not? Storage is cheap. Do you use something else to get historical
visibility into metrics?

~~~
brendangregg
Yes, Atlas, which is also open source:
[http://techblog.netflix.com/2014/12/introducing-atlas-netfli...](http://techblog.netflix.com/2014/12/introducing-atlas-netflixs-primary.html).
Atlas monitors cloud-wide, and stores historical metrics at a one-minute
granularity.

Vector is for per-instance custom drilldowns. I gave a talk last year where I
showed how they both fit together:
[http://www.brendangregg.com/blog/2014-09-27/from-clouds-to-r...](http://www.brendangregg.com/blog/2014-09-27/from-clouds-to-roots.html)

~~~
davidu
Got it... and thank you Brendan!

------
oimaz
Where can I find deb packages for pcp version 3.10 or higher?

~~~
jmedefind
The git repo has a MakePkg script that will generate deb packages for you.

I found it really easy to use.

