
Monitoring Cloudflare's edge network with Prometheus - kiyanwang
https://drive.google.com/file/d/0BzRE_fwreoDQNzUybnRFOHpWZTA/view
======
bebop
Here is the video if you want to watch the full talk:
[https://www.youtube.com/watch?v=lHtY7TUsLzk](https://www.youtube.com/watch?v=lHtY7TUsLzk)

~~~
MetalMatze
The talk originally given at PromCon 2017:
[https://www.youtube.com/watch?v=ypWwvz5t_LE](https://www.youtube.com/watch?v=ypWwvz5t_LE)

------
the_evacuator
Prometheus is an escaped implementation of Google’s borgmon, which is seen
inside Google as a kind of horror show, and alternatives have been developed.
It is kind of frightening that it has got out in the wild and people like it.

~~~
clhodapp
Why is borgmon considered a horror show? Is something fundamentally flawed in
the model? How do the alternatives differ from the original borgmon?

~~~
kyrra
Google has an alternative that they gave a talk on back in December. Sadly
there aren't any papers on it yet. It's called Monarch, and it's what backs
Stackdriver.

Its config language is less crazy (Python-based), and it operates globally.

[https://www.youtube.com/watch?v=LlvJdK1xsl4](https://www.youtube.com/watch?v=LlvJdK1xsl4)

Edit: Monarch's config isn't sane; it's just different, and at least it's not
written in the crazy languages that borgmon uses.

~~~
alin-sinpalean
I will point out that borgmon's language (minus macros) is almost a 1:1 match
with Prometheus. You can judge for yourself how crazy that is, but I feel that
it's close to as simple as you can get for the power it gives you.

As for Monarch, it's a very different beast. For one, it stores all its rules
in a protocol buffer format, so it's more structured. But then you have to
write Python code that generates the protocol buffers and pushes them to
storage. It looks similar to, but is not the same as, the ad-hoc query
language. I wouldn't go so far as to call it sane.

It is also a service, optimized for Google's network architecture with
datacenter-local and global nodes. The language itself is aware of this
distinction: some computations are done locally, others globally, and so on.

For your local monitoring needs (or even global ones, if you're willing to put
in the effort), Prometheus is a solid choice.

~~~
kyrra
I'd agree with you after thinking about it some. I haven't really written
either one, mostly copy/pasting or using tools to assist in creation. So I
can't really judge either monitoring language on its ease of use.

------
lima
I don't get the hype around Prometheus.

What makes its pull-based mechanism superior to push-based ones like statsd?

And using exporters sounds clunky - instead of directly querying a metric and
sending it to your metrics collector, you have an intermediate component which
exposes them for collection.

~~~
atombender
Prometheus _does_ support push. It's just that it's considered such an
antipattern that it's been moved into a separate module (the Push Gateway)
that you need to run separately.

Pulling has a few technical benefits, though. For one, only the puller needs
to know what's being monitored; the thing being monitored can therefore be
exceedingly simple, dumb and passive. Statsd is similarly simple in that it's
just local UDP broadcast, of course, which leads to the next point:

Another benefit is that it allows better fine-grained control over _when_
metrics gathering is done, and _what_. Since Prometheus best practices dictate
that metrics should be computed at pull time, it means you can fine-tune
collection intervals to specific metrics, and this can be done centrally. And
since you only pull from what you have, it means there can't be a rogue agent
somewhere that's spewing out data (i.e. what a sibling comment calls
"authorative sources").

But to understand why pull is a better model, you have to understand
Google's/Prometheus's "observer/reactor" mindset towards large-scale
computing; it's just easier to scale up with this model. Consider an
application that implements some kind of REST API. You want metrics for things
like the total number of requests served, which you'll sample now and then.
You add an endpoint /metrics running on port 9100. Then you tell Prometheus to
scrape (pull from)
[http://example.com:9100/metrics](http://example.com:9100/metrics). So far so
good.
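
A minimal sketch of such an endpoint using the official Go client library
(the metric name, port, and paths here are only illustrative, not taken from
the talk):

    package main

    import (
        "net/http"

        "github.com/prometheus/client_golang/prometheus"
        "github.com/prometheus/client_golang/prometheus/promauto"
        "github.com/prometheus/client_golang/prometheus/promhttp"
    )

    // The application increments its own counter; Prometheus only reads it.
    var requestsTotal = promauto.NewCounter(prometheus.CounterOpts{
        Name: "myapp_http_requests_total",
        Help: "Total number of HTTP requests served.",
    })

    func main() {
        // Normal API endpoint: it just counts its own work.
        http.HandleFunc("/api", func(w http.ResponseWriter, r *http.Request) {
            requestsTotal.Inc()
            w.Write([]byte("ok"))
        })
        // Passive /metrics endpoint that Prometheus scrapes.
        http.Handle("/metrics", promhttp.Handler())
        http.ListenAndServe(":9100", nil)
    }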

The beauty of the model arises when you involve a dynamic orchestrator like
Kubernetes. Now we're running the app on Kubernetes, which means the app can
run on many nodes, across many clusters, at the same time; it will have a lot
of different IPs (one IP per instance) that are completely dynamic. Instead of
adding a rule to scrape a specific URL, you tell Prometheus to ask Kubernetes
for all services and then use _that_ information to figure out the endpoint.
This dynamic discovery means that as you take apps up and down, Prometheus
will automatically update its list of endpoints and scrape them. Equally
important, Prometheus goes to the _source_ of the data at any given time. The
services are already scaled up; there's no corresponding metrics collection to
scale up, other than in the internal machinery of Prometheus' scraping system.
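
A hypothetical prometheus.yml fragment showing the idea (here discovering
pods rather than services; the job name and the opt-in annotation are just
one common convention, not something from the talk):

    scrape_configs:
      - job_name: kubernetes-pods
        kubernetes_sd_configs:
          - role: pod
        relabel_configs:
          # Only keep pods that opt in via a "prometheus.io/scrape: true" annotation.
          - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
            action: keep
            regex: "true"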

In other words, Prometheus is _observing_ the cluster and _reacting_ to
changes in it to reconfigure itself. This isn't exactly new, but it's core to
Google's/Prometheus's way of thinking about applications and services, which
has subsequently coloured the whole Kubernetes culture. Instead of configuring
the chess pieces, you let the board inspect the chess pieces and configure
itself. You want the individual, lower-level apps to be as mundane as
possible, let the behavioural signals flow upstream, and let the higher-level
pieces make decisions.

This dovetails nicely with the observational data model you need for
monitoring, anyway: First you collect the data, then you check the data, then
you report anomalies within the data. For example, if you're measuring some
number that can go critically high, you don't make the application issue a
warning if it goes above a threshold; rather, you collect the data from the
application as a raw number, then perform calculations (e.g. max over the last
N mins, sum over the last N mins, total count, etc.) that you compare against
the threshold.
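
For instance, the application only exports the raw number, and the threshold
check lives entirely in the query or alerting rule; a hypothetical PromQL
expression (metric name and threshold invented for illustration):

    # fire if the queue depth exceeded 100 at any point in the last 5 minutes
    max_over_time(queue_depth[5m]) > 100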

In practice, implementing a metrics endpoint is exceedingly simple, and you
get used to "just writing another exporter". I've written a lot of exporters,
and while this initially struck me as heavyweight and clunky, my mindset is
now that an HTTP listener is actually more lightweight than an "imperative"
pusher script.

~~~
zaroth
But why is it simpler for Prometheus to have to query Kube to discover all the
endpoints in order to collect the data, versus the endpoints just pushing out
to Prometheus?

Obviously endpoints already need to know how to contact all sorts of services
they depend on. So it's not like you're "saving" anything by not telling them
"PrometheusIP = X".

Let's say you want to cleanly shut down some instances of your endpoint. They
are holding connection stats & request counts that you don't want to lose.
With push, the endpoint can close its connection handler, finish any
outstanding requests, push final stats, and then exit. With pull, are you
supposed to just sit and wait until a pull happens before the process can
exit?

~~~
lobster_johnson
Because it shifts all the complexity to the monitoring system, making the
"agents" really, really dumb. There would have to be more to push than just a
single IP:

* Many installations run multiple Prometheus servers for redundancy, so to start, it'd have to be multiple IPs.

* They would also need auth credentials.

* They'd need retry/failure logic with backoff to prevent dogpiling.

* Clients would have to be careful to resolve the name, not cache the DNS lookup, in order to always resolve Prometheus to the right IP.

* If Prometheus moves, _every_ pusher has to be updated.

* Since Prometheus wouldn't know about pushers, it wouldn't know if a push has failed. As Prometheus is pull-based, you can detect actual failure, not just absence of data.

There's a lot to be said for Prometheus' principle of baking exporters into
individual, completely self-encapsulated programs — as opposed to things like
collectd, diamond, Munin, Nagios etc. that collect a lot of stuff into a
single, possibly plugin-based, system.

Don't forget, a lot of exporters come with third-party software. You want
those programs to have as little config as possible. If I release an open-
source app (let's say, a search engine), I can include a /metrics handler, and
users who deploy my app can just point their Prometheus at it. It's enticingly
simple.

As for graceful shutdown: the default pull interval is 15 seconds, and you
can shorten it if you want to avoid losing metrics. Prometheus is deliberately
not designed for extremely fine-grained metrics; losing a few requests due to
a shutdown shouldn't matter in the big picture. But for metrics that _are_
sensitive, it's easy enough to bake them into some stateful store anyway
(Redis or etcd, for example), or to compute them in real time from stateful
data (e.g. SQL). For example, if you have some kind of e-commerce order
system, it's better if the exporter produces the numbers by issuing a query
against the transaction tables, rather than maintaining RAM counters of
dollars and cents.
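
A hedged sketch of that last pattern with the Go client library: the value is
recomputed from the database on every scrape instead of being kept as an
in-memory counter (the table, column, and metric names are made up):

    package exporter

    import (
        "database/sql"

        "github.com/prometheus/client_golang/prometheus"
    )

    // registerOrderTotal exposes the total order value, reading it from the
    // transaction table at scrape time rather than counting in RAM.
    func registerOrderTotal(db *sql.DB) {
        ordersTotal := prometheus.NewGaugeFunc(prometheus.GaugeOpts{
            Name: "shop_order_value_dollars",
            Help: "Total order value, computed from the orders table at scrape time.",
        }, func() float64 {
            var total float64
            // Error handling omitted for brevity.
            db.QueryRow("SELECT COALESCE(SUM(amount), 0) FROM orders").Scan(&total)
            return total
        })
        prometheus.MustRegister(ordersTotal)
    }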

~~~
boto3
How do you handle aggregated metrics, e.g. request count? What does the
instance (whether server or container) expose at localhost:9001/metrics?

~~~
foxylion
You would expose a counter with the total request count. Summing those up
across all nodes known to Prometheus will give you the total number of
requests currently visible to monitoring. With "rate()" you can calculate the
requests per second.
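
In PromQL, that might look like the following (http_requests_total is just an
example counter name):

    # total requests currently visible across all scraped instances
    sum(http_requests_total)

    # cluster-wide requests per second, averaged over the last 5 minutes
    sum(rate(http_requests_total[5m]))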

But yes it is possible to miss some requests if a node goes down without
Prometheus collecting the latest stats.

But as the parent said, if you need such totals it might be better to store
them persistently. Also, I do not know of a scenario where the total number of
requests would trigger an alert.

~~~
bbrazil
> But yes it is possible to miss some requests if a node goes down without
> Prometheus collecting the latest stats.

The rate() function allows for this; you'll get the right answer on average.

------
irl_zebra
I've been working with Prometheus for a while. The only sticking point has
been the 15-day storage of metrics. Does anyone know why this is the limit on
storage, or have any strategies for long-term storage of metrics?

~~~
alaties
As lobster_johnson stated, Prometheus storage retention is a command line
launch option.

Prometheus is intentionally not designed as a long-term cold-storage option
for metrics. You _can_ store metrics for as long as your storage allows, but
Prometheus is not going to replicate or manage that data to prevent long-term
degradation. Depending on your long-term needs, the preferred pattern is to
roll data off Prometheus into a metrics store that better handles data over
the scope of months. Rolling data off of Prometheus is done with an exporter
and documented
[here](https://prometheus.io/docs/operating/integrations/#remote-endpoints-and-storage).
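
To make that concrete: local retention is a launch flag, and shipping samples
to a longer-term store is a remote-write entry in prometheus.yml. A rough,
illustrative sketch (the flag applies to Prometheus 2.x; the URL is made up):

    # launch flag controlling how long Prometheus keeps local data:
    #   --storage.tsdb.retention=90d
    #
    # prometheus.yml fragment sending samples to a long-term store:
    remote_write:
      - url: "http://long-term-store.example.com/api/v1/write"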

In most operational use cases I've come across, we've only needed about a
month of data, so keeping it in Prometheus was kosher. YMMV.

~~~
zaroth
What if you want to keep time-series data for accounting purposes forever?

Right now I use Graphite, but I would love something that also handles
replication/redundancy, with a good query language and enough performance to
also use it to fetch the underlying data used to render front-end graphs for
_users_.

~~~
alaties
The answer to these kinds of questions is very, very dependent on requirements
and available resources. And there are rarely easy, out-of-the-box answers.

The big questions to ask are:

* What data is of interest?

* What are the requested ingest patterns? (How is data getting into this system? How frequently? What rate of ingest is expected?)

* What are the requested query patterns? (Who is doing queries? What do those queries look like specifically? Are people querying over unbounded time ranges, or are people doing more focused queries? Do queries regularly involve aggregations or not?)

* What is the requested SLA? (i.e. How much partial downtime is okay? How much full downtime is okay? What kind of query response time do you need to target?)

* What resources are/will be available? (Money, manpower, compute, storage, tech on hand.)

It's possible that a time series data store might not be the correct system
choice once these questions are answered. It's not unusual to see data split
or copied into multiple systems to answer all requirements.

~~~
zaroth
The most important consideration is the scale. A low scale system will not
require an optimum solution, merely a correct one.

So I would say the system is a 5-minute ingest of about 1 million metrics.
This is spread out over half a dozen locations, each of which currently
records into its own silo.

An aggregate metric is calculated with a 5-minute lag, which reads all the
just-written data points and aggregates them into sum totals, which are
themselves stored and cached in one place. This is another million metrics,
basically stored separately from the rest.

But it doesn't really change the character of the system. In the end I'm
trying to:

* Write batches of mostly numeric data

* Query over time against those numbers

* Aggregate data over different blocks of time: 5-minute, hourly, daily, monthly

* Store it efficiently

* Ensure redundancy and integrity

Seems like a simple and common enough problem to have been reasonably "solved"
for orders of magnitude higher scale than I'm operating at.

~~~
alaties
From what I've seen, each solution has been pretty specific to the situation.
I've rarely seen one tech stack prevail at low to reasonable scale.

At reasonable scale, I've seen people get really far with pure Graphite setups
by utilizing tools like
[carbon-c-relay](https://github.com/grobian/carbon-c-relay),
[carbonate](https://github.com/graphite-project/carbonate), and high-integrity
filesystems like ZFS underneath. It's a very hands-on operation though. Things
like growing the cluster are hard to do without downtime.

If constant growth and uptime are concerns, something like OpenTSDB might be a
great choice. The complexity of setting up a Hadoop + HBase cluster is a
pretty big upfront cost, but man, is this thing the cockroach of time series
data stores. Adding storage is just growing the HBase cluster. Querying across
years of data is pretty simple and straightforward. For the complexity
involved, OpenTSDB is worth it.

------
bogomipz
>"Monitoring Cloudflare's planet-scale edge network with Prometheus"

Is "planet scale" better than "web scale"?

~~~
dsl
It's just Cloudflare trying to make themselves sound bigger. Still nowhere
near Google or Amazon.

~~~
user5994461
They are similar in terms of users, requests and bandwidth.

~~~
bogomipz
Do you have a citation for that claim?

~~~
user5994461
They serve 10% of the internet traffic and have over 1 billion users. They
receive more than 5 million requests per second on average.

~~~
bogomipz
That's not really a citation. Where are these figures published? And what does
"10% of the internet traffic" mean? By bandwidth? By request volume?

~~~
user5994461
You could start by reading the article you are commenting on. The traffic is
measured by requests.

You may not be aware of it, but yes, Cloudflare is a significant internet
company.

~~~
bogomipz
It's not actually an article; it's a slide deck with a bunch of marketing on
it. And I did read it. These are unsubstantiated numbers. 10% of total daily
HTTP requests? Where is the total daily number of HTTP requests on the
internet published? Cisco's VNI approximates bandwidth, but I have never seen
any number for the total HTTP requests on the internet.

I am aware that Cloudflare is a CDN; most CDNs are substantial internet
companies. However, I've worked at a couple of CDNs, and they all throw a lot
of marketing numbers around.

~~~
Kalium
If memory serves, the 10% isn't based on some kind of global estimate of HTTP
traffic. It's based on what percentage of active IP-space they see on a daily
basis.

~~~
bogomipz
Now that's a marketing metric :)

