
Tell Me More, Nginx (2017) - kiyanwang
https://www.honeycomb.io/blog/tell-me-more-nginx/
======
latch
The company that I recently joined uses honeycomb.io, versus ELK at my
previous job. Maybe I haven't played with it enough, or maybe we have it badly
configured, but I find it a huge step backwards. As far as I know, there's no
full text search and the types of reports and aggregation you can do are
extremely primitive. The UI is also barely usable: huge horizontal scrolling,
no way to click on a specific value to apply that sub-filter. The list goes on
and on. I wish the devops team would focus on these fundamentals (proper
centralized logging) before playing with all the "cool" and trendy toys.

As for logging, our `log_format` directive looks like:

      log_format stats '$time_iso8601\t$host\t$request_method\t$request_uri\t$request_length\t$status\t$upstream_cache_status\t$bytes_sent\t$request_time\t$upstream_response_time\t$upstream_http_x_route\t$client_id';

This captures most of the generic data you'd associate with a request (time,
host, method, uri, length) as well as the generic response data you'd care
about (status, length, time to reply). Note that `$request_time` measures the
full time nginx spends serving the request, from the first bytes read from the
client to the last bytes sent back, whereas `$upstream_response_time` measures
only the time spent on the upstream's response.

If you're using nginx as a cache, `$upstream_cache_status` tells you the cache
hit/miss status.

A list of all variables is available at
[http://nginx.org/en/docs/varindex.html](http://nginx.org/en/docs/varindex.html)
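Since the format above is tab-separated, each line splits cleanly into fields. A minimal Python sketch of parsing it (the sample line and its values are illustrative, not real traffic from the post):

```python
# Field names mirror the variables in the `stats` log_format above.
FIELDS = [
    "time_iso8601", "host", "request_method", "request_uri",
    "request_length", "status", "upstream_cache_status", "bytes_sent",
    "request_time", "upstream_response_time", "upstream_http_x_route",
    "client_id",
]

def parse_stats_line(line: str) -> dict:
    """Split one tab-separated access-log line into a field dict."""
    return dict(zip(FIELDS, line.rstrip("\n").split("\t")))

# Illustrative line (invented values):
sample = ("2017-01-01T00:00:00+00:00\tapi.example.com\tGET\t/v1/users/42"
          "\t431\t200\tHIT\t1024\t0.012\t0.010\tusers:show\tacme")
rec = parse_stats_line(sample)
```

From here, `rec["status"]`, `rec["upstream_cache_status"]`, and friends are ready for whatever aggregation you want.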

All of our services can set an "x-route" header, which helps canonicalize URIs
into something more meaningful. It's up to the service to decide what to
do, but /v1/users/ID could be called "users:show". A more complex route might
use a different name based on arguments. For example, depending on the
parameters we expect some hits to /v1/reports to be very fast, and some to be
very slow, so we'll set the name to "reports:list:fast" or "reports:list:slow"
so that we can get more accurate statistics.
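The kind of mapping a service might perform could be sketched like this (the patterns and the fast/slow cutoff below are invented for illustration, not the author's actual rules):

```python
import re

def route_name(path: str, params: dict) -> str:
    """Canonicalize a concrete URI into a route name for logging.

    The regexes and the "slow" heuristic are hypothetical examples.
    """
    if re.fullmatch(r"/v1/users/\d+", path):
        return "users:show"
    if path == "/v1/reports":
        # Pretend reports spanning more than 31 days are the slow case.
        slow = int(params.get("days", 0)) > 31
        return "reports:list:slow" if slow else "reports:list:fast"
    return "unknown"
```

The service would then emit the result as its `x-route` response header, which nginx logs via `$upstream_http_x_route`.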

Finally, we use OpenResty to do some initial global authentication (along
with some other stuff). This is where $client_id comes from. All requests go
through something like:

    location / {
        set $client_id '';
        set $upstream '';
        access_by_lua_block {
            require('execute')()
        }
        proxy_pass http://$upstream;
    }

A piece of that execute code will possibly set $client_id to something which
will then get logged.

~~~
mherdeg
> As far as I know, there's no full text search and the types of reports and
> aggregation you can do are extremely primitive.

My understanding is that it's helpful to think of honeycomb.io as a
replacement for Scuba -- which, to my understanding, is Facebook's tool for
doing fast math and aggregations on metric data even if those metrics come
from things that look like logs.

There are some "visualize" things you can do in the Kibana UI which really
stretch ELK's capabilities (graphing p99 latency per handler for every
one-second bucket of a sample minute in a high-throughput service that writes
all its hits to log files, for example). Where Kibana and its backends start to
stutter, or when certain kinds of graphs are just impossible, you might begin
to be well-served by Scuba-style tools.

Not having worked at Facebook, I have only heard about this secondhand. After
a certain scale -- when you have many many requests per second and when
handling a request involves a forest of other services doing some work in
series or parallel -- you start to feel like you can no longer rely on raw
logs and start to lean on tools that do things like distributed tracing or
aggregations; if your tools work well, eventually you may even feel a disdain
for those raw logs and feel like you're _better_ off using the more advanced
tools.

I feel a bit left behind sometimes because I live pragmatically in both worlds
(I'm excited to get results rapidly from newer visibility stacks, but honestly
am also still happy to log into a system and read its logs -- just in case we
ever need to do that).

------
ludwigvan
Tacking a custom header to the response is a nice trick!

Could be particularly useful for GraphQL apps where every response is a big
opaque answer from `/api/graphql`: Add the GraphQL operation name to the
response header.

This will also help with Chrome Devtools where you can add custom headers to
the Network tab, otherwise you just see tons of POST reqs to `/api/graphql`:
[https://umaar.com/dev-tips/115-manage-network-response-headers/](https://umaar.com/dev-tips/115-manage-network-response-headers/)
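One way a service could implement this tip is to pull the operation name out of the GraphQL POST body and echo it in a response header (the header name and fallback here are hypothetical):

```python
import json

def graphql_operation_name(raw_body: bytes) -> str:
    """Extract the operation name from a GraphQL request body,
    so a framework hook can set it as a response header, e.g.
    response.headers["X-GraphQL-Operation"] (header name invented)."""
    payload = json.loads(raw_body)
    # Prefer the explicit operationName field; fall back to a placeholder.
    return payload.get("operationName") or "anonymous"

body = b'{"operationName": "GetUser", "query": "query GetUser { user { id } }"}'
```

With that in place, the header shows up both in nginx logs (as an `$upstream_http_...` variable) and in the Devtools Network tab.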

------
mooreds
Web servers are the last piece of most architectures to see every write
request (caches will see every request, including reads), which makes them the
logical place to capture system-wide data. (Caches, if configurable, may be a
slightly better option.) I haven't done a lot of nginx config, but appreciated
the pointers in this post. On the other hand, it was a bit short, with no
pointer to the next installment.

By the way, if you haven't read the blog of Honeycomb's founder, it's full of
interesting, provocative opinions:

[https://charity.wtf/](https://charity.wtf/)

------
amelius
Ok, but if privacy-sensitive data enters your logs, you'd better have a way to
clean them.

------
minaguib
FWIW, past a certain scale, it makes a lot more sense to export dedicated
telemetry out of your app into a TSDB.

The reason is, it's thousands of times cheaper to +1 a counter/record a gauge
and flush it every once in a while, than to serialize per-request metrics into
HTTP headers, log them, and re-parse them for analytics later.
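The cost argument can be sketched as an in-process aggregator that batches counter increments in memory and emits one record per flush interval (class and names invented for illustration):

```python
import time
from collections import Counter

class MetricBuffer:
    """Accumulate counter increments in memory and flush them in one
    batch, instead of serializing a record per request."""

    def __init__(self, flush_interval: float = 10.0):
        self.flush_interval = flush_interval
        self.counters = Counter()
        self.last_flush = time.monotonic()

    def incr(self, name: str, value: int = 1) -> None:
        self.counters[name] += value  # cheap: a dict update, no I/O

    def maybe_flush(self):
        """Return and reset the batch once the interval has elapsed."""
        if time.monotonic() - self.last_flush < self.flush_interval:
            return None
        batch, self.counters = dict(self.counters), Counter()
        self.last_flush = time.monotonic()
        return batch  # one TSDB payload covering many requests

buf = MetricBuffer(flush_interval=0.0)  # flush immediately, for the demo
for _ in range(1000):
    buf.incr("http.requests")
```

A thousand requests become a single `{"http.requests": 1000}` write, versus a thousand log lines to serialize and re-parse.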

~~~
cyen
Totally, if all you care about is just that counter/gauge -- but more and more
often, you need that counter/gauge captured for a particular segment of your
traffic (e.g. some app-level identifier), and TSDBs tend to struggle[0] as the
number of possible segments explodes.

If all you care about is overall latency, awesome! Use a TSDB. Once you care
about latency per endpoint/user agent/customer ID/client platform (or
combination thereof), you need the flexibility associated with structured log
data, stored in something meant for fast analytical querying.

0: [https://www.honeycomb.io/blog/the-problem-with-pre-aggregated-metrics-part-2-the-aggregated/](https://www.honeycomb.io/blog/the-problem-with-pre-aggregated-metrics-part-2-the-aggregated/)

------
vbsteven
Nginx is an amazing workhorse and can help out with
logging/monitoring/metrics.

Some tools I like to use to improve the experience:

* Fluentular [1] - A tool for matching Fluentd log parsing to your customized nginx log format

* Nginx-lua-prometheus [2] - Library for exposing request data to prometheus

[1] [https://fluentular.herokuapp.com/](https://fluentular.herokuapp.com/)

[2] [https://github.com/knyar/nginx-lua-prometheus](https://github.com/knyar/nginx-lua-prometheus)

------
tyingq
Related, Apache has %D to put request times in logs, and mod_log_config if you
want to log header values. Logging header values has some security/privacy
implications though.

~~~
combatentropy
I'm surprised more people don't use the logs in web servers. Out of the box
they are already on and running with little overhead. They usually tell me
everything I want to know about usage.

But I don't do things like custom headers:

> a header value generalizing the shape of the endpoint path
> (e.g. /products/foobarbaz → /products/:prod)

Instead I analyze the logs with Bash (grep, cut, awk, sort, uniq). In fact I
think Apache's logs are one of the main places where I learned these tools.

~~~
umanwizard
Sorry for the nitpick, but it’s a pet peeve of mine that people think the
entire suite of Unix command line tools is part of “bash”.

~~~
combatentropy
Thank you, I will try to avoid it. In fact I started calling them all "bash"
only recently, seeking a short name instead of "Unix command-line tools." But
you're right, it's inaccurate.

~~~
CamJN
The tools you listed can be referred to as coreutils:
[https://en.wikipedia.org/wiki/GNU_Core_Utilities](https://en.wikipedia.org/wiki/GNU_Core_Utilities)

------
perfmode
envoy proxy knocks this out of the park. one can even emit detailed log data
as proto over grpc.

------
bovermyer
This is useful to know. I wish I'd known it five years sooner.

~~~
pighive
I am curious to know, why?

~~~
bovermyer
It would have been useful in some monitoring efforts I was part of back then.

