
CloudWatch Is of the Devil, but I Must Use It - QuinnyPig
https://www.linuxjournal.com/content/cloudwatch-devil-i-must-use-it
======
djhworld
The UI for CloudWatch Logs completely crumbles if you use it for Lambda
functions, especially if you have a lot of executions.

We run ETL pipelines and have lambda as a component of that, maybe 250k-500k
invocations a day.

If you want to search through the logs for a particular invocation at a
particular time (e.g. when an error happened) the UI just cannot handle it.

I think part of the reason is Lambda creates a new "log stream" every time a
new container is created, which I'm guessing is how they shard the logs. So
you end up with hundreds/thousands of log streams within your lambda log group
and the backend has to search them to find the particular search pattern you
are requesting.

Most of the time we have to export the logs to S3, download them and do some
manual grepping.
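
At least the export step can be scripted. A minimal sketch with boto3, using a
placeholder log group and bucket (the bucket policy must allow CloudWatch Logs
to write to it):

```python
import time
import boto3

# Hypothetical names for illustration -- substitute your own.
LOG_GROUP = "/aws/lambda/my-etl-function"
BUCKET = "my-log-exports"

logs = boto3.client("logs")

# Export the last 24 hours of the log group to S3.
now_ms = int(time.time() * 1000)
task = logs.create_export_task(
    logGroupName=LOG_GROUP,
    fromTime=now_ms - 24 * 60 * 60 * 1000,
    to=now_ms,
    destination=BUCKET,
    destinationPrefix="exports",
)

# Export tasks are async; poll until this one finishes.
while True:
    resp = logs.describe_export_tasks(taskId=task["taskId"])
    status = resp["exportTasks"][0]["status"]["code"]
    if status in ("COMPLETED", "CANCELLED", "FAILED"):
        break
    time.sleep(5)

# From there: aws s3 sync s3://my-log-exports/exports . && zgrep ERROR ...
```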

~~~
niklasrde
We move our log streams to our ElastiCache cluster to make them somewhat
usable, using another Lambda that receives log events via subscription filters
attached to each log group.

The processor also periodically scans for new log groups under the
/aws/lambda/ prefix and creates subscriptions for any that don't have one
already.
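
The scanning half is only a handful of API calls. A minimal boto3 sketch,
assuming a processor Lambda already exists (the ARN and filter name below are
placeholders):

```python
import boto3

logs = boto3.client("logs")

# Hypothetical values for illustration.
PROCESSOR_ARN = "arn:aws:lambda:eu-west-1:123456789012:function:log-processor"
FILTER_NAME = "ship-to-processor"

paginator = logs.get_paginator("describe_log_groups")
for page in paginator.paginate(logGroupNamePrefix="/aws/lambda/"):
    for group in page["logGroups"]:
        name = group["logGroupName"]
        existing = logs.describe_subscription_filters(logGroupName=name)
        if existing["subscriptionFilters"]:
            continue  # already subscribed
        # An empty filter pattern forwards every log event to the processor.
        # (CloudWatch Logs also needs lambda:InvokeFunction permission on the
        # processor, granted separately via lambda add-permission.)
        logs.put_subscription_filter(
            logGroupName=name,
            filterName=FILTER_NAME,
            filterPattern="",
            destinationArn=PROCESSOR_ARN,
        )
```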

Speaking of which, this may be a good candidate for a more in-depth blog post.

~~~
shshhdhs
Yes, I’d love to see the blog post!

------
mjulian
He gets linked here pretty regularly, but for those who don't follow him, the
author runs an email newsletter that's basically him throwing shade at AWS
every week. It's fucking hilarious AND useful.

[https://lastweekinaws.com/](https://lastweekinaws.com/)

~~~
chair6
Yes! Last Week In AWS and Python Weekly are the only two weekly email
newsletters I've found to be worth reading in their entirety every issue...

------
ris
Something that wasted a lot of my time this and last week was discovering
that, _from what I can tell_, the CloudWatch Metrics API will almost never
give you an actual error message. Request something with the wrong
credentials? "Sure, here's a 200 response with zero datapoints". Request a
metric that doesn't exist? "Sure, here's a 200 response with zero datapoints".
Request a metric from the wrong region? "Sure, here's a 200 response with zero
datapoints"...

It makes narrowing down a problem extremely un-fun.
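
So the only defense is to treat an empty result as suspicious yourself. A
minimal boto3 sketch with placeholder namespace/metric/dimension values:

```python
from datetime import datetime, timedelta, timezone
import boto3

cw = boto3.client("cloudwatch", region_name="us-east-1")

now = datetime.now(timezone.utc)
resp = cw.get_metric_statistics(
    Namespace="AWS/Lambda",  # placeholder values throughout
    MetricName="Errors",
    Dimensions=[{"Name": "FunctionName", "Value": "my-function"}],
    StartTime=now - timedelta(hours=1),
    EndTime=now,
    Period=300,
    Statistics=["Sum"],
)

# A typo in any field above still comes back as HTTP 200 with an empty list,
# so an explicit check is the only way to notice something is off.
if not resp["Datapoints"]:
    raise RuntimeError(
        "Zero datapoints: wrong region, wrong dimensions, bad credentials, "
        "or a metric with genuinely no data -- CloudWatch won't say which."
    )
```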

------
unkoman
I'd argue that CloudWatch is mainly an event source to drive other AWS
infrastructure, and a monitoring tool second, which might be why the
experience is so sour.

~~~
QuinnyPig
You could definitely make that argument, but the CloudWatch product page's
first sentence is "Amazon CloudWatch is a monitoring and management service
built for developers, system operators, site reliability engineers (SRE), and
IT managers."

CloudWatch Events is largely solid; I agree with you!

~~~
derefr
Think of them as two separate products, and then mentally invert the branding:
"CloudWatch" (and the rest of its umbrella) are really sub-products of
"CloudWatch Events."

How did that happen? Accident of history:

First, CloudWatch Events was built for internal AWS usage.

Then, CloudWatch was _also_ built for internal AWS usage (EC2 health checks,
fed by the event stream!)

Then, CloudWatch was exposed, since people need to be able to configure those
health checks and provision them in CloudFormation and such. So, they had this
"monitoring" service—may as well slap together a logging dashboard and stuff.
(And then forget about that as soon as the MVP is done, because writing their
own equivalent of the Elastic-Logstash-Kibana stack was never AWS's intent.)

Then, once SNS came around and there was a user-controllable destination for
the raw CloudWatch Events stream to be routed to, CloudWatch Events could be
exposed, too. It got the sub-name because the higher-level CloudWatch product
had been exposed first.

If SNS had come about _before_ CloudWatch, we'd probably just have CloudWatch
Events with a set of suggested "one-click launchable partner AMIs" for
ingesting+indexing+searching/analyzing your AWS event data, and no first-party
dashboard at all. And "CloudWatch" would have remained an internal feature
powering health checks, without its own branded dashboard; just an unbranded
presence under other services' dashboards/RPC endpoints/CloudFormation type
hierarchies.

~~~
tracker1
I really wish they would go so far as to build a Kibana-like front end and
expand CW to Elasticsearch levels of UX.

In the end, I really appreciate the ELK stack; if you can either use
Elasticsearch for other things or have enough logging to justify a cluster,
it's a great option.

------
sudosteph
If only I had a nickel for every blog post decrying an AWS Service for not
being something it wasn't ever intended to be...

I'll just go over his complaints about CloudWatch and Lambda:

> If you're using Lambda or Fargate, you have no choice but to use CloudWatch
> Logs, wherein searching for everything is absolutely terrible.

I can't speak to Fargate, but you can absolutely use Lambda without CW Logs.
Just don't give your Lambda an IAM role that lets it write to CW; boom, no
more logs in CW. If you want to send them to another service, there are
options too. I don't see why something like this couldn't run on a Lambda:
[https://www.loggly.com/docs/python-http/](https://www.loggly.com/docs/python-http/)
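
A bare-bones sketch of that idea, posting structured log lines to an external
HTTP collector from inside the handler (the endpoint URL is a placeholder;
Loggly's actual ingest API may differ):

```python
import json
import urllib.request

# Placeholder endpoint -- substitute your log collector's HTTP ingest URL.
COLLECTOR_URL = "https://logs.example.com/ingest"

def ship(event_dict):
    """POST a single structured log event to the external collector."""
    req = urllib.request.Request(
        COLLECTOR_URL,
        data=json.dumps(event_dict).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req, timeout=5)

def handler(event, context):
    ship({"level": "info", "msg": "invocation started",
          "request_id": context.aws_request_id})
    # ... do the actual work ...
    return {"ok": True}
```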

His example of diagnosing an error is equally silly.

> Find the fact that it encountered an error in the first place by looking at
> the invocation error CloudWatch dashboard. I also could set up a filter to
> run a continuous query on the logs and alert when something shows up, except
> that isn't natively supported—I need a third-party tool for that (such as
> PagerDuty).

False. It's trivial to set up a CloudWatch Alarm + SNS notifications on lambda
failures. SNS can go to email, phone, whatever. I do it all the time. I've
even got a little slack bot that listens to certain SNS notifications and then
gives a link back to cloudwatch logs.
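
For the record, the whole alarm-to-notification path is a couple of API calls.
A sketch; the topic, email, and function names are placeholders:

```python
import boto3

sns = boto3.client("sns")
cw = boto3.client("cloudwatch")

# Placeholder names for illustration.
topic_arn = sns.create_topic(Name="lambda-failures")["TopicArn"]
sns.subscribe(TopicArn=topic_arn, Protocol="email",
              Endpoint="oncall@example.com")

# Alarm whenever the function records any error in a one-minute window.
cw.put_metric_alarm(
    AlarmName="my-function-errors",
    Namespace="AWS/Lambda",
    MetricName="Errors",
    Dimensions=[{"Name": "FunctionName", "Value": "my-function"}],
    Statistic="Sum",
    Period=60,
    EvaluationPeriods=1,
    Threshold=0,
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=[topic_arn],
)
```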

> Go diving into a variety of CloudWatch log groups and find the one named
> after the specific erroring function.

> Scroll manually through the many, many, many pages of log groups to find the
> specific invocation that threw an error.

Or just go to the Lambda you're working on, click the nice blue "Monitoring"
tab, and follow the link to the logs there.

> Realize that the JSON object that's retained isn't enough to troubleshoot
> with, cry in despair, and go write an article just like this one.

If you aren't logging what needs to be logged, that's kinda on you.

~~~
idbehold
> I don't see why something like this couldn't run on a Lambda:
> [https://www.loggly.com/docs/python-http/](https://www.loggly.com/docs/python-http/)

It would, but then you're also paying for the time it takes to flush those
logs to the other service before the process exits.

------
esotericn
The pricing stuff really bothers me with AWS. Maybe I missed something in the
interface? (It's bloody complicated).

Seemingly Amazon can provide and spin up all of these services at a second's
notice, but can't provide realtime billing information, or realtime
predictions.

You have _so many_ seemingly arbitrary sources of cost that for anything
moderately complicated it feels very much like the model is "use it, see what
it costs, pay us for it, then cancel if you don't like it" which doesn't
really mix well with, well, how I spend money.

Maybe it's a deliberate strategy to squeeze out low margin customers or
something?

(edit: so theoretically, given that all the pricing information is public, I
could write my own backend that comes up with a figure on the fly... anyone
got anything clever like that? :P)
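
You more or less can: the public price list is queryable through the AWS Price
List API. A minimal sketch (the filters shown are just one example; actually
turning the returned price structure into a monthly figure is the real work):

```python
import json
import boto3

# The Pricing API only lives in a couple of regions, us-east-1 among them.
pricing = boto3.client("pricing", region_name="us-east-1")

resp = pricing.get_products(
    ServiceCode="AmazonEC2",
    Filters=[  # example filters; add tenancy/OS/etc. to narrow it down
        {"Type": "TERM_MATCH", "Field": "instanceType", "Value": "t3.micro"},
        {"Type": "TERM_MATCH", "Field": "location",
         "Value": "US East (N. Virginia)"},
        {"Type": "TERM_MATCH", "Field": "operatingSystem", "Value": "Linux"},
    ],
    MaxResults=10,
)

# Each entry in PriceList is a JSON string describing one product and its
# on-demand/reserved pricing terms -- the hard part is walking that structure.
for raw in resp["PriceList"]:
    product = json.loads(raw)
    print(product["product"]["attributes"].get("instanceType"))
```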

~~~
AjayTripathy
Hey, former Google Cloud SRE here-- now work on projects to help companies
manage cloud spend. Prediction and especially recommendations for lowering
cloud spend can get tricky, but I wrote a piece of software and some grafana
dasboards that with one click can deploy and calculate cluster costs if you're
using kubernetes. Obviously this doesn't include, say, s3 storage costs per
bucket, and doesn't much help if you're not using kubernetes, but this is just
a first step.

More general solutions can be instrumented with CloudHealth
([https://www.cloudhealthtech.com/](https://www.cloudhealthtech.com/)), but
that's a bit more of an enterprise-y solution.

[https://medium.com/kubecost/effectively-managing-kubernetes-with-cost-monitoring-96b54464e419](https://medium.com/kubecost/effectively-managing-kubernetes-with-cost-monitoring-96b54464e419)

------
pjungwir
I've tried all kinds of monitoring solutions, and my favorite is still
munin[1]. Out of the box it gives you tons of system-level metrics. It comes
with contrib plugins for service-level metrics about Postgres, Nginx, and
dozens of other popular tools. Then you can write your own plugins for higher-
level things. A plugin only has to print a few lines of text, and you can
write it in whatever language you like. I have a plugin for Phusion
Passenger[2] and another for blocking Postgres queries[3]. You could even
collect more application-level metrics if you wanted.
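
To give a sense of how little a plugin is: here's a hypothetical one in Python
that graphs a single value (munin calls a plugin with `config` once to get the
graph metadata, then with no arguments to fetch values):

```python
#!/usr/bin/env python3
"""Minimal munin plugin: graphs the number of lines in a queue file.

The file path is a made-up example; a real plugin would measure whatever
you care about. Symlink this into /etc/munin/plugins/ to enable it.
"""
import sys

QUEUE_FILE = "/var/spool/myapp/queue"  # hypothetical

if len(sys.argv) > 1 and sys.argv[1] == "config":
    # 'config' mode: describe the graph to munin.
    print("graph_title MyApp queue depth")
    print("graph_vlabel items")
    print("graph_category myapp")
    print("queue.label queued items")
else:
    # Fetch mode: print one 'fieldname.value N' line per metric.
    try:
        with open(QUEUE_FILE) as f:
            depth = sum(1 for _ in f)
    except FileNotFoundError:
        depth = 0
    print(f"queue.value {depth}")
```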

More things I love about munin:

- It is _fast_. It's just pre-generated static HTML files, so you never have
to wait for a page to load.

- Simple to configure: the only tricky part is a config file with a list of
all your nodes, but that is easy to generate with
Chef/Puppet/Ansible/Fog/whatever.

- Information density: Tufte would love it. I think no commercial tool comes
close in terms of getting raw info onto the screen at once. This is so
valuable for seeing trends or anomalies. When something is going haywire,
munin often helps me find the problem in minutes.

- Lots of graphs all at once: I also like that I don't have to navigate or
fiddle with zoom levels to see what I want. Navigation is almost always one
click and then Ctrl-F or the mouse wheel.

- Plenty of history: you can look back at the last year of data to see
trends.

- Easy to install: just apt-get. Adding a plugin is just a symlink.

- Your data never leaves your own machines.

I think CloudWatch is fine for AWS-specific metrics, like your EBS burstable
I/O budget, but for everything else I'd much rather use munin.

[1] [http://munin-monitoring.org/](http://munin-monitoring.org/)

[2]
[https://github.com/pjungwir/munin_passenger](https://github.com/pjungwir/munin_passenger)

[3]
[https://github.com/pjungwir/munin_postgres_extras](https://github.com/pjungwir/munin_postgres_extras)

~~~
amq
As much as I like munin, it just doesn't cover current needs around
application monitoring, containers, dynamically scaled clusters, and more
detailed metric inspection. The go-to solution for me now is Prometheus +
Grafana; it does all of the above and is directly supported by many products.
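
For the application-monitoring part, instrumenting a service is a few lines
with the official Python client (a sketch; the metric name and port are
arbitrary):

```python
import random
import time

# pip install prometheus-client
from prometheus_client import Counter, start_http_server

# Arbitrary example metric.
JOBS_PROCESSED = Counter("jobs_processed_total",
                         "Jobs processed by this worker")

if __name__ == "__main__":
    start_http_server(8000)  # Prometheus scrapes http://host:8000/metrics
    while True:
        JOBS_PROCESSED.inc()  # count one unit of work
        time.sleep(random.uniform(0.1, 1.0))
```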

------
adreamingsoul
I wonder how the Amazon engineers and developers of AWS services feel about
this post and people’s responses.

~~~
QuinnyPig
I've never known Amazon to ignore customer pain once they're made aware of it.
We'll see what happens...

