
Ask HN: How do you do monitoring/observability for dev/staging environment? - pranay01
If you are using APM tools like DataDog, NewRelic - do you also use them for the staging environment? How do you optimise performance on staging - or is the task of optimization left to the production env. as long as things are working fine?

If you use APM tools for the staging/dev environment also - doesn't it increase your cost significantly?
======
eoinbmorg
We have all our staging systems report to DD and Sentry, tagged with an
`env:stage/prod/whatever` tag to slice the metrics by. This helps maintain
parity for alerting too (although PagerDuty is not enabled for stage).
Sometimes it's tough to get right because the lower volume of traffic on stage
makes the alert metric queries very noisy. For example, alerts that fire for
error rates may not resolve if no new successful requests come in for a few
hours on stage.
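
A rough sketch of one way to tame that noise: only evaluate the error rate
when the window has enough requests, and treat a quiet window as its own
state. This is plain Python, not Datadog's or Sentry's API; `Window`,
`MIN_REQUESTS`, `MAX_ERROR_RATE`, and the threshold values are all made up
for illustration.

```python
from dataclasses import dataclass


@dataclass
class Window:
    env: str       # mirrors the env: tag, e.g. "stage" or "prod"
    requests: int  # total requests seen in the evaluation window
    errors: int    # failed requests in the same window


# Hypothetical thresholds; tune them per environment.
MIN_REQUESTS = {"prod": 100, "stage": 20}
MAX_ERROR_RATE = 0.05


def alert_state(w: Window) -> str:
    """Return "alert", "ok", or "no_data" for one evaluation window."""
    # Too few requests: the rate is statistically meaningless, so do not
    # fire or hold an alert on it. This also avoids dividing by zero.
    if w.requests < MIN_REQUESTS.get(w.env, 20):
        return "no_data"
    rate = w.errors / w.requests
    return "alert" if rate > MAX_ERROR_RATE else "ok"


print(alert_state(Window("stage", 3, 2)))     # quiet window
print(alert_state(Window("stage", 40, 10)))   # enough traffic, 25% errors
```

Whether "no_data" should resolve an open alert or leave it untouched is a
judgment call; keeping it as a separate state at least makes the choice
explicit per environment.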

~~~
pranay01
How do you test the staging environment - is it replay of prod data or load
generated by scripts? How much extra cost does it incur on DD & Sentry (as a %
of prod monitoring cost)?

~~~
eoinbmorg
This is the hard part... I would say my current company does a poor job, and
every other company I've worked for has also done a poor job. They've treated
it like an internal playground for devs to validate that their features work,
not a representative copy of prod with all the scaling problems and user-data
funkiness that come with it. Here are some options I see:

- load a replica of your prod DB to stage daily/weekly and have all the same
ETL jobs running

- set up load testing or user behavior regression tests to automatically go
through critical pathways like user authentication and registration ("bare
essentials" functionality, since writing these is tedious). This might be a
good chance to use traffic-capture to at least get started/make setting up
these behavior tests easier

- if it's a consumer-facing product, have employees dogfood the product on
stage

- if it's a product for businesses, run your business off the stage or a 3rd
slightly more stable "internal" environment to create some consequences for
not keeping it running smoothly.
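
For the behavior-test option, a bare-bones runner could look something like
the sketch below. The check functions are hypothetical stand-ins (a real one
would drive your stage registration or login flow over HTTP); only the
runner shape is the point.

```python
from typing import Callable, List, Tuple

# A check is a name plus a zero-argument callable returning True on success.
Check = Tuple[str, Callable[[], bool]]


def run_checks(checks: List[Check]) -> List[str]:
    """Run each named check and return the names of the ones that failed."""
    failed = []
    for name, check in checks:
        try:
            ok = check()
        except Exception:
            # A crash in a critical pathway counts as a failure, not a skip.
            ok = False
        if not ok:
            failed.append(name)
    return failed


# Hypothetical stand-ins for real requests against stage.
def can_register() -> bool:
    return True


def can_log_in() -> bool:
    raise ConnectionError("stage auth service unreachable")


failures = run_checks([("register", can_register), ("login", can_log_in)])
print(failures)
```

Running this on a schedule against stage also generates a trickle of steady
traffic, which helps a little with the low-volume alerting problem.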

In my experience, stage has not had representative load, so the extra billing
is proportionally smaller (since you're paying for what you use in most
cases). I don't know the billing specifics, but you can also consider dropping
the log/metric retention window significantly on stage (say 1 month instead of
6 months) to save costs.

Ultimately I don't think you're going to get the same scaling problems to
manifest on stage. It's more of a functionality testing ground IME.

Note: I have 3 YOE as a dev, so don't base your whole business plan on my ideas.

~~~
pranay01
Got it. Have you tried traffic replay tools like
[https://github.com/buger/goreplay](https://github.com/buger/goreplay)?

~~~
eoinbmorg
I don't personally have experience with this approach

------
linsomniac
Our staging environment and our dev database both run on very under-powered
servers, by design. The idea is that we notice performance problems, at least
in our code and queries, and catch the most egregious cases.

We do feed data into influx via telegraf, elasticsearch, and sentry, and also
do system monitoring via Icinga2. But in dev/stg it is not treated as
actionable.

~~~
mrwnmonm
What is your website?

