
Are you making sure that you're sampling traces, but sending over all errors?

At a former place, we were doing 5% of non-error traces.




Careful, we've had systems go down under the increased load of just emitting errors if they didn't emit much in the non-error state.


Can you go into more detail about your comment, please?


Not the GP, but:

Imagine you're sampling successful traces at, say, 1%, but sending all error traces. If your error rate is low, maybe also 1%, your trace volume will be about 2% of your overall request volume.

Then you push an update that introduces a bug and now all requests fail with an error, and all those traces get sampled. Your trace volume just increased 50x, and your infrastructure may not be prepared for that.
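
As a toy illustration in Go (hypothetical shouldExport helper, not from any tracing SDK), the policy and the amplification look like this:

    package main

    import (
        "fmt"
        "math/rand"
    )

    // Sketch of the policy described above: always export traces for failed
    // requests, and export a fixed fraction of successful ones.
    func shouldExport(isError bool, sampleRate float64) bool {
        return isError || rand.Float64() < sampleRate
    }

    func main() {
        // With a ~1% error rate and a 1% sample rate, roughly 2% of requests
        // produce a trace. If every request starts failing, shouldExport
        // returns true for all of them: about a 50x jump in trace volume.
        const sampleRate = 0.01
        exported := 0
        for i := 0; i < 1_000_000; i++ {
            isError := rand.Float64() < 0.01 // normal operation: ~1% errors
            if shouldExport(isError, sampleRate) {
                exported++
            }
        }
        fmt.Printf("exported ~%d of 1,000,000 requests\n", exported)
    }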


Sorry, been busy running around all day. Basically, what's happened for us on some very high transactions-per-second services is that we only log or trace errors, and the service basically never has errors. So imagine a service that is getting 800,000 to 3 million requests a second, happily going along basically not logging or tracing anything. Then all of a sudden a circuit opens on Redis, and for every single request that was meant to use that now-open circuit you log or trace an error. You went from a system doing basically no logging or tracing to one that is logging or tracing 800,000 to 3 million times a second.

What actually happens is you open the circuit on Redis because Redis is a little bit slow, or you're a little bit slow calling Redis, and now you're logging or tracing 100,000 times a second instead of zero. That bit of logging makes the rest of the requests slow down, and within a few seconds you're logging or tracing 3 million requests a second. You have now toppled your tracing system, your logging system, and the service that's doing the work. A death spiral ensues.

Now the systems that call this system start slowing down and start tracing or logging more, because they're also tracing or logging mainly on error. Or, sadly, you have code that assumes the tracing or logging system is always up, and that starts failing and causing errors, and you get into an extra-special death loop that can only be recovered from by not attempting to log on error during an outage like this, and you must push a fix. All of these scenarios have happened to me in production.

In general, you don't want your system to do more work in a bad state. In fact, as the AWS Well-Architected guide says, when you're overloaded or in a heavy error state you should be doing as little work as possible, so that you can recover.


We've seen problems with memory usage on failure too. The Python implementation sends data to the collector in a separate thread from the HTTP server operations. But if those exports start failing, it's configured for exponential backoff, so it can hold onto a lot of memory and start causing issues with container memory limits.


I've configured our systems to start dropping data at this point and emit an alarm metric that logging/metrics are overloaded
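
For what it's worth, a minimal sketch of that kind of load shedding (hypothetical boundedQueue type, not any particular SDK's API): a bounded, non-blocking queue in front of the exporter that drops when full and counts the drops, so an alarm can fire on the dropped-spans metric instead of the service stalling.

    package main

    import (
        "fmt"
        "sync/atomic"
    )

    // boundedQueue drops data instead of blocking the request path when the
    // exporter can't keep up; the dropped counter backs the alarm metric.
    type boundedQueue struct {
        ch      chan []byte
        dropped atomic.Int64
    }

    func newBoundedQueue(size int) *boundedQueue {
        return &boundedQueue{ch: make(chan []byte, size)}
    }

    func (q *boundedQueue) Enqueue(span []byte) {
        select {
        case q.ch <- span:
        default:
            q.dropped.Add(1) // shed load; alert on this counter
        }
    }

    func main() {
        q := newBoundedQueue(2)
        for i := 0; i < 5; i++ {
            q.Enqueue([]byte(fmt.Sprintf("span-%d", i)))
        }
        fmt.Println("dropped:", q.dropped.Load()) // prints: dropped: 3
    }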


I think what they mean is that if you provisioned your system to receive spans for 5% of non-error requests plus a few error requests, and then, through some random act of god, all requests start yielding errors, your span collector will suddenly receive spans for every request.


How do you send all errors? The way tracing works, as I understand it, is that each microservice gets a trace header which indicates if it should sample and each microservice itself records traces. If microservice A calls microservice B and B returns successfully but then A ends up erroring, how can you retroactively tell B to record the trace that it already finished making and threw away? Or do you just accept incomplete traces when there are errors?


You can do head-based sampling and tail-based sampling.

With head sampling, the first service in the request chain can make the decision about whether to trace, which can reduce tracing overhead on services further down.

With tail-based sampling, the tracing backend can make a determination about whether to persist the trace after the whole trace has been collected. This still incurs the overhead of collecting every trace, but allows you to make decisions like “always keep errors”.
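
Roughly, the tail-side decision looks at the whole finished trace, which is why it can keep service B's spans even though only service A errored. A toy sketch of that decision (hypothetical keepTrace function; in practice this runs in a collector or the vendor's backend, not hand-rolled like this):

    package main

    import (
        "fmt"
        "math/rand"
    )

    // Span is a toy stand-in for a finished span; only the fields the
    // decision needs are shown.
    type Span struct {
        Service string
        IsError bool
    }

    // keepTrace makes a tail-sampling decision over a completed trace: keep
    // every trace containing an error span, plus a baseline fraction of the rest.
    func keepTrace(trace []Span, baseRate float64) bool {
        for _, s := range trace {
            if s.IsError {
                return true
            }
        }
        return rand.Float64() < baseRate
    }

    func main() {
        trace := []Span{
            {Service: "B", IsError: false}, // B finished successfully
            {Service: "A", IsError: true},  // A errored afterwards
        }
        fmt.Println(keepTrace(trace, 0.01)) // true: A's error keeps B's span too
    }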


https://opentelemetry.io/docs/concepts/sampling/ describes it as Head/Tail sampling, but in practice with vendors I see it as Ingestion sampling and Index sampling. We send all our spans to be ingested, but have a sample rate on indexing. That allows us to override the sampling at index and force errors and other high value spans to always be indexed.


Maybe the Go client doesn't support that? https://opentelemetry.io/docs/instrumentation/go/sampling/


It does, but the docs aren't clear on that yet. TraceIdRatioBased is the "take X% of traces" sampler that all SDKs support today.
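
For reference, a minimal head-sampling setup with the Go SDK might look something like this (exporter wiring and shutdown omitted; the 5% ratio is just an example):

    package main

    import (
        "go.opentelemetry.io/otel"
        sdktrace "go.opentelemetry.io/otel/sdk/trace"
    )

    func main() {
        // Sample ~5% of new traces at the root, and follow the parent's
        // decision on child spans so sampled traces stay complete.
        tp := sdktrace.NewTracerProvider(
            sdktrace.WithSampler(
                sdktrace.ParentBased(sdktrace.TraceIDRatioBased(0.05)),
            ),
        )
        otel.SetTracerProvider(tp)
    }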


Normally yes, but we do a lot of data collection, and identifying what's an error is usually hard because of partial errors. We also care about performance per tenant and per resource, with lots of dimensionality, and sampling reduces that information for us.



