
Are you making sure that you're sampling traces, but sending over all errors?

At a former place, we were doing 5% of non-error traces.




Careful, we've had systems go down under the increased load of just emitting errors if they didn't emit much in the non-error state.


Can you go into more detail about your comment, please?


Not the GP, but:

Imagine you're sampling successful traces at, say, 1%, but sending all error traces. If your error rate is low, maybe also 1%, your trace volume will be about 2% of your overall request volume.

Then you push an update that introduces a bug and now all requests fail with an error, and all those traces get sampled. Your trace volume just increased 50x, and your infrastructure may not be prepared for that.
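
As a toy illustration in Go (hypothetical shouldExport helper, not from any tracing SDK), the policy and the amplification look like this:

    package main

    import (
        "fmt"
        "math/rand"
    )

    // Sketch of the policy described above: always export traces for failed
    // requests, and export a fixed fraction of successful ones.
    func shouldExport(isError bool, sampleRate float64) bool {
        return isError || rand.Float64() < sampleRate
    }

    func main() {
        // With a ~1% error rate and a 1% sample rate, roughly 2% of requests
        // produce a trace. If every request starts failing, shouldExport
        // returns true for all of them: about a 50x jump in trace volume.
        const sampleRate = 0.01
        exported := 0
        for i := 0; i < 1_000_000; i++ {
            isError := rand.Float64() < 0.01 // normal operation: ~1% errors
            if shouldExport(isError, sampleRate) {
                exported++
            }
        }
        fmt.Printf("exported ~%d of 1,000,000 requests\n", exported)
    }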


Sorry, been busy running around all day. Basically, what's happened for us on some very high transactions-per-second services is that we only log or trace errors, and the service basically never has errors. So imagine a service that is getting 800,000 to 3 million requests a second, happily going along basically not logging or tracing anything. Then all of a sudden a circuit opens on Redis, and for every single request that was meant to use that now-open circuit you log or trace an error. You went from a system doing basically no logging or tracing to one that is logging or tracing 800,000 to 3 million times a second.

What actually happens is you open the circuit on Redis because Redis is a little bit slow, or you're a little bit slow calling Redis, and now you're logging or tracing 100,000 times a second instead of zero. That bit of logging makes the rest of the requests slow down, and within a few seconds you're logging or tracing 3 million requests a second. You have now toppled your tracing system, your logging system, and the service that's doing the work. A death spiral ensues.

Now the systems that call this system start slowing down and start tracing or logging more, because they're also tracing or logging mainly on error. Or, sadly, you have code that assumes the tracing or logging system is always up, and that starts failing and causing errors, and you get into an extra-special death loop that can only be recovered from by not attempting to log on error during an outage like this, and you must push a fix. All of these scenarios have happened to me in production.

In general, you don't want your system to do more work in a bad state. In fact, as the AWS Well-Architected guide says, when you're overloaded or in a heavy error state you should be doing as little work as possible, so that you can recover.


We've seen problems with memory usage on failure too. The Python implementation sends data to the collector in a separate thread from the HTTP server operations. But if those exports start failing, it's configured for exponential backoff, so it can hold onto a lot of memory and start causing issues with container memory limits.


I've configured our systems to start dropping data at this point and emit an alarm metric that logging/metrics are overloaded
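
For what it's worth, a minimal sketch of that kind of load shedding (hypothetical boundedQueue type, not any particular SDK's API): a bounded, non-blocking queue in front of the exporter that drops when full and counts the drops, so an alarm can fire on the dropped-spans metric instead of the service stalling.

    package main

    import (
        "fmt"
        "sync/atomic"
    )

    // boundedQueue drops data instead of blocking the request path when the
    // exporter can't keep up; the dropped counter backs the alarm metric.
    type boundedQueue struct {
        ch      chan []byte
        dropped atomic.Int64
    }

    func newBoundedQueue(size int) *boundedQueue {
        return &boundedQueue{ch: make(chan []byte, size)}
    }

    func (q *boundedQueue) Enqueue(span []byte) {
        select {
        case q.ch <- span:
        default:
            q.dropped.Add(1) // shed load; alert on this counter
        }
    }

    func main() {
        q := newBoundedQueue(2)
        for i := 0; i < 5; i++ {
            q.Enqueue([]byte(fmt.Sprintf("span-%d", i)))
        }
        fmt.Println("dropped:", q.dropped.Load()) // prints: dropped: 3
    }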


I think what they mean is that if you provisioned your system to receive spans for 5% of non-error requests plus a few error requests, and then, through some random act of god, all requests start yielding errors, your span collector will suddenly receive spans for every request.


How do you send all errors? The way tracing works, as I understand it, is that each microservice gets a trace header which indicates if it should sample and each microservice itself records traces. If microservice A calls microservice B and B returns successfully but then A ends up erroring, how can you retroactively tell B to record the trace that it already finished making and threw away? Or do you just accept incomplete traces when there are errors?


You can do head-based sampling and tail-based sampling.

With head sampling, the first service in the request chain can make the decision about whether to trace, which can reduce tracing overhead on services further down.

With tail-based sampling, the tracing backend can make a determination about whether to persist the trace after the whole trace has been collected. This still incurs the overhead of collecting every trace, but allows you to make decisions like “always keep errors”.
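
Roughly, the tail-side decision looks at the whole finished trace, which is why it can keep service B's spans even though only service A errored. A toy sketch of that decision (hypothetical keepTrace function; in practice this runs in a collector or the vendor's backend, not hand-rolled like this):

    package main

    import (
        "fmt"
        "math/rand"
    )

    // Span is a toy stand-in for a finished span; only the fields the
    // decision needs are shown.
    type Span struct {
        Service string
        IsError bool
    }

    // keepTrace makes a tail-sampling decision over a completed trace: keep
    // every trace containing an error span, plus a baseline fraction of the rest.
    func keepTrace(trace []Span, baseRate float64) bool {
        for _, s := range trace {
            if s.IsError {
                return true
            }
        }
        return rand.Float64() < baseRate
    }

    func main() {
        trace := []Span{
            {Service: "B", IsError: false}, // B finished successfully
            {Service: "A", IsError: true},  // A errored afterwards
        }
        fmt.Println(keepTrace(trace, 0.01)) // true: A's error keeps B's span too
    }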


https://opentelemetry.io/docs/concepts/sampling/ describes it as Head/Tail sampling, but in practice with vendors I see it as Ingestion sampling and Index sampling. We send all our spans to be ingested, but have a sample rate on indexing. That allows us to override the sampling at index and force errors and other high value spans to always be indexed.


Maybe the Go client doesn't support that? https://opentelemetry.io/docs/instrumentation/go/sampling/


It does, but the docs aren't clear on that yet. TraceIdRatioBased is the "take X% of traces" sampler that all SDKs support today.
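
For reference, a minimal head-sampling setup with the Go SDK might look something like this (exporter wiring and shutdown omitted; the 5% ratio is just an example):

    package main

    import (
        "go.opentelemetry.io/otel"
        sdktrace "go.opentelemetry.io/otel/sdk/trace"
    )

    func main() {
        // Sample ~5% of new traces at the root, and follow the parent's
        // decision on child spans so sampled traces stay complete.
        tp := sdktrace.NewTracerProvider(
            sdktrace.WithSampler(
                sdktrace.ParentBased(sdktrace.TraceIDRatioBased(0.05)),
            ),
        )
        otel.SetTracerProvider(tp)
    }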


Normally yes, but we do a lot of data collection, and identifying what's an error is usually hard because of partial errors. We also care about performance per tenant and per resource, with lots of dimensionality, and sampling reduces that information for us.



