Saving Three Months of Latency with a Single OpenTelemetry Trace (checklyhq.com)
101 points by serverlessmom 5 months ago | 41 comments



On the noisy Node.js auto-instrumentation: it is indeed very noisy out of the box. A bunch of other people and I finally got the project to let you select which instrumentations to enable via configuration, which saves having to create your own tracer.ts/js file.

Here's the PR that got merged earlier in the year: https://github.com/open-telemetry/opentelemetry-js-contrib/p...

The env var config is `OTEL_NODE_ENABLED_INSTRUMENTATIONS`
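
For example (rough sketch from memory; double-check the exact short names and the register hook against the auto-instrumentations-node README):

    # Only enable the http and express instrumentations instead of the full set
    OTEL_NODE_ENABLED_INSTRUMENTATIONS="http,express" \
      node --require @opentelemetry/auto-instrumentations-node/register app.js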

Anyways, love OpenTelemetry success stories. Been working hard on it at my current company and it's already bearing fruit :)


That is awesome. I had no idea this was available as an env var. After diving into OTel for our backend, we also found some of this stuff is just too noisy. We switched it off using this code snippet, for anyone bumping into this thread:

   // getNodeAutoInstrumentations comes from @opentelemetry/auto-instrumentations-node;
   // we explicitly disable the noisiest instrumentations (fs, net, dns).
   instrumentations: [getNodeAutoInstrumentations({
     '@opentelemetry/instrumentation-fs': {
       enabled: false,
     },
     '@opentelemetry/instrumentation-net': {
       enabled: false,
     },
     '@opentelemetry/instrumentation-dns': {
       enabled: false,
     },
   })],


yeah I totally turned all of those off...way too noisy :)


Why would you disable instrumentation instead of just filtering the recorded log?

That only makes sense if the instrumentation overhead itself is significant. But for an efficient recording implementation, that should only really become a problem when your average span is ~1 µs.


Oh, simple answer: the tools you use to inspect those traces just blow up with noise. Like a trace that shows 600+ file reads that all take less than half a millisecond.

This is all noise when you are trying to debug more common issues than your FS being too slow.

+ also storage costs. Most vendors charge per MB stored or per span recorded.


That is why I mentioned post-filtering the recording as the alternative. Grab the full recording then filter to just the relevant results before inspection.
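
Concretely, something like this (just a sketch, assuming you've dumped the trace somewhere as OTLP JSON; the file names are made up and the field names follow the OTLP JSON layout, so adjust as needed):

    // drop-fs-spans.js: strip spans emitted by the fs instrumentation out of an
    // OTLP JSON dump before loading it into whatever viewer you use
    const fs = require('fs');

    const dump = JSON.parse(fs.readFileSync('trace-dump.json', 'utf8'));
    for (const resource of dump.resourceSpans ?? []) {
      for (const scope of resource.scopeSpans ?? []) {
        if (scope.scope?.name === '@opentelemetry/instrumentation-fs') {
          scope.spans = [];
        }
      }
    }
    fs.writeFileSync('trace-dump.filtered.json', JSON.stringify(dump, null, 2));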

For that matter, why are a few hundred spans a problem? Are the visualizers that poor? I usually use function tracing where hundreds of millions to billions of spans per second are the norm and there is no difficulty managing or understanding those.


In most cases these traces are shipped over the wire to a vendor, and that alone costs $$. Then, not all vendors have tail sampling as a "free" feature. So in many cases it's better to not record at all.


That sounds positively dystopian. Is it really that hard to dump to private/non-vendor storage for local analysis using your own tools?

I do not do cloud or web development, so this is just totally alien. I generate multi-gigabyte logs with billions of events for just seconds of execution and get to slice them however I want when doing performance analysis. The inability to even process your own logs seems crazy.


You can absolutely dump the traces somewhere and analyze them yourself. The problem is that this falls apart with scale. You are maybe serving thousands of requests per second. Your service has a ton of instances. Capturing all trace data for all requests from all services is just difficult. Where do you store all of it? How do you quickly find what you need? It gets very annoying very fast. When you pay a vendor, you pay them to deal with this.


This is great. Last time I tried this I couldn't even find a way to disable some of them in code.


This is so cool! I’ve had this exact problem before.


It's AWS: if you shake a stick at some network transfer optimization or storage/EBS/S3, you'll save three engineers' salaries.


This deserves an updoot. We've reached a level of scale at Checkly where all of these things start adding up. We moved workloads off of S3 to Cloudflare R2 because of this.


> We moved workloads off of S3 to Cloudflare R2 because of this.

So you moved from a mature but expensive storage solution to a younger currently subsidized storage solution? What happens when R2 jacks up pricing?


Let me nuance that a bit. 99% of our workload is write-heavy and is still on S3. We run monitoring checks that snap a screenshot and record a video, and we write those to S3. Most folks will never view any of that, as most checks pass and these artefacts only become interesting when things fail.

Enter a new product feature we launched (Visual Regression Testing), which requires us to fetch an image from storage on every "run" we do. These runs can happen every 10 seconds. This is where R2 shines: no egress cost for us. It's been rock solid and saved us about 60x compared to AWS. Still, we run most of our infra on AWS.


> currently subsidized storage solution

Interesting, do you have a source on the subsidized nature of R2?


I do not and I'm likely misusing the word subsidized.

My concern is that, as a newer product (R2 launched in 2022 [1], compared to S3 in 2006 [2]), R2 has deliberately priced itself to compete with S3's egress pricing in order to gain market share and developer mindshare. I am not confident Cloudflare will maintain this competitive pricing indefinitely, as I expect it to follow the well-established industry trend of jacking up prices once a walled garden has been sufficiently established.

Further, it's my opinion that cloud costs have grown to an absurd level as engineers and executives have made poor and frankly lazy technology choices over the last decade.

Ultimately I like Cloudflare a lot, but I think we need more discipline and lower operational overhead if we want infrastructure development to remain practical for individuals and small businesses versus mega-corps. Cloudflare, with its free pricing tiers, is often a default choice for organizations of that size, but it should not be viewed as a panacea and carries tradeoffs, as with everything in life.

[1] https://www.cloudflare.com/press-releases/2022/cloudflare-ma...

[2] https://hidekazu-konishi.com/entry/aws_history_and_timeline_...


I wish posts like this would explore the relative savings rather than the absolute. On its own, that saving doesn't really tell me much. Taken to the extreme, you could just not run the service at all and save all the time; that's a tongue-in-cheek example, but in context, is this saving a big deal, or is it just engineering looking for small efficiencies to justify their time?


I'm the author of the post. You raise a good point about relative savings. Based on last week's data, our change reduced the task time by 40ms from an average of 3440ms, and this task runs 11 million times daily. This translates to a saving of about 1% on compute.


Thanks for the follow up, sounds like a decent saving and investment of time then.


Fun fact: it probably took more time to write up and refine the blog post than it did to hunt down that sneaky 40ms savings.


True, but the value of the hunt and fix may really come from this blog post long term. Content marketing and all that.


> This translates to a saving of about 1% on compute.

Does this translate to any tangible savings? I'm not sure what the Checkly backend looks like, but if tasks are running on a cluster of hosts vs. invoked per task, it seems hard to realize savings. Even per task, 40 ms can only be realized on a service like Lambda; ECS's minimum billing unit is 1 second, afaik.


I think that's a flawed analysis. If you're running FaaS, then sure, you can fail to see any benefit from small improvements in time (AWS Lambda changed its billing resolution a few years back, but before then the Go services didn't save much money despite being faster). But if you're running thousands of requests and speeding them all up, you should be able to realize tangible compute savings whatever your platform.


Help me to understand, then. If this stuff is being done on an autoscaling cluster, I can see it, but if you are just running everything on an always-on box for instance, it is less clear to me.

edit: Do you have an affiliation with the blog? I ask because you have submitted several articles from checkly in the past.


Hey, Checkly founder here. We've changed our infra quite a bit over the last ~1 year, but it's still mostly ephemeral compute. We actually started on AWS Lambda and are now on a mix of AWS EC2 and EKS, all autoscaled per region (we run 20+ of them).

It seems tiny, but in aggregate this will have an impact on our COGS. You are correct that if we had a fixed fleet of instances, the impact would not have been super interesting.

But still, for a couple of hours spent, this saves us quite a few thousand dollars per year.


Yes I work at Checkly, though I didn’t answer authoritatively since this one wasn’t written by me!


The units seem wrong in any case. It's 3 months of compute per day, which is actually much more impressive.

If we think about the business impact, we don't usually think of compute expenditure per day, so you might reasonably say the fix saved 90 years of compute annually. Looks better in your promotion packet, too.


I often ask myself the same question. We have some user-facing queries that slow the frontend down. I've fixed some slowness, but it's definitely not a priority. I wonder how much speed improvements correlate with increased revenue from happier customers.


Bit late to the party, but companies report that webpage speed correlates with conversion. See e.g. https://www.cloudflare.com/en-gb/learning/performance/why-si... & https://www.cloudflare.com/en-gb/learning/performance/more/w...

This one is also interesting; written in 2012, it claims that Amazon could lose $1B+ from a 1-second slowdown: https://www.fastcompany.com/1825005/how-one-second-could-cos.... I imagine people are even less tolerant of slow pages today.

Fixing website performance can be one of the cheapest ways to increase conversion because it's hard to figure out what else moves the needle.


Think of this like changing the oil in your car.

Over-optimizing is not going to help you at all but if you ignore it eventually it will all seize up.

You have to keep that stuff in check.


Hey, I work at Checkly and asked my coworker (who wrote the post) to give some more background on this. I can assure you, we're busy and this was not done for some vanity prize!


I agree, but this post looks like an advertisement for the service itself.


It’s literally on the company’s blog, which is partially about promoting the company’s service. What’s the issue with that?

(Long time happy Checkly user here, the service is fantastic)


Not a problem, but the OP was asking about the savings!

I, for example, like to dig deeper into insights like the relative vs. absolute savings, to learn about the approaches other engineers take! It's all about which metrics we should care about.

(I'll put this service on my list to try someday; it does look fantastic indeed)


μs isn't picoseconds, it's microseconds, which are a million times bigger...


Thank you for pointing that out! You are correct, μs stands for microseconds, not picoseconds. I've corrected the mistake, and the update should be visible as soon as the CDN cache invalidates.


Every day I have more sympathy for the Mars Climate Orbiter team. https://science.nasa.gov/mission/mars-climate-orbiter/


Is latency the same thing as duration? I think of latency as being more like a vector-with-starting-point (a “ray segment”?) than a scalar; it’s “rooted” to a point in time, so it doesn’t make sense to sum them.


Given how frequently this thing runs, I'd say it's worth exploring moving away from Node; I don't associate Node with high performance / throughput myself.


Just a friendly call out that checkly is an awesome service.



