On the noisy Node.js auto-instrumentation: it is indeed very noisy out of the box. A bunch of other people and I finally got the project to allow you to select the instrumentations via configuration. Saves having to create your own tracer.ts/js file.
That is awesome. Had no idea this was available as an env var. After diving into OTel for our backend, we also found some of this stuff is just too noisy, so we switched it off in code; for anyone bumping into this thread, the snippet below shows the idea.
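A minimal sketch of what that can look like with the Node SDK and `@opentelemetry/auto-instrumentations-node`; the instrumentations disabled here (fs, dns, net) are just examples, so swap in whichever ones are noisy for you:

```ts
// tracer.ts -- minimal sketch: keep auto-instrumentation but turn off the noisy parts.
// The instrumentations disabled below (fs, dns, net) are examples only;
// exporter configuration is omitted for brevity.
import { NodeSDK } from '@opentelemetry/sdk-node';
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';

const sdk = new NodeSDK({
  instrumentations: [
    getNodeAutoInstrumentations({
      // Each key toggles one instrumentation package.
      '@opentelemetry/instrumentation-fs': { enabled: false },
      '@opentelemetry/instrumentation-dns': { enabled: false },
      '@opentelemetry/instrumentation-net': { enabled: false },
    }),
  ],
});

sdk.start();
```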
Why would you disable instrumentation instead of just filtering the recorded log?
That only makes sense if the instrumentation overhead itself is significant, but for an efficient recording implementation that should only really become a problem once your average span is ~1 µs.
Oh, simple answer: the tools you use to inspect those traces just blow up with noise. Like a trace that shows 600+ file reads that all take less than half a millisecond.
This is all noise when you are trying to debug more common issues than your FS being too slow.
+ also storage cost: most vendors charge by MB stored or spans recorded.
That is why I mentioned post-filtering the recording as the alternative. Grab the full recording then filter to just the relevant results before inspection.
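For instance, if the full recording is sitting on disk as an OTLP/JSON dump (a sketch only; the file names, the "fs" span-name prefix, and the 0.5 ms threshold are placeholder examples), the filtering step can be as simple as:

```ts
// filter-trace.ts -- sketch: post-filter a trace dump before inspection.
// Assumes an OTLP/JSON file (trace.json); the file names, the "fs" span-name
// prefix, and the 0.5 ms threshold are placeholder examples.
import { readFileSync, writeFileSync } from 'node:fs';

const dump = JSON.parse(readFileSync('trace.json', 'utf8'));

for (const resourceSpans of dump.resourceSpans ?? []) {
  for (const scopeSpans of resourceSpans.scopeSpans ?? []) {
    scopeSpans.spans = (scopeSpans.spans ?? []).filter((span: any) => {
      const durationMs = Number(
        BigInt(span.endTimeUnixNano) - BigInt(span.startTimeUnixNano)
      ) / 1e6;
      // Drop the noise: sub-half-millisecond file-system spans.
      return !(span.name.startsWith('fs') && durationMs < 0.5);
    });
  }
}

writeFileSync('trace-filtered.json', JSON.stringify(dump, null, 2));
```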
For that matter, why are a few hundred spans a problem? Are the visualizers that poor? I usually use function tracing where hundreds of millions to billions of spans per second are the norm and there is no difficulty managing or understanding those.
In most cases these traces are shipped over the wire to a vendor, and that alone costs $$. Also, not all vendors offer tail sampling as a "free" feature. So, in many cases it's better to not record at all.
That sounds positively dystopian. Is it really that hard to dump to private/non-vendor storage for local analysis using your own tools?
I do not do cloud or web development, so this is just totally alien. I generate multi-gigabyte logs with billions of events for just seconds of execution and get to slice them however I want when doing performance analysis. The inability to even process your own logs seems crazy.
You can absolutely dump the traces somewhere and analyze them yourself. The problem is that this falls apart with scale. You are maybe serving thousands of requests per second. Your service has a ton of instances. Capturing all trace data for all requests from all services is just difficult. Where do you store all of it? How do you quickly find what you need? It gets very annoying very fast. When you pay a vendor, you pay them to deal with this.
This deserves an updoot. We've reached a level of scale at Checkly where all of these things start adding up. We moved workloads off S3 to Cloudflare R2 because of this.
Let me nuance that a bit: 99% of our workload is write-heavy and is still on S3. We run monitoring checks that snap a screenshot and record a video, and we write those to S3. Most folks will never view any of that, as most checks pass and these artefacts only become interesting when things fail.
Enter a new product feature we launched (Visual Regression Testing), which requires us to fetch an image from storage on every "run" we do. These runs can happen every 10 seconds. This is where R2 shines: no egress cost for us. It's been rock solid and has saved us about 60x compared to AWS. Still, we run most of our infra on AWS.
I do not, and I'm likely misusing the word "subsidized".
My concern is that, as a newer product (R2 launched in 2022 [1], compared to S3 in 2006 [2]), R2 has deliberately priced itself to compete with S3's egress pricing in order to gain market share and developer mindshare. I am not confident Cloudflare will maintain this competitive pricing indefinitely, as I expect it to follow the well-established industry trend of jacking up prices once a walled garden has been sufficiently established.
Further, it's my opinion that cloud costs have grown to an absurd level as engineers and executives have made poor and frankly lazy technology choices over the last decade.
Ultimately I like Cloudflare a lot, but I think we need more discipline and lower operational overhead if we want infrastructure development to remain practical for individuals and small businesses versus mega-corps. Cloudflare, with its free pricing tiers, is often a default choice for organizations of that size, but it should not be viewed as a panacea and carries tradeoffs, as with everything in life.
I wish posts like this would explore the relative savings rather than the absolute. On its own, I don't feel like that saving tells me much; taken to the extreme, you could just not run the service at all and save everything. That's a tongue-in-cheek example, but in context: is this saving a big deal, or is it just engineering looking for small efficiencies to justify their time?
I'm the author of the post. You raise a good point about relative savings. Based on last week's data, our change reduced the task time by 40ms from an average of 3440ms, and this task runs 11 million times daily. This translates to a saving of about 1% on compute.
> This translates to a saving of about 1% on compute.
Does this translate to any tangible savings? I'm not sure what the Checkly backend looks like, but if tasks run on a cluster of hosts vs. being invoked per task, it seems hard to realize savings. Even per task, 40 ms can only be realized on a service like Lambda; ECS's minimum billing unit is 1 second afaik.
I think that's a flawed analysis. If you're running FaaS then, sure, you can fail to see a benefit from small improvements in time (AWS Lambda changed its billing resolution a few years back, but before then Go services didn't save much money despite being faster). But if you're handling thousands of requests and speeding them all up, you should be able to realize tangible compute savings whatever your platform.
Help me to understand, then. If this stuff is being done on an autoscaling cluster, I can see it, but if you are just running everything on an always-on box for instance, it is less clear to me.
edit: Do you have an affiliation with the blog? I ask because you have submitted several articles from Checkly in the past.
Hey, Checkly founder here. We changed our infra quite a bit over the last ~1 year, but it's still mostly ephemeral compute. We actually started on AWS Lambda. We are now on a mix of AWS EC2 and EKS, all autoscaled per region (we run 20+ of them).
It seems tiny, but in aggregate this will have an impact on our COGS. You are correct that if we had a fixed fleet of instances, the impact would not have been super interesting.
But still, for a couple of hours spent, this saves us a good few thousand dollars per year.
The units seem wrong in any case. It's 3 months of compute per day, which is actually much more impressive.
If we think about the business impact, we don't usually think of compute expenditure per day, so you might reasonably say the fix saved 90 years of compute annually. Looks better in your promotion packet, too.
I often ask myself the same question. We have some user-facing queries that slow the frontend down. I've fixed some slowness, but it's definitely not a priority. I wonder how strongly speed improvements correlate with increased revenue from happier customers.
Hey, I work at Checkly and asked my coworker (who wrote the post) to give some more background on this. I can assure you, we're busy and this was not done for some vanity prize!
Not a problem, but the OP is asking about the savings!
I, for example, like to dig deeper into insights like relative vs. absolute savings to learn the approaches other engineers take! It's all about which metrics we should pay attention to.
(I'll put this service on my list to try someday; it looks fantastic indeed.)
Thank you for pointing that out! You are correct, μs stands for microseconds, not picoseconds. I've corrected the mistake, and the update should be visible as soon as the CDN cache invalidates.
Is latency the same thing as duration? I think of latency as being more like a vector with a starting point (a “ray segment”?) than a scalar: it's “rooted” to a point in time, so it doesn't make sense to sum them.
Given how high-frequency this thing is, I'd say it's worth exploring a move away from Node; I don't associate Node with high performance / throughput myself.
Here's the PR that got merged earlier in the year: https://github.com/open-telemetry/opentelemetry-js-contrib/p...
The env var config is `OTEL_NODE_ENABLED_INSTRUMENTATIONS`
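For anyone wanting to try it, usage looks roughly like this (a sketch; the instrumentation names and entry point are placeholders, so check the contrib docs for the exact names your version accepts):

```sh
# Sketch: zero-code setup; the instrumentation names and entry point are examples.
# Requires the @opentelemetry/auto-instrumentations-node package to be installed.
export OTEL_NODE_ENABLED_INSTRUMENTATIONS="http,express,pg"
node --require @opentelemetry/auto-instrumentations-node/register app.js
```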
Anyway, love OpenTelemetry success stories. Been working hard on it at my current company and it's already bearing fruit :)