Datadog prices are out of this world. I assumed their public-facing prices were for the lone dev, so when I reached out as a company looking to integrate several of their products I expected a deal, and got very little compromise. I told them up front they were going to have to cut prices by 90% for us to consider them -- no budge. And their salespeople are pretty belligerent, not wanting to leave me alone. After a while I just blocked their domain in my email.
1. Use Datadog, because it gets you a bunch of stuff without having to really set it up, like anomaly detection, which is a poor man's monitoring & alerting
2. Once you start getting product-market fit and the number of instances you run grows, you notice your monthly bill getting crazier and crazier, and you now have something that starts to resemble an ops team -> migrate to a different product and set up proper monitoring
Right, the monitoring cost with Datadog for my little business would be 7 figures (and revenue is only 7 figures!!). I mean, I guess they want to exclude people like me... but then why market to me?
You should check out SigNoz (https://github.com/SigNoz/signoz) - it's an open source alternative to DataDog with metrics, traces and logs in a single application. You can self-host it or try the hosted version.
I'm not familiar with pricing but depending on alternatives maybe this isn't so bad?
> and you now have something that starts to resemble an ops team
Like, what if Datadog just replaces your ops team completely? What if we start to see AI tools that do cost a lot of money but can replace a team? Just curious.
Datadog does one part of what an ops team does; even if it could do all the functions of an ops team, I would still make the case for keeping sovereignty over fixing your own issues and staying resistant to price gouging.
(Disclaimer: I work at Chronosphere, a Datadog competitor) This is a big issue in the observability space. We have written a few blog posts on this, but basically it’s easy to fall into a trap where cardinality and high-dimensional monitoring cause your metric counts to explode, and costs skyrocket with them. You have a few experiments, are running a bunch of smaller k8s pods per cluster, and whoosh! you might be looking at millions, rather than thousands, of time series that you're sending to your provider. Most vendors won’t provide tooling or suggest ways to reduce these costs, b/c they have no economic incentive to do so… Anyway, bottom line is that no one should have to pay more to observe a service than to operate it.
Also: it’s 2023. Every company needs to be getting compatible with open standards like OpenTelemetry, Prometheus, etc.
Was Datadog pricing per-host? If so, then I guess running a Kubernetes cluster using the biggest available instances is the most "modern" solution to Datadog-using infrastructure.
It is mainly per host: $15 or $23 PCM, with the first 5-10 containers free, then $0.002 per hour (~$1.50 PCM) per container. The insights and stats you get are quite granular and valuable, however. For large-scale deployments you can ignore certain containers, etc.
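For a rough sense of how the per-host plus per-container model adds up, here's a minimal back-of-the-envelope sketch using the figures quoted above ($23/host PCM, 5 free containers per host, ~$0.002 per container-hour); the fleet sizes are made-up placeholders, not anyone's real numbers.

```python
# Back-of-the-envelope infra-monitoring cost, using the per-host and
# per-container figures quoted above. Fleet sizes below are hypothetical.
HOST_PRICE_PCM = 23.0            # $ per host per calendar month (quoted above)
FREE_CONTAINERS_PER_HOST = 5     # free allowance per host (quoted above)
CONTAINER_PRICE_PER_HOUR = 0.002 # $ per container-hour (quoted above)
HOURS_PER_MONTH = 730

def monthly_cost(hosts: int, containers_per_host: int) -> float:
    billable_containers = max(0, containers_per_host - FREE_CONTAINERS_PER_HOST) * hosts
    host_cost = hosts * HOST_PRICE_PCM
    container_cost = billable_containers * CONTAINER_PRICE_PER_HOUR * HOURS_PER_MONTH
    return host_cost + container_cost

# Example: 50 nodes each running 40 pods (hypothetical Kubernetes cluster).
print(f"${monthly_cost(50, 40):,.0f}/month")  # ~$3,705/month, before any other SKUs
```

That's just the infrastructure SKU; APM, logs, custom metrics, etc. are billed on top.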
Just wait until they find your wife's phone number. (OK, this wasn't datadog, but some similarly aggressive headhunting agency went from calling me to calling my wife - who is in no way affiliated with my company - to ask her if she could tell me about how they could help me in hiring, yadda yadda.)
I had something similar happen with a Facebook recruiter. I didn't reply to her first two emails so she started emailing my mom to try to get in touch with me. My mom called me because she thought it was a phishing email. I had never had that happen before.
I have started invoicing companies for wasting my time and also threatening them with CAN-SPAM complaints for failing to include an opt-out, an actual mailing address, etc.
Then, when my invoice isn’t paid, I threaten collections on them personally and the company. Usually that solves it. Then I’m “such a dick” but highly effective in recovering my time.
Back in the days when this kind of thing happened on the telephone, the SERIOUSLY passive aggressive trick was to talk to them, and then hang up on yourself. Repeatedly.
"Hi this is Arnie from CHewemup'n'Spitemout Staffing, is this Bob?"
"Hey Arnie, what perfect timing! I just started looking for a new opportunity, and I'm really excited to— CLICK."
Ring ring.
"This is Arnie, we seem to have been cut off."
"Oh Arnie, right, thought you hung up on me."
"No, not me, must be a bad connection. You were saying?"
"Yes, I was saying this is a great time to talk about opportunities. I just finished a major Java Enterprise JavaBeans project, and I'm— CLICK."
Lather, rinse, repeat, as the meme used to go before we called them memes.
oh man, that's great-- you could probably get through a LOT more CLICKs before they get the point. what a great use of human psychology-- they assume you have good intentions because you called them... it's a bit devious, but i'm gonna have to add that one to the toolbox hahaha.
I've worked in sales for over 10 years. Incel is a weird insult for salespeople, as I've never met a group of people that sleeps around more than salespeople. I spent years selling cars even, and if you bring a gf or wife you're having trouble with to go car shopping... the odds of her later sleeping with the car salesman are not zero, to put it mildly.
Nowadays, incel is synonymous with the Andrew Tate “top g”/alpha sex obsession. Backwards? Definitely, but more symbolic of the arrested development/lack of maturity than the lack of sex.
Sleeping with others’ significant others, or actively trying to/bragging about it is a perfect example. Low-class behavior that will get you props from low-class folks.
No one, or at least very few, was sleeping with someone's significant other. Who wants to do that and then be a sitting duck at work where the angry bf/husband can walk in and see you at any moment? When the women are ready to leave the bfs, that's when they contact the salesperson. I've seen it happen at least 10 times, 2 to me personally. Even dated that nightmare for 2 years once.
But go ahead and label everything you misconstrued as low class.
I've certainly heard that fans of Andrew Tate are incels, or that their world view and advice is built for incels, but I've not heard of many people calling Andrew Tate himself an incel.
I'm far from an incel (thank you, height and sales skills), but I've been around the 'manosphere' online ever since early "Ladder Theory" and before pickup artists learned how to internet market.
The reality is some guys are not genetically blessed, PLUS they have been lied to when it comes to what women really want. If you take an average real incel and give him advice from his mother + sister + female friends, and advice from Andrew Tate, I promise you he will get further with Andrew Tate's advice.
His mom's advice MIGHT get him a girlfriend that uses him for money, and a divorce later in life... Andrew Tate will at least tell him to get actually attractive through working out and performing at work/business.
I was never sure what exactly Datadog did, so I looked at their pricing. At first, I thought "$23/month/host isn't THAT bad...", then noticed that was only for one product.
If you used their full suite, those costs could REALLY add up.
Especially when you realize that the billing features have been turned up to 11 on turn-key rollout. Sampling set at 100% -- gosh, that's expensive right out of the box, but it's no accident.
This is the company that released synthetic API checks at like 100X the price of Pingdom checks. Ended up using their free Pingdom integration to pull in the check data..
The best way to deal with them is to never engage with their sales team. Use what you need, don’t let them talk you into more.
Every year, I get a barrage of phone calls and emails from them, and I actively choose to not engage.
Pricing has remained pretty much flat aside from the expected growth. They work hard to get you to overcommit and overpay.
Also, you don’t need all your logs indexed. Saved a company I’m under contract with a massive amount of money (10s of thousands/month) by pointing out that you should just index (sample) a percentage of them to identify if there is a trend, and you can rehydrate later.
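To make the idea concrete, here's a minimal sketch (not Datadog's actual API; the pipeline and percentage are hypothetical) of indexing only a sample of logs while archiving everything for later rehydration:

```python
import random

INDEX_SAMPLE_RATE = 0.10  # hypothetical: index 10% of logs, archive the rest

def route_log(log_line: str, index: list, archive: list) -> None:
    """Send every log to cheap archive storage; index only a sample.

    Trends still show up in the indexed sample, and the archive can be
    'rehydrated' (re-indexed) later if a full investigation is needed.
    """
    archive.append(log_line)                # cheap blob/object storage
    if random.random() < INDEX_SAMPLE_RATE:
        index.append(log_line)              # expensive, searchable index

# toy usage
index, archive = [], []
for i in range(1_000):
    route_log(f"request {i} handled", index, archive)
print(len(index), "indexed of", len(archive), "archived")
```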
Sentry is written in Python and has extremely reasonable costs (and you can even self-host it).
Honestly, if you want to complain about the cost of hosting (which we don't know if they would), then licensing the software and allowing people to self-host would be the solution.
$65M is enough that I could fund a team running Google's Monarch system for 7-8 years.
We tried Datadog, and due to how they bill, our DD bill was more than the services being monitored. (At the time, we used a lot of short-term spot instances, which billed by the second, but DD billed by the hour per instance)
We reached out to DD to try to work something out, and they agreed to cut the bill in half, but only IF we signed up for additional services.
I will never use their services again, and will always share this story when their name is brought up.
Yup, they did that to our team too. Calling my coworker during the evening, emailing me and messaging me on LinkedIn at the same time. When I said "no thanks" they replied with a message to set up a meeting. Went from a maybe someday to a never real quick. They really need to rethink their marketing.
My company came up with a structure where we have an overall quarterly revshare in place of bonuses for the entire company. It works like this: engineers get X% and sales reps get ~4x X% of the revshare pool for each team. Of each team's bonus pool, 2/3 is guaranteed, based on a few tangible factors, and the remaining 1/3 goes to individuals who have performed above & beyond.
The reasoning behind this is a few-fold, but essentially:
As an engineer (now CEO/CTO), I've hated having to wait the full year for my bonus. It's just a way to lock me in for the year when my incentive to stay should be to love the work & team. I don't want to create a place to work where you're forced to stay because of some guaranteed bonus - if you want to leave, leave & then let's hire someone who finds the work engaging + we all know performance slips as you wait for the bonus.
For the sales team, it means they're incentivized to work with the engineering & product teams to make sure they get the engineers the proper feedback in order to build a better product that they can sell more easily.
We've found this has generally built a better more team-oriented & results-oriented culture. Happy to expand but overall I think a quarterly revshare for everyone is a much better end-result (other than the fact I'm now forced to care more about making sure engineers are happy but that should be a huge focus regardless...).
Edit - also worth noting that we give everyone equity so there's still a long-term focus of building a company, not just cashing out quickly.
Google and Facebook ads support folks do the same thing. I'm guessing they have some incentive and some metrics to hit, such as the number of people they actually talk to, etc.
You can probably pervert those metrics into something that looks good in some slide decks while actually destroying your image with most potential customers.
OpenTelemetry is going to be an existential threat to DataDog and other companies that effectively rely on vendor lock-in to exploit customers. Not sure how companies rationalize these types of services at scale when there are so many open source options to run for a fraction of the cost. You could hire 100+ engineers and still save money compared to a 65M bill
OpenTelemetry (or OTel) is in no way an existential threat to DataDog. Primarily because OTel is simply the substrate/protocol by which data is collected from your apps/systems. DD does _a lot_ more than what OTel provides (RUM, SIEM, synthetics, on-call, dashboarding, anomaly detection, and much much more). If anything it's an existential threat to the more legacy vendors that aren't equipped to provide an OTel ingest layer.
OTel _does_ prevent lock-in on the agent side (making it easier to switch vendors) with open source components and consistent schemas, but OTel doesn't enable you to do anything that you couldn't do before with a specific vendor. It just empowers you to take ownership of your observability data, should you want to. Many don't, though. They want to throw money at someone else who can do it relatively well, hence the ridiculous DD pricing.
Regarding:
> Not sure how companies rationalize these types of services at scale when there are so many open source options to run for a fraction of the cost
At Coinbase's scale (which I assume is a lot due to the DD bill, but I haven't looked closely at it) the open source options simply won't cut it. Plus there are no scalable open source options for many of the things that DD does (synthetics, SIEM come to mind - not to mention onerous regulatory requirements). 65M seems like a lot, but it also means their cloud costs are insane - so maybe it lets them put focus elsewhere?
I don't know the exact history of OTel, but I agree with you. A few years ago, open source applications basically included one type of telemetry: Prometheus metrics. (Maybe Jaeger for tracing if you're really, really lucky.) What OTel provides is a way to write an open-source library that emits telemetry without dictating that the end user use a particular storage system. Basically, you can use Datadog instead of the open-source stuff, even if the library authors have never heard of Datadog.
Datadog specifically, I don't know if they care. They had an army of junior engineers that existed to hack Datadog into every open source project imaginable. If something had monitoring, they would just add their vendor-specific stuff and upstream it. That was probably expensive but probably accounts for a vast amount of their early marketshare. The other vendors wanted in on that racket without having to do too much work; OTel was born.
> OTel _does_ prevent lock-in on the agent side (making it easier to switch vendors) with open source components and consistent schemas, but OTel doesn't enable you to do anything that you couldn't do before with a specific vendor
This is exactly the reason why we are moving away from NewRelic's SDK to an OpenTelemetry SDK (even though we are still using NewRelic to ingest everything). If (more like when) we decide to switch vendors, it will be much easier to do so.
> OpenTelemetry (or OTel) is in no way an existential threat to DataDog. Primarily because OTel is simply the substrate/protocol by which data is collected from your apps/systems.
As a vendor building in this space [1] - it definitely is. We're able to onboard teams faster to do side-by-side comparisons because they can simply point their existing Otel telemetry to both us and their existing provider with just a few lines of config. That wasn't possible before otel, and levels the playing field more than before. As otel matures, it'll continue to erode against DD's position.
It also allows us as a company to focus on what users care about (as you mention that's things like dashboarding, search performance, RUM, etc.) as opposed to spending all our time building basic integrations into every platform (though we still do plenty of work to polish places where Otel hasn't). Again, levels the playing field.
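As an illustration of that "point the same telemetry at two backends" step, here's a minimal sketch using the OpenTelemetry Python SDK with two OTLP exporters; the endpoints and service name are placeholders for whichever vendors are being compared:

```python
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

provider = TracerProvider(resource=Resource.create({"service.name": "checkout"}))

# Keep sending spans to the incumbent vendor...
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="https://otlp.current-vendor.example:4317"))
)
# ...and mirror the exact same spans to the vendor being evaluated.
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="https://otlp.candidate-vendor.example:4317"))
)

trace.set_tracer_provider(provider)
tracer = trace.get_tracer(__name__)
with tracer.start_as_current_span("side-by-side-demo"):
    pass  # application code unchanged; both backends receive this span
```

In practice you'd more likely do the fan-out in an OpenTelemetry Collector config rather than in every app, but the effect is the same: the application doesn't care who consumes the data.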
So your claim is that Datadog's pricing is ridiculous and that Otel allows you to not be locked in, but somehow Otel isn't a threat to Datadog? I don't follow.
The only part of overlapping functionality between DataDog and Otel is the agent.
In theory you could use the Otel Collector (or any other Otel-compatible agent) instead of the DD Agent to collect metrics/logs/traces. This would then make it easier for you to switch from DD to another Otel-compatible provider (Grafana, for example)... but 99.9% of what DD provides is _not_ the agent, it's dashboarding, alerting, RUM, synthetics, etc.
Basically Otel has made _agent_ switching costs effectively drop to zero, but that is a very small part of the whole picture. Like I said above, this primarily hurts vendors with proprietary agents that can't/won't adopt Otel for ingesting data.
We are building SigNoz (https://github.com/SigNoz/signoz) - an open source alternative to DataDog. We are natively based on OpenTelemetry and see lots of our users very interested in that.
As mentioned in some other places in the thread, DataDog pricing is very unpredictable and high - I think more open-standards-based solutions, which give users more predictability and flexibility, are the way forward.
I’ve seen an attempted move from DD to OT and it was a nightmare of undocumented features and lots of little compounding issues. Tracing was non-functional. It doesn’t seem mature enough yet.
I am a user of New Relic. Not because I'm happy, but because OpenTelemetry doesn't come close to the same features. To top it off, OT is about 10x harder to set up, with worse documentation.
Yup, at my last company the CTO wanted to switch away from New Relic to OpenTelemetry to save money, and I can tell you we wasted months of work because OpenTelemetry is dogshit.
New Relic is really really good too, so it was even more painful.
I am not sure when you tried OpenTelemetry, but it is decently mature now, esp. for tracing. I am a maintainer at SigNoz (https://github.com/signoz/signoz) and we have good support for tracing using Otel for most of the common frameworks.
I agree it was evolving rapidly in the early days, but it's much more mature now.
We had to use otel with Datadog because Datadog did not support Elixir with an official SDK.
There are a lot of subtle issues, but we've been able to work through them to get usable traces. (Metrics and log ingestion already have pretty good existing open-source tooling, like statsd).
There's some purposeful friction with DD when it comes to OTel.
Incumbents in this space are in for a rough time as more applications provide meaningful telemetry, beyond just logs. Fortunately for them that timeline is 'fuzzy' at best.
> You could hire 100+ engineers and still save money compared to a 65M bill
I see cloud costs like this a lot and it really puzzles me. It seems like people would rather pay 10X+ more to just not have to think about it than even to hire other people to think about it, because then you have to think about hiring and HR.
"Here's a blank check. Just make it go away."
Of course corporate consultants run on that, so I guess it's not without abundant precedent elsewhere. I guess if you work for a big company with budget and it's not your money you really have little incentive not to take the easy path.
This is indeed the case and one can argue that it makes sense for a small company to focus on the MVP and initial growth. But every such decision needs to be reexamined from time to time as the company scales up. That is unpleasant, requires an expert, so many businesses procrastinate on this and make expensive mistakes. My 2c.
Operational costs can be billed to a project. It's not really that the business doesn't want to save money on this stuff, but it's much easier to wind up in this situation when the hoops for hiring some engineers are 4x as complicated as just adding another sub-org to the Datadog bill.
I would argue it's a much easier blank check to write than for consultants. Software complexity over time is a major headache and new initiatives happen all the time.
Take GDPR, for instance - at AWS it was a company-wide effort to get all the services GDPR compliant, and that resulted in basically no pricing change for consumers.
Also the fact that I can call up AWS support and have them look into a bug immediately with real devs on the other end is invaluable when my business needs rely on a certain feature working.
I was previously in sales and SaaS is considered the pinnacle of industries to be in for sales people. Medical device sales is the only real competitor when it comes to earning potential and my understanding is that's US specific and also comes with atrocious work life balance.
Back in the day I was auditing our support contracts for a place that I worked at. Basically figuring out if we were getting what we were paying for.
My favorite "overpriced support contract" was for an Oracle product. The cost of support was seven figures, and in the entire year a single phone call had been placed to support.
The best bit when I worked at one of those companies that had an expensive Oracle contract was this dynamic:
1. Can we use MySQL for <new product>?
No, use Oracle, we have a support contract
2. <oracle related issue occurs> Can we call that support contract in now?
No, let's try the inhouse expertise first.
3. <inhouse expertise comes up with barely passable hack> Can we check if they have any better solutions?
No, it's "solved" now
Like, what were we paying for? I have to assume there were per-engagement costs as well as the ongoing costs, given how hesitant our contract-owning team was to let us anywhere near Oracle.
Having recently dealt with a surprise runaway DataDog bill and personally done a lot of work to reduce our spend internally, all I have to say is that their pricing is outrageously high and there is almost no way to put controls in place to prevent overspend.
If you're considering using them, keep that in mind; tbh I would strongly recommend considering whether CloudWatch or some other cheaper alternative is suitable for your needs.
I feel nostalgic for the days of grepping logs in some ways.
At a previous company, we had DD, and I got asked to find some problem. I was able to sift through the data in it and zero in on the instance and find the bad data that came in.
Then it got expensive and so they turned on 'sampling' and the next time I was asked to look for a particular problem, we had no idea if it was even logged.
Our company helps avoid these kinds of observability bills and issues like scaling for fast-growing cloud deployments. Generally speaking, many vendors let you fall into the cardinality trap b/c they have an economic incentive to let you do so. One of our biggest selling points is that we provide an observability control plane that helps drill down into wasted queries, shows how metrics can be aggregated, and other ways of avoiding wasted cost/effort. tbh no one should have to pay more to observe a service than to operate it. Where’s the ROI in that? Another plus is that we're all in on open source instrumentation with OpenTelemetry & Prometheus so none of that annoying vendor lock-in.
None of this should be even remotely necessary. It’s like being frugal with table salt.
“We’ll show you how to make sure you don’t have even one crystal fall off the plate.”
My personal pet peeve is Azure Application Insights which uses Log Analytics under the hood… at a rate of $2.75 per ingested GB of logs stored for one month. That’s highway robbery.
Let that sink in: They charge $2,800 to store a TB of text that takes a few hundred dollars of overpriced cloud disk and maybe $10 of CPU time for the actual processing. That’s the cost of a serviceable used car or a brand new gaming PC!
But wait! There’s more.
In reality that 1 TB is column-compressed down to maybe 100 GB, making it about $30K charged per terabyte actually stored on disk.
It doesn’t stop there! Thanks to misaligned incentives, the ingested data format is fantastically inefficient JSON that re-sends static values for every metric sample collected.
Why would anyone ever bother to optimise their only revenue?!
They won’t.
The reality is that a numeric metric collected once a second (not minute!) is just 21 MB per month if stored as a simple array of 8-byte values. Most metrics are highly compressible, and that would easily pack down to 100 KB per metric per month.
A typical Windows server has about 15,000 performance metrics. We could be collecting these once a second and use a grand total of… 1.5 GB per month. That’s every metric for every process, every device, every error counter, everything.
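For anyone who wants to check the arithmetic above, a quick sketch (assuming 8-byte samples and 30-day months, and taking the ~100 KB compressed figure and the $2.75/GB rate from the comments above as given):

```python
SECONDS_PER_MONTH = 30 * 24 * 3600   # 2,592,000 samples at 1/s
BYTES_PER_SAMPLE = 8                 # one float64 per sample

raw_per_metric = SECONDS_PER_MONTH * BYTES_PER_SAMPLE
print(raw_per_metric / 1e6)          # ~20.7 MB/month uncompressed, per metric

compressed_per_metric = 100 * 1024   # ~100 KB/month, the compression claimed above
metrics_per_server = 15_000
print(metrics_per_server * compressed_per_metric / 1e9)  # ~1.5 GB/month, whole server

# And the Log Analytics rate quoted above: $2.75 per ingested GB
print(2.75 * 1024)                   # ~$2,816 per ingested TB
```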
Modern server monitoring is inefficient and overpriced by 5 orders of magnitude. It’s that simple.
The fact that your company can exist at all is a testament to that.
Totally agree about the compressibility of metrics and toying with the scraping interval. I started out working for an enterprise monitoring vendor whose proprietary agent already picked sane intervals to emit metrics; when I learned that Prometheus lets users configure that themselves, it just sounded like an expensive foot gun to me.
My real beef with metrics, at least for app-layer insights, is the waste. I'd much rather have a span/event configured with tail sampling so you can derive metrics from traces and tie them to logs in a natively contextualized way, versus having to do that correlation on the backend across different systems and query languages. That seems much more efficient and cost-effective; I'm scarred from seeing a zillion "service_name.http_response.p95.average" metrics that are IMO useless.
I’m starting to come to the same conclusion, but the point I’m making is a general one: efficient formats would allow finer grained telemetry to be collected without having to be tuned and carefully monitored.
What’s the point of a monitoring system that itself needs baby sitting?!
Folks on this thread might want to check out SigNoz (https://github.com/SigNoz/signoz). It's an open source alternative to Datadog.
I am one of the maintainers at SigNoz. We have come across many more horror stories around Datadog billing while interacting with our users.
We recently did a deep dive on pricing, and found some interesting insights on how it is priced compared to other products.
Datadog's billing has two key issues:
- Very complex SKU-based pricing, which makes it impossible to predict how much it will cost
- Custom metrics billing ($0.05 per custom metric) - we found that custom metrics can account for up to 52% of the total bill, which just does not make sense
Datadog is a pretty amazing product, and if you are careful and use it the right way, it is very powerful, and cheaper than rolling your own LGTM Grafana stack (or similar). If you are not careful, or at a decent scale, you can easily spend obscene amounts of money. The metrics pricing is completely insane, for example, and it's easy for people to emit high-cardinality metrics from apps and explode your bill. I think at that point you need to run an internal solution, and that is when it makes sense to double down on a combo of Elastic and Grafana's stack for logging, tracing, and metrics.
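To see how quickly high cardinality explodes the bill, here's a toy calculation using the ~$0.05-per-custom-metric figure from the comment above; the tag cardinalities are hypothetical:

```python
# One innocuous counter, tagged three ways (hypothetical cardinalities).
endpoints = 30       # e.g. http.route
status_codes = 10    # e.g. http.status_code
pods = 100           # pod name -- regenerates the series on every deploy!

series = endpoints * status_codes * pods   # 30,000 distinct time series
PRICE_PER_CUSTOM_METRIC = 0.05             # $/series/month, figure quoted above

print(f"{series:,} series -> ${series * PRICE_PER_CUSTOM_METRIC:,.0f}/month for one counter")
# 30,000 series -> $1,500/month, from a single instrumented counter
```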
The internet is mostly machines talking to other machines, or monitoring what other machines are monitoring. But then you can also pay for machines to watch the machines that watch the machines.
Seems like there's opportunity for competition in SaaS monitoring then. I'd imagine a few small efficient teams could beat the price on the core part of what Datadog offers. At 5,000 employees you're probably paying for corporate bloat.
There are plenty of competitors in the observability space already; what difference would one more make? The real issue is that once your company is on a platform, it's very costly to move off. The biggest consideration is that it won't be a drop-in change for all employees, so the retraining needed is substantial across all teams. Far easier to train employees in cardinality, and the cost implied by it, and to expose the cost of their particular monitoring to their teams.
This may come as a surprise but when giving money to a for-profit company, not only are you paying for corporate bloat, but you're also paying for the CEO's lavish compensation package, free lunches, and very costly health insurance for their employees. You're even paying for employee salaries while they're not doing work while on vacation!
If that's a big problem for you, Grafana may be the better product for you.
I don't consider that corporate bloat. That's just life in the software business if you want good talent.
Corporate bloat is the mini empires people build, headcount for the sake of headcount, the passion project of that guy who's been here forever that doesn't make money, process that exists because it helped someone's resume. Those kinds of inefficiencies. This stuff is different than treating employees nicely.
That's completely wild. Did Coinbase go into trace mode and log every blockchain append to Datadog? I can't even figure out how anything could cost that much.
I’m not gonna claim that Datadog is cheap, but that screams “we didn’t bother to optimize our usage.” Lots of logs, long retention maybe? Really heavy RUM with the replay feature turned on?
Depends on what you want to monitor. Grafana is pretty decent, but the real draw to Datadog is their APM stack. The UI for tracing and looking at stuff is pretty awesome.
Though you could get most things into Grafana with something like Prometheus. The problem with Grafana is understanding what the limitations are. If you're not careful with the number of panels and such it can get quite slow.
I've used Grafonnet before for doing Grafana at scale. Simply put, I hate it. Apparently an alternative is being worked on at Grafana so I'm waiting for that. But if you need to make hundreds of panels....it works well enough.
If you need to monitor some infrastructure you can just use Telegraf and output it to Grafana if needed. It kinda falls apart though because another great benefit of something like Datadog is not managing a time series db. That can get ugly real quick.
I guess it all just depends. If my bill was super high I wouldn't mind spending some resources on Prom/Grafana if you're in the Kube space or some Telegraf/InfluxDB if you're not.
I've also heard good things about Timescale but haven't used it.
> I've used Grafonnet before for doing Grafana at scale. Simply put, I hate it. Apparently an alternative is being worked on at Grafana so I'm waiting for that. But if you need to make hundreds of panels....it works well enough.
Hi, I run the Grafana team at Grafana Labs. I'd love to learn more about your Grafonnet use to help us build something better. I'm david at grafana com
Depends a lot on what sort of scale you are at, too. Grafana Cloud will be cheaper than DD but is not quite as end-user friendly.
Running it yourself is not too hard if you are not having to do clustering (say 1M metric series, 100 GB/day of logs). But different people have different comfort levels for that.
With any monitoring system most of the work is actually making use of the data. Tagging, Alerts, Dashboards and especially onboarding all the teams. You can spend a lot of time and money rolling something out and then barely anybody uses it.
(Disclaimer: I'm with Grafana.) We added a lot to our free Grafana Cloud tier so you can kick it pretty hard (and harder during the first 14 days, when everything is beefed up). The free tier comes with 3 fully managed Grafana front-end users, plus backend (with storage): 50 GB of Loki logs, 50 GB of Tempo traces, 10k monthly active series of Prometheus metrics, IRM/on-call, k6 testing hours, and other stuff too. And for the quick-start integrations we made a K8s monitoring solution with out-of-the-box dashboards, KPIs and alerts, and did the same for many others. We absolutely have more work to do in simplifying the user experience too.
Plug: If you're looking for something a bit more "few-clicks-and-you-are-up-and-running", check out OpsVerse ObserveNow: https://opsverse.io/observenow-observability/ .. Entirely powered by OSS tools, ingestion-driven pricing, and without the hassle of managing the stack and scaling up.
Best of all, can also be run entirely within your own AWS/GCP/Azure so you only pay OpsVerse for maintaining the stack based on your ingestion (and we also monitor the monitoring system for you ;))
My small team had to choose between New Relic and DD and I found New Relic's billing model to be more appetizing. It was per seat and you could switch who was in the seat. Unlimited instances and most of the features were covered under that seat besides some extra things like HIPAA / Finance related stuff. They also have "regular" users that are free that can make dashboards and such. DD drove me nuts with their crazy amount of sales calls that just seemed to balloon.
For others reading this - you can’t just switch back and forth a few times a week. A full platform user can be moved to a basic user only twice in a 12-month period.
Thanks for mentioning qryn! We are a non-corporate alternative and feature full ingestion compatibility with DataDog (including Cloudflare emitters, etc), Loki, Prometheus, Tempo, Elastic & others for both on-prem (https://qryn.dev) and Cloud (https://qryn.cloud) deployments, without the killer price tag.
Note: in qryn s3/r2 are as close to /dev/null as it gets!
I'm saying most logs are pointless to keep and would be better directed to /dev/null. Keep important transaction related logs and sample the rest.
The notion that every single log or metric across your entire technical architecture is worth keeping is one implanted by SaaS providers with a financial interest in naive engineers doing just that.
Kubernetes with Grafana is free. The combo provides logging, performance stats and graphs, and lets you auto-scale based on usage.
Unfortunately, avoiding insanely costly SaaS solutions requires engineers to plan ahead and design the entire stack on top of certain open source solutions. I suspect that many engineers today receive kickbacks from SaaS providers to lock in their employers. Employers are none the wiser and rarely push back when an engineer suggests a big-name SaaS solution with an insane lock-in factor. Nobody seems to care about lock-in these days; it's only when your costs reach almost 100 million and interest rates are going up that you start thinking "Damn, I could have had all that for free if I had planned ahead and resisted all these platform lock-ins and unnecessary proprietary tools..."
> Unfortunately, I suspect that many engineers today receive kickbacks from SaaS providers to lock-in their employers.
Cmon man, really? Drop the conspiracy theories. I've personally been the guy advocating for Datadog at 4 startups, mainly because of opportunity cost - we have 10-100 engineers, and I want them building product, not figuring out how to deploy a whole ecosystem of observability tools. IF we get big, let's reevaluate... but in the meantime.
Am I doing it wrong? If others are getting kickbacks, I want in.
The difference between Datadog and doing it yourself is that Datadog is a well-thought-through product rather than a cobbled-together set of various tools.
Having a single interface for everything makes life so much easier across a number of different teams
Search is fast and easy to use for logs and traces
Being able to see what a user actually clicked on in their session is absolutely game changing for support teams
I’m not a huge fan of the bill but it’s so much better than anything we could do ourselves without a team of engineers dedicated to observability (which would cost far more than datadog)
I do a lot of negotiating on products like this. The most I've ever gotten was a shirt and some stickers for my kid. Definitely not enough to move the needle on $250k/year deals. I feel like I'm missing out!
I love how good DataDog is. It's a great product. Too expensive though. I love most of the people I've worked with at Grafana Cloud but it's a painful product. The price makes up for it though, so we use Grafana Cloud.
We may end up with something like SigNoz when we have the cycles, but the ROI is bad when I already have twice as much work as people, and that's barely more than KTLO.
General usability. DataDog is intuitive and easy. Grafana is rough and requires a solid understanding of statistics and data analytics. The bar for using it is pretty high, so most engineers I know push it to someone else, which means there's one team doing the toil and working on creating simpler abstractions to hide the complexity.
HN has a tendency to explain every little thing with conspiracy theories. It can’t be a clear explanation based on incentives and people taking the path of least resistance, it must be malice. I’m not a psychologist, I don’t want to psychoanalyse why they think this way. But it is a bit tiring to interact with such people.
People who aren't harmed by these things don't notice them. Their incentive is to ignore as much as possible. Turning a blind eye is literally the safest option. But when you've had it rough, you literally can't stop seeing this stuff everywhere.
If you change your mindset from ignoring problems to looking for problems, you will find that there are problems everywhere. I'd rather be biased in that way than in the former. In my position, I can't afford to ignore even the tiniest problems.
> we have 10-100 engineers, I want them building product not figuring out how to deploy a whole ecosystem of observability tools. IF we get big let’s reevaluate…
Moving away from those SaaS tools can be extremely painful and a lot more costly due to vendor lock in. In practice, typically, this "let's reevaluate" time never happens.
On the other hand, I don't really care. I normally suggest open source tools, but if people want to throw money at some vendor, fine by me.
Obviously SaaS providers will not offer kickbacks to startups, the deals aren't usually big enough. I've witnessed it in a big corporation once. One of the engineers was VERY insistent on using a specific solution even though it didn't make sense technically and everyone else was against it but because they were more senior, they made the final call. If they don't get outright bribes, they will get lucrative job offers from these big companies in the future.
Imagine being the guy who convinced Coinbase to use DataDog... That person will probably end up working at DataDog sooner or later if not already there... You can bet they will be getting a very cushy salary.
I could probably make a living out of extorting corrupt engineers. It's so predictable.
And hiring someone corrupt enough to sell out their previous employer for their current one is rarely a smart move, as they are liable to do the same when angling for their NEXT job.
A lot of this discussion reminds me of the talk "Netflix built its own monitoring system - and why you probably shouldn't" (https://www.infoq.com/presentations/netflix-monitoring-syste...), where Roy Rapoport describes Netflix as "a monitoring system that happens to stream movies".
As someone who spent a few years at New Relic and Lacework, I can also say that pricing observability fairly is crazy hard when you account for different architectures, usage-based pricing, and the humans experiencing the value.
Is the speculation that this was Coinbase just based on Coinbase being a big crypto company? I see nothing in those messages that implies who the customer is (rightly so) and I am wondering if there's some other information I'm missing.
You should check out SigNoz (https://github.com/SigNoz/signoz) - it's an open source alternative to DataDog with metrics, traces and logs in a single application. You can self-host it or try the hosted version.
PS: I am one of the maintainers at SigNoz
We attempted to migrate from Datadog to Prometheus at GitHub and that stack did not cover our use case at all. So much tooling had to be recreated. I took a lot of flak when I pointed out that the numbers made sense to stay on DataDog and migrate to a Microsoft product instead, but the cost savings spoke for themselves.
Sentry is strictly code-first APM, which is only a part of what DD provides. What "APM" _is_ can get kind of blurry, but they are not direct competitors in a meaningful way.
This doesn't surprise me much. From what I've seen consulting/contracting, SaaS-based observability tends to cost 30-50% of cloud spend--EC2, storage, S3, RDS, maybe k8s, and other cloud services, or whatever the equivalent is on GCP/Azure. I wouldn't be surprised to see Coinbase with a >$150M quarterly cloud spend, so $65M on observability would make sense.
That said, managing observability yourself should result in <5% of cloud spend. So I'm figuring someone at Coinbase said "WTF" to this bill and migrated to Grafana/Loki or Kibana/OpenSearch or Kibana/Elastic. Well, that, and Coinbase's business also dropped off a cliff. Combined, I could easily see a one-time influx of $65M from one customer, gone the next quarter.
Yeah, seriously. Normally the argument is "it will require N engineers to run our own and they cost N * 250k/yr...", but for $65 million you could fund 5 Datadog competitors and still come out ahead.
What kind of scale of logs are we talking here? The company I work for runs a self-hosted Grafana LGTM stack ingesting about 1TB of logs per day; it's pretty snappy, works well enough, and only costs a few thousand dollars per month in GKE costs for the entire observability stack.
GitHub has over 21TB of source code. Applications constantly churn through this data and emit logs and events. 1TB of data by breakfast, maybe? In reality, we're not pushing logs to Datadog, just metrics and event tags. Our level of cardinality, however, requires a lot of horsepower on the backend. Our attempted Prometheus transition was just not cost-effective when trying to view large sets of data over a large-ish period of time. Combined with the heavy lift of integration (we depended heavily on dogstatsd), it just didn't seem efficient to move to Prometheus and support the infrastructure required, all while migrating to Microsoft's in-house product.
Having never heard of datadog, Wikipedia’s summary is:
> Datadog is an observability service for cloud-scale applications, providing monitoring of servers, databases, tools, and services, through a SaaS-based data analytics platform.
So it checks if your servers have crashed or slowed down with a nice dashboard?
Any better summaries or descriptions of what it does and how coinbase would have used it?
That's funny, DD is the only company to email my work email, add me as a connection on LinkedIn, text me over WhatsApp, and call my personal phone number multiple times. Amber flag after the LinkedIn connection/message, but red/purple after the WhatsApp message and call on my personal phone.
Datadog from when I was looking at them (2017ish) appeared to be an automatic version of nagios with a nice user interface and super simple client side installer.
But it was super tied to VMs at that point, and we were running a bunch of lambdas, herokus and docker instances, along with a shit tonne of AWS services, and java lumps from the 90s
Suppose you have a bunch of k8s clusters, an AWS Organization etc. You just follow a simple setup and see a nice Dashboard with practically every aspect of your infrastructure, from accounts to nodes to pods.
It ingests and can help you sort through logs, and also does performance monitoring. I guess I would describe it as Prometheus + Logstash + Power BI in a single unit.
And with a healthy dose of Blue Cross bolted on, for the surprise bills and difficult bureaucracy.
Observability is about more than crashes or slowdowns; serious investment in it is a must-have for any SaaS/cloud product that wants reliability, auto-scaling, and velocity.
My team use Grafana’s open source LGTM stack. We use Prometheus metrics to track anything from JVM/Go runtime stats, K8S metrics, saturation of CPU/memory, scalability issues, crashes/OOMs, custom metrics for business insights, debugging. We use USE/RED metrics (see: Google’s SRE handbook) to track our production services performance in an objective way. We track SLAs and SLOs so we know when it’s time to focus on features and business impact, and when it’s time to put that aside to focus on stability and maintenance before our customers notice reduced reliability.
As a developer it’s really helpful for testing changes. For example, I added a new database index in dev, then ran some load tests and checked our dashboards before and after. I look at Q95 latency of APIs and database load to see if it has the desired effect, then when I roll out to production I can monitor those same dashboards and make sure the same desired improvement can be seen for real-world usage.
I used traces recently to discover that something that should have been happening in parallel was instead happening sequentially leading to very long/timing out requests. Adding visualisations via traces helps get your head around how something is working.
I added annotations to our dashboards that shows when our K8S pods restart alongside the metrics. This made me realise that some requests were failing exactly around deployments because we were not cleanly handling SIGTERM in some services.
We have started adding horizontal auto scaling based on metrics for the number of queued messages on a specific Kafka queue. If a large number of messages are waiting we spin up more K8S replicas, and then once this reduces, we reduce the replicas to keep costs down.
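For the queue-based scaling, the decision boils down to the standard Kubernetes HPA proportional formula (desired = ceil(current * currentMetric / targetMetric), clamped to bounds). A minimal sketch of that logic with made-up numbers, not our actual configuration:

```python
import math

def desired_replicas(current_replicas: int, queue_lag: int,
                     target_lag_per_replica: int,
                     min_replicas: int = 2, max_replicas: int = 20) -> int:
    """Proportional scaling on queue depth, mirroring the HPA formula."""
    current_lag_per_replica = queue_lag / current_replicas
    desired = math.ceil(current_replicas * current_lag_per_replica / target_lag_per_replica)
    return max(min_replicas, min(max_replicas, desired))

# Hypothetical: 3 replicas, 12,000 messages waiting, target of 1,000 per replica.
print(desired_replicas(3, 12_000, 1_000))  # -> 12 replicas while the backlog is large
print(desired_replicas(12, 900, 1_000))    # -> drops back to the minimum once it drains
```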
I optimise the resource allocations on our services by looking at historical CPU/memory usage so we make the best use of our K8S cluster and avoid OOMs as we scale.
We use Loki for log querying and parsing, you can create really advanced/domain-specialised log querying dashboards and provide that to your support team, and integrate those logs with traces to debug different stages of a request as it traverses your microservices or different processing stages.
You can even build dashboards from logs, which is helpful when debugging a particular type of error over time that you were not specifically monitoring with metrics, or determine which customer(s) are affected by this error. Alternatively if you have a legacy system that does not have effective metrics, you can build metrics from its logs.
We use our metrics for alerting and paging in a way that provides a better signal-to-noise ratio than old-school alerts like “high memory usage” so people don’t get woken up as much (we’ve had zero pages since my product launched 6 months ago!). It’s better to alert only when we have a measurable impact on customer experience, like when a smoke test has failed more than 80% of the time, or HTTP requests 5xx rate is elevated to abnormal levels.
It’s also really reassuring when you do a prod rollout to easily see that stuff is still working without digging into logs, so you can spend more time coding and less time babying prod.
Overall I think having good observability is definitely a worthwhile investment, and there are cheaper ways to do it than Datadog. I expect much of the trouble is that switching providers is a huge job: we have invested so much time building our observability stack that the challenge of moving seems massive. Thankfully we picked Grafana's open source LGTM stack and self-hosted it. Even if you picked their SaaS offering, switching to open-source self-hosted is an option, so you are less tied in.
Why does the tweet author think this is wild? Sure, 65 million is a lot, but plenty of companies pay large bills for their major cloud services (especially AWS/Azure and the like).
I'm not an expert on monitoring/observability/telemetry etc, nor an expert on Datadog pricing/billing, but paying a lot of money for major infrastructure components doesn't surprise me.
Imagine you had 65 great engineers to build out your company's observability infra. They can use and contribute to OSS tools like Loki and Prometheus. They can split off small teams to build brand new infra and tools. This is THIRTEEN teams of 5 engineers!
It depends a lot on your usage pattern, but we are switching from Datadog to Grafana Cloud and are looking at about 1/3 the cost per year. This is using logs, metrics, and traces with OpenTelemetry instrumentation.
The Grafana pricing is more cleanly volume-based and disconnected from the number of "hosts", which is where Datadog really squeezes you in a Kubernetes setup with many pods.
When I worked at a place that was all-in on Azure, Application Insights was all we needed because we had no dedicated VMs, just built-in Azure services (Cosmos, queues, blob/table storage, Functions, etc.).
Not on purpose, but I could imagine someone at FTX signing up for datadog, configuring it to ingest their logs without doing any estimation or setting up any guardrails and then not checking on it because things were probably crazy over there.
Since 2022 Q1, the allowance for doubtful accounts has remained under $6 million. Bankruptcy generally triggers a writedown of any associated receivables, so Datadog appears to not have had any material exposure to FTX or any other bankrupt customer.
A previous workplace switched from this to a competitor, which had much worse graphing as far as I could tell. Seems like an important function for these tools! I wondered why they did that, I guess this is the answer.