Show HN: OneUptime – open-source Datadog Alternative

JohnMakin · on April 2, 2024

Multiple times in my career at a new job I've had to build a kind of bootstrapped OSS-based observability platform (mostly a mixture of prometheus and grafana typically) which can come with more overhead than you'd think and doesn't usually provide a lot of the analytical stuff without stitching together a bunch of things. Datadog out of the box gives you everything you can possibly want - as others have stated bills can balloon, for instance if you run spot instances or have a lot of hosts coming up/down I believe it bills you for each one even if it's short lived.

I've been working with an enterprise license for a year and while I don't really hear too much about cost, some simple considerations to the design of the infrastructure it was supporting seems to have prevented a ballooning bill (so far).

So for me, not having the engineering time or buy-in to build a whole home grown observability platform by using OSS tools like this (and all the quirks that can come with them) ends up being a lot more expensive than just sucking it up and buying an enterprise plan. At least so far.

If I had the option to do it from scratch how I wanted, with no time or budget constraints, I'd prefer of course not to be beholden to a major SaaS company that charges for ambiguous things that are hard to predict like "per host", because it's quite easy for these services to bury themselves so deep into your infrastructure that you just bite the bullet on whatever inevitable rug pull or price increase comes next. It has happened to me before managing enterprise Hashicorp Vault.

com · on April 3, 2024

Which simple considerations to the design of the infrastructure do you recommend to decrease the eventual cost of using DataDog?

jascha_eng · on April 2, 2024

I always liked Datadog as a product but it's also true that it is simply way too expensive if you don't spend significant time cost optimizing. But hosting it myself doesn't really seem like a great solution, I rather invest time in making my app robust than making my monitoring stable.

bibliotekka · on April 2, 2024

expensive and totally weird pricing structures make it hard to predict costs

bdcravens · on April 2, 2024

Haven't used their product in years due to terrible unethical sales practices, but when we did, they billed hourly while AWS billed by the second. As such, it was easy to have a monitoring bill that was much higher than the actual resources being monitored (for example, if you had a lot of instances that terminated and relaunched with short durations)

abhiyerra · on April 4, 2024

What I found with our customers is that Datadog is about 25% of an AWS bill. With all the addons, etc. I’m still unsure about how the pricing structure actually works since it seems to change from customer to customer. Seems dependent of who your sales person is.

magundu · on April 4, 2024

Few more services you can try

https://www.atatus.com https://tracetest.io/ https://uptrace.dev/

(I work for Atatus)

devneelpatel · on April 2, 2024

You dont have to host it if you dont want to, we have a SaaS service as well at https://oneuptime.com

AndyKluger · on April 2, 2024

I understand this is probably not a priority or concern at all, but FYI this is what the page looks like with uBlock Origin configured as what they call "hard mode:" https://cdn.imgchest.com/files/my2pc6adm87.mp4

devneelpatel · on April 3, 2024

Maybe its blocking tailwind files. Let me fix this. Ty for reporting.

ycombinatrix · on April 3, 2024

do you know which adlist is blocking the assets?

pranay01 · on April 2, 2024

You should also check out SigNoz [1], we are an open-core alternative to DataDog - based natively on OpenTelemetry. We also have a cloud product if you don't want to host yourself

[1] https://signoz.io

popalchemist · on April 3, 2024

Nobody wants open-core. The people you'd think would be into it want full MIT.

blitzar · on April 2, 2024

Is that all datadog is?

I read the horror stories, the monthly bills of 10's of thousands for one server and just assumed there was something more substantial to the product; like they did something groundbreaking or novel. I never cared enough to actually look and see what they did.

I use uptime-kuma - https://github.com/louislam/uptime-kuma - it obviously does a fraction of what these other things do but it does everything I need.

phillipcarter · on April 2, 2024

Just about every advertised Datadog alternative does maybe 10% of what Datadog can do, and likely has hundreds less pluggable integrations than Datadog. While it may be the case that it's overkill for a simple application, one of the biggest benefits to Datadog is that there's an integration for just about anything, and the product can go deep if you need it to.

The "omg my bill is out of control" issue is usually manifest from a few sources, one of the biggest is relying heavily on custom metrics added over time, and so you think you're paying X but really you end up paying 2-3X or more by the end of the year. But the tricky thing is, most of those things that cost a lot of money either have a lot of value, or held a lot of value at the time.

(I work for a Datadog competitor)

bdcravens · on April 2, 2024

For us it was the mismatch between AWS and Datadog billing (AWS bills by the second, Datadog bills by the hour, so you should only ever use Datadog for persistent instances, not high churn instances like dynamic background jobs, or else completely rearchitect your application for the benefit of a vendor)

mikeshi42 · on April 2, 2024

This is incredibly common - I've heard a company end up rearchitecting their instance type choices due to Datadog billing on a per-node basis (with some peak usage billing shenanigans). Their business model unfortunately encourages some very specific architectures which doesn't work for everyone.

moondev · on April 2, 2024

Before I had a chance to work with datadog, I generally operated Prometheus / Grafana as it's basically industry standard in k8s. The ability for an application to publish it's own often very detailed metrics and have those auto scrape is powerful.

Learning that datadog charges these as custom metrics was shocking. This opens a wormhole of allow-list or opt-in considerations, and then there is tag cardinality or even introducing a middleware like vector. It feels very backwards to spend effort on reducing observability.

If datadog was crap it would make things easier, but it really is a fantastic product, and a business after all. Prometheus integration is just so very cumbersome which is probably strategic I would imagine.

icelancer · on April 2, 2024

Correct. Same as New Relic.

I'd love an open source alternative. But there just isn't one for APM (which is our main use case). Nothing comes close. Every time I see "OpenTelemetry integration" I just close the page. Hours and hours of manual setup, code pushes, etc while New Relic installs once and works.

I assume it's the same for people who use DataDog begrudgingly.

sofixa · on April 2, 2024

> I'd love an open source alternative. But there just isn't one for APM (which is our main use case). Nothing comes close. Every time I see "OpenTelemetry integration" I just close the page. Hours and hours of manual setup, code pushes, etc while New Relic installs once and works.

Depending on the language/environment/framework, OpenTelemetry Autoinstrumentation just works. It's the new standard, and lots of working is ongoing to make it work for everything, everywhere, and even the big observability vendors are adopting it.

mikeshi42 · on April 2, 2024

I'm wondering when's the last time you tried OpenTelemetry - and which language it was in? I'm not going to say it's super mature (it's not) - but I think it's come a long way from being a ton of manual setup and it's more akin to SDKs available commercially. Admittedly we (HyperDX) do offer some wrapped OpenTelemetry SDKs ourselves to users to make it even easier - but I think the base Otel instrumentation is easy enough as it is.

anacrolix · on April 4, 2024

Otel is trash

phillipcarter · on April 3, 2024

FWIW most of OTel is pretty easy to use and set up too. OTel Operator over K8s that installs autoinstrumentation agents for 5 languages --> pretty easy onboarding.

jcims · on April 2, 2024

Datadog is a pretty amazing product and the folks that build it should be proud of what they have done. BUT it's extremely expensive and most people don't use all the features. It's like Splunk, 99% of people havent invested the time or energy to get the full value of the product they are paying for.

devneelpatel · on April 2, 2024

Oddly enough, this is why we started OneUptime in the first place. We were burned by the DataDog bill and wanted an open source observability platform ourselves.

flyingpenguin · on April 2, 2024

I imagine datadog's AWS bill is also out of control, considering all the absurd levels of queries/groupings you can do.

I used to work on a growing AWS product with tons of features that no one used.

Often when we were creating a feature, our managers would have us include tags and support for making parts of the feature optional, but make sure no parts of the feature (or the feature itself) where optional to start with. We would enable the ability to toggle the feature if "A significant enough amount of customers weighted by revenue requested it".

Also got the "Build filtering, but don't expose it unless we have to".

hhhhhhhmmmmmmm · on April 2, 2024

I'm a former datadog user from a series D tech scale-up, and yeah the horror stories of billing are true.

Grafana, Victoria Metrics or Honeycomb

remram · on April 2, 2024

I don't understand what this uses for a storage backend. Object storage? DBMS? Custom? I see references to Clickhouse in the repo...

devneelpatel · on April 2, 2024

Postgres for tx data. Clickhouse for storing logs / traces / metrics.

OJFord · on April 2, 2024

The docker compose file has clickhouse and postgres.

anonzzzies · on April 2, 2024

What does it use for integration/workflow; frontend seems theirs but backend seems not in the repos. I saw more solutions like this boasting ‘5000+’ integrations but I cannot find the code for that (I might have missed it)?

devneelpatel · on April 2, 2024

I believe this is what you're looking for: https://oneuptime.com/product/workflows

OneUptime is 100% open source, and always will be. It's a mono-repo and backend is divided into few services.

anonzzzies · on April 2, 2024

I guess I should get into the chat with you, but I meant; you integrate with, let’s say, Jira; there is no jira calling code in the repos, so how does that happen. Bit too in depth for here maybe.

devneelpatel · on April 2, 2024

Oh yes! there's no native Jira integration so far - but we have some customers who already integrate oneuptime with Jira. They integrate Jira through workflow webhooks in OneUptime - so when an issue is created in Jira, incident is created in OneUptime and all Status Page subscribers are notified.

anonzzzies · on April 2, 2024

Then I think the wording is a bit confusing for new clients. I expect, when you say that you can integrate, not that I have to build that myself.

Xcelerate · on April 2, 2024

Question for those in the observability space: do moment-in-time observations preserve all of the dimensions of the event, and if so, how do most observability platforms compress the high volume of (ostensibly) low-rank data?

carefulfungi · on April 2, 2024

There are three-ish strategies (usually employed in combination at scale).

Columnar databases are very good at compressing time series data which often has runs of repeating values that can be run length encoded, repeating deltas (store the delta not the full value), or common strings that can be dictionary encoded. So you can persist a lot of raw data with quite good compression and fast-scannability. Most commercial TSDBs are now backed by column stores. And several now tier with local SSDs for hot data and S3 for colder data.

If that's still too much data to store, you have to start throwing some away. Both sampling and materializing aggregates (and discarding the raw data) are popular techniques and can both be very reasonable trade offs.

withinboredom · on April 2, 2024

> low-rank data?

It's "low-rank" until that one day systems start shitting the bed, and you're trying to understand what is going on.

Xcelerate · on April 2, 2024

Sure, but that's fine. You're only collecting high-rank data for a short period of time, and the massive trove of historical data lets you identify what's causing those anomalies quickly.

remram · on April 2, 2024

How does this compare to the usual Grafana stack?

devneelpatel · on April 2, 2024

Some of the differences are:

Grafana started as visualization tool and has now decoupled multiple products for observability - LGTM stack (Loki for logs, Grafana for visualization, Tempo for traces, and Mimir for metrics). You need to configure and maintain multiple sub-products for a full-stack observability setup.

While grafana stack is great, OneUptime has all of these in one platform and makes it really simple to use.

We also are built natively on OpenTelmetry, use clickhouse for logs / metrics / traces storage - so queries are really fast.

scottfingerhut · on April 3, 2024

I would say to kick Grafana Cloud hard - the freemium has everything in it, with back ends, no need to manage anything. And with a trial period you get to really push it. The story that it's a lot to self manage is apples to oranges if you decide on a pure OSS approach vs. the easy cloud path.

sidcool · on April 2, 2024

I may be cynical here, but I find that all open source datadog alternatives are mostly frontend focussed with an out of the box database. And it does not scale well. It's not easy to maintain, scale, shard etc. Am I wrong?

P.S. I am all for OSS.

francoismassot · on April 2, 2024

Quickwit is an alternative with a strong focus on scalability (max we have seen is 40PB) with a decoupled compute and storage architecture. But we do only logs and traces for now.

Repository: https://github.com/quickwit-oss/quickwit Latest release: https://quickwit.io/blog/quickwit-0.8

mikeshi42 · on April 2, 2024

We're one of those OSS alternatives (HyperDX) - built on Clickhouse. While I can't say it's stupid simple to scale Clickhouse (because anything stateful is inherently hard to scale), it's orders of magnitude easier than other platforms like Elastic and gives you a lot more flexible tuning options (last co I was at, we ran that at massive scale, it was an absolute handful).

In theory, you can get away with just running a Clickhouse instance purely backed by S3 to get durability + scalability (at the cost of performance of course). It all depends on what scale you're running at and the HA/performance requirements you need.

swyx · on April 2, 2024

presumably that is the right place to start for a datadog competitor because ddog is not going to care about smol instances that arent at scale that they can’t charge a bajillion for

esafak · on April 2, 2024

But you have to be ready for day two. Nobody is going to pick an observability solution that does not scale. Some scalability is table stakes.

ChrisCooney · on April 3, 2024

Hey! If you’re looking for open source friendly with really straight forward cost, check out Coralogix.com.

Great features for logs, metrics & traces, total compatibility with open telemetry, cost optimization tools built in (DataDog leavers typically save around 50%), and much more!

Check out our site, and you can find me on LinkedIn (or indeed reply here !) if you want to ask further questions.

https://www.coralogix.com

markhalonen · on April 2, 2024

we've been pretty happy with just a Clickhouse DB and sending metrics directly from api servers to Clickhouse HTTP https://clickhouse.com/docs/en/interfaces/http . Hook up Grafana and you have a nice raw SQL (our team loves SQL) Grafana dashboard.

pdimitar · on April 2, 2024

I can see some commits mentioning telemetry but it's not at all mentioned on the GitHub README. Strange.

It looks solid and I'd try it if the need arises.

devneelpatel · on April 2, 2024

https://oneuptime.com/product/apm

nurettin · on April 2, 2024

I make apps call home and aggregate incoming pings in grafana, because some of them are behind a vpn.

devneelpatel · on April 3, 2024

We have Incoming Request monitor for this. OR you can deploy custom probes in your private network. https://oneuptime.com/docs/probe/custom-probe

nurettin · on April 4, 2024

Funny story. One customer insisted on not whitelisting anymore sites, so I had to send the pings to slack, and monitor slack to update grafana.

There is no "the solution" to site monitoring. Eventually someone will need scripting to bridge a gap. Preferably python.

kylegalbraith · on April 3, 2024

Do you not offer Arm images for the various services? For folks wanting to run this stack, I'd imagine some of them are interested in running on Arm for better cost optimization. Maybe it doesn't matter for how lightweight the services may be.

mosselman · on April 2, 2024

If you self-host, do you still need a paid plan? Are there any limitations?

devneelpatel · on April 3, 2024

No, and never will be.

100% of what we do is FOSS.

nodesocket · on April 2, 2024

If using the Helm chart to install, does it also automatically monitor the cluster that oneuptime is installed on? Didn't see the Kubernetes integration docs

devneelpatel · on April 3, 2024

Kuberetes integration is coming soon. Its on our roadmap.

matthewcford · on April 4, 2024

This is interesting; we've also been building an incident management tool, hoping to open-source it shortly, once it is ready.

htrp · on April 2, 2024

https://oneuptime.com/ also makes it a managed service to compete with datadog

Rapzid · on April 2, 2024

Lot of interesting OSS observability products coming out in recent years. One of the more impressive(and curious for many reasons) IMHO is OpenObserve: https://github.com/openobserve/openobserve .

As opposed to just a stack, they are implementing just about the whole backend shebang from scratch.

bbkane · on April 2, 2024

And (in a simple, non-TLS, local storage configuration), as a static binary OpenObserve is incredibly easy to install locally or with Ansible ( https://github.com/bbkane/shovel_ansible/blob/master/openobs... )

As a guy running this for personal infrastructure, I super appreciate the easy install and low system requirements

agilob · on April 2, 2024

Apache Skywalking is doing the same https://skywalking.apache.org/

esafak · on April 2, 2024

How does it compare to competitors, and what are the differences between the cloud offering and the open source version? Their web site barely mentions the open source part.

devneelpatel · on April 4, 2024

100% of what OneUptime does is FOSS and always will be. We dont believe in open-core.

mongrelion · on April 3, 2024

Thanks for sharing this.

Others have mentioned Signoz which I have tried but for my homelab it just feels like too much.

OpenObserve might be the solution that I have been looking for.

intelVISA · on April 2, 2024

eBPF unleashed a wave of these sorta weekend projects and I kinda love it even if the value prop is fairly minimal.