Launch HN: SigNoz (YC W21) – Open-source alternative to DataDog
226 points by pranay01 on Feb 9, 2021 | 83 comments
Hi HN,

Pranay and Ankit here. We’re the founders of SigNoz ( https://signoz.io ), an open source observability platform. We are building an open-core alternative to DataDog for companies that are security- and privacy-conscious and are concerned about the huge bills they pay to SaaS observability vendors.

Observability means being able to monitor your application components - from mobile and web front-ends to infrastructure - and being able to ask questions about their state: things like latency, error rates, RPS, etc. Better observability helps developers find the cause of issues in their deployed software and solve them quickly.

Ankit was leading an engineering team, where we became aware of the importance of observability in a microservices system, where each service depends on the health of multiple other services. And we saw that this problem was only getting more important, especially in today’s world of distributed systems.

The journey of SigNoz started with our own pain point. I was working at a startup in India. We didn’t use application performance monitoring (APM) tools like DataDog/NewRelic because they were very costly, though we badly needed them. We had many customers complaining about broken APIs or payments not processing - and we had to get into war-room mode to solve it. Having a good observability system would have allowed us to solve these issues much more quickly.

Not having any solution which met our needs, we set out to do something about this.

In our initial exploration, we tried setting up RED (Rate, Error and Duration) and infra metrics using Prometheus. But we soon realized that metrics can only give you an aggregate overview of systems - you also need to debug why these metrics went haywire. This led us to explore Jaeger, an open source distributed tracing system.

Key issues with Jaeger were that it has no concept of metrics, and the datastores it supports lack aggregation capabilities. For example, if you tagged spans with “customer_type: premium” for your premium customers, you couldn’t find the p99 latency they experienced through Jaeger.

We found that though there are many backend products, an open source product with a UI custom-built for observability - one that integrates metrics & traces - was missing.

Also, some folks we talked to expressed concern about sending data outside their network boundaries - and we felt that with increasing privacy regulations, this would only become more critical. We thought there was scope for an open source solution that addresses these points.

We think that currently there is a huge gap between the state of SaaS APM products and OSS products. There is scope for open-core products which are open source but also support enterprise scale and come with support and advanced features.

Some of our key features: (1) a seamless UI to track metrics and traces, (2) the ability to get metrics for business-relevant queries, e.g. latency faced by premium customers, and (3) aggregates on filtered traces.

We plan to focus next on building a native alert manager, support for custom metrics, and then logs (waiting for OpenTelemetry logs to mature more). More details about our roadmap here: https://signoz.io/docs/roadmap

Our stack is based on Golang & React. The design of SigNoz is inspired by streaming data architecture: data is ingested into Kafka, and relevant info & metadata are extracted by stream processing. Any number of processors can be built per business needs. Processed data is ingested into a real-time analytics datastore, Apache Druid, which powers aggregates and slicing and dicing of high-dimensional data. In the initial benchmarks we did for self-hosting SigNoz, we found it would be 10x more cost-effective than SaaS vendors ( https://signoz.io/blog/signoz-benchmarks/ )
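
To make the pipeline concrete, here's a minimal sketch of what one such stream processor could look like in Go, assuming the segmentio/kafka-go client. The topic name and span fields are hypothetical, for illustration only - not our actual schema:

    package main

    import (
        "context"
        "encoding/json"
        "log"

        "github.com/segmentio/kafka-go"
    )

    // Span is a simplified stand-in for the trace data extracted from Kafka.
    type Span struct {
        ServiceName  string            `json:"serviceName"`
        DurationNano int64             `json:"durationNano"`
        Tags         map[string]string `json:"tags"`
    }

    func main() {
        r := kafka.NewReader(kafka.ReaderConfig{
            Brokers: []string{"localhost:9092"},
            Topic:   "otel-spans", // hypothetical topic name
            GroupID: "span-processor",
        })
        defer r.Close()

        for {
            msg, err := r.ReadMessage(context.Background())
            if err != nil {
                log.Fatal(err)
            }
            var s Span
            if err := json.Unmarshal(msg.Value, &s); err != nil {
                log.Printf("skipping malformed span: %v", err)
                continue
            }
            // Extract whatever info & metadata the business needs here,
            // then hand the enriched span to the next stage
            // (e.g. a topic that feeds Druid ingestion).
            log.Printf("service=%s duration=%dns tags=%v",
                s.ServiceName, s.DurationNano, s.Tags)
        }
    }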

We’ve launched this repo under MIT license so any developer can use the tool. The goal is to not charge individual developers & small teams. We eventually plan on making a licensed version where we charge for features that large companies care about like advanced security, single sign-on, advanced integrations and support.

You can check out our repo at https://github.com/SigNoz/signoz . We have a ton of features in mind and would love for you to try it and let us know your feedback!




Congrats on the launch! It’s always nice to see alternatives in this space.

I just have a couple of observations:

> Industry trusted Kafka & Druid to handle enterprise scale. No scaling pains. Ever.

From my (limited) experience, Kafka and Druid are not exactly simple pieces of infrastructure for most shops, often requiring significant effort to scale and maintain.

Also, in the past I’ve had some pains supporting those self-hosting my open source projects, and just wanted to give some friendly suggestions:

- A quickstart guide plus a “Production tips” article would be really helpful for those self-hosting.

- A troubleshooting guide would help reduce common support requests.

- Creating a chat group or a forum can reduce the load as users might help each other out.

It’s mostly about small things that can help save you time and effort, while making it easier for people to adopt the project.

Besides that, I think a lot of the value DataDog provides is in the form of integrations with pretty much every other service out there. We use plenty of these at my day job and it’s particularly useful to connect PagerDuty/Slack to the monitoring system. Maybe these features would help you drive adoption over time, and enable more use cases too.


Thanks for your suggestions on better ways to support self hosting! I agree we need to do a much better job here.

We chose Kafka and Druid because: (1) any company which reaches a decent scale invariably uses some form of Kafka, and it is a trusted system that scales to huge volumes; (2) community adoption and support. When choosing a datastore, we also evaluated Apache Pinot & ClickHouse, but Druid seemed to have the best community. It was also proven at scale in places like Lyft.

I agree though that these are not simple systems, and may be too much for smaller orgs. We are also evaluating supporting simpler datastores, but that will depend on what the community demands. Our architecture is modular, so we are not strictly tied to Druid and can support other datastores if there is interest.

I agree with your point around integrations. That is one of the moats of DataDog in my opinion. Agreed on the usefulness of integrations for PagerDuty/Slack. I have added an issue for this - https://github.com/SigNoz/signoz/issues/21#issue-804860212

Though we are hoping that, being an open source project, our community will be able to create integrations. I've answered this in more detail in another comment - https://news.ycombinator.com/item?id=26080530


What was wrong with ClickHouse?


Nothing wrong there. If enough users want it, we can add ClickHouse too.


First of all: I love the idea, effort and everything in general. So take my comment lightly.

Datadog is big because they shipped a gazillion integrations across I don't know how many products.

Anytime I see an "alternative to Datadog" I think: so you are going to have an agent and an integrations page that covers everything from HAProxy to Kafka to the full AWS and Azure APIs, etc. etc.?


Thanks. You make an interesting point. That is actually something we are constantly asked by our users. I guess our approach would be to prioritise development of integrations based on community demand, and as we mature as a project, we'd hopefully have integrations contributed by the community as well.

Though one thing which is making things a bit simpler for us is the increasing maturity of OpenTelemetry ( https://opentelemetry.io/ ). It is an instrumentation standard which supports many languages and frameworks, so by supporting OpenTelemetry we get instrumentation for many languages and frameworks in one go.
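
To give a flavour of what that instrumentation looks like, here's a minimal sketch in Go, assuming the otel-go API; the tracer name and attributes are illustrative, not anything SigNoz-specific:

    package payments

    import (
        "context"

        "go.opentelemetry.io/otel"
        "go.opentelemetry.io/otel/attribute"
    )

    func ProcessPayment(ctx context.Context) error {
        // "payment-service" is a hypothetical instrumentation name.
        tracer := otel.Tracer("payment-service")
        ctx, span := tracer.Start(ctx, "ProcessPayment")
        defer span.End()

        // Tags like this one are what enable backend queries such as
        // "p99 latency faced by premium customers".
        span.SetAttributes(attribute.String("customer_type", "premium"))

        // ... do the actual work, passing ctx to downstream calls so
        // the trace propagates ...
        _ = ctx
        return nil
    }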


Kafka and Druid are expensive and complicated components to run for an open core biz trying to help people save money against Datadog. This guarantees a lot of people won't take advantage of the "open" part of your open core but maybe that doesn't matter for your business anyway.

I can't speak to Druid, but I'm always puzzled when I hear about Kafka being used for metrics. Most metrics are timestamped and can be aggregated in ways that tolerate out-of-order delivery.

It's true that people are being gouged on storage markup by monitoring companies, but I don't think this particular approach is the solution. Obsessing over storage and querying costs isn't a good starting point for a startup, so maybe driving good habits (stop collecting so much junk, keep it around for less time, etc.) is a better route to help people save quiche. Either way, good luck!


I’d totally agree that the operating costs (speaking of eng time, which is more expensive than machines) of Kafka+ZK alone are quite high.

If a company is not already using Kafka, they wouldn’t want to maintain it “just” to have a self hosted APM system.

If I could make a recommendation to the developers of this system, it would be to focus on the interface with the streaming platform before the Kafka implementation that backs it.

Ideally, one should be able to plug in and out the queueing system of preference.

This will help adoption, and avoid coupling the success of your project to the implementation and success of Kafka.
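
To make that concrete, a minimal sketch of what such an interface could look like in Go (all names here are hypothetical, not SigNoz's actual code):

    package queue

    import "context"

    // Message is the unit of data flowing through the pipeline.
    type Message struct {
        Key   []byte
        Value []byte
    }

    // Queue abstracts the streaming backend so that Kafka, RabbitMQ,
    // SQS, or an in-memory implementation can be swapped in by
    // configuration.
    type Queue interface {
        // Publish appends a message to the named topic/stream.
        Publish(ctx context.Context, topic string, msg Message) error

        // Subscribe delivers messages from the topic to the handler
        // until ctx is cancelled.
        Subscribe(ctx context.Context, topic string, handler func(Message) error) error

        // Close releases any underlying connections.
        Close() error
    }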


Even if you maintain Kafka for business logic, you don't want to run your observability in the same Kafka cluster, because then when the business Kafka goes down, how will you debug it?


Correct, ideally the monitoring stack should be outside the blast radius of other applications. Would handling another Kafka cluster (probably smaller than the business Kafka) be a pain for a team that already manages one business Kafka? What do you think?


I completely agree with you. For companies not already using Kafka, this asks for a big commitment to self-host Kafka.

You mentioned a great approach. Queueing system as a plugin. Thanks


Kafka has a terrible reputation, but once you get familiar with it, it only occasionally lives up to that reputation, and oftentimes outperforms expectations by quite a bit.


I was pretty surprised to see the results too. A single-node Kafka with a 2GB Xmx value was ingesting 4,500 events/sec (around 1MB/s) on a single partition.

I blogged my experiments with SigNoz's scale at https://signoz.io/blog/signoz-benchmarks/. Hoping to get better at fine-tuning configs and to blog more.


I think the concerns raised in this thread are less regarding raw throughput, and more about (1) the complexity of the typical production Kafka deployment (2) the arguably unnecessary, highly complex ecosystem around Kafka that you have to pay people or companies to use effectively, (3) the history of problems regarding data loss with ZK/Kafka, caused by leadership election bugs.


Exactly right. In my personal experience, Kafka's reputation for data loss and other mishaps is well-earned. Some of these issues are well explained by the Jepsen tests.


Hmm, I get your point. I searched for Kafka alternatives for a bit before including it in our stack, though I couldn't find anything as widely adopted. It would be good to know a few Kafka alternatives you prefer which can handle equivalent production scale.


I agree with my sibling comment, and reiterate my cousin comment that you've replied to (commenting here to complete this sub-tree).

Queuing technologies will come and go; IMO it's better to focus on the interface and allow people to swap in whatever implementation they prefer and are accustomed to. It also benefits you in the long term, because an application that is less coupled to a particular external dependency will be easier to test.

Some examples of queuing tech that's deployed successfully at scale: Redis Streams, RabbitMQ, Amazon's SQS. Since this is written in Go, you could even offer an in-memory, channel-oriented stream implementation, with no external dependencies.

Not one of these is universally better than Kafka: each offers a set of trade-offs, but a very similar interface from SigNoz's point of view.

For SigNoz's hosted/tenant-based solution, it might absolutely make more sense to use Kafka. But self-hosted users bring different trade-offs to the table, and might prefer to use another solution.

Strategically, you can write/maintain the plugin for Kafka (very similar to how you operate right now, except it leaves the door open to more plugins existing in the future), and encourage community contributions for other tech. Or, when you're big enough, you might want to employ people to maintain those plugins too, since they're good for adoption.


Really liked the way you put things so clearly. Thanks for these inputs and suggestions, will definitely think harder on this.


Have an interface for a queuing system and support other things, not just Kafka. Ideally, you want a default/dev instance to ship with something super simple, zero setup and maybe in-memory - but allowing you to swap in Kafka or something more capable as needed.


That's an interesting point. Curious - would you use a project which supports a simple/in-memory datastore but nothing that would be useful in a production environment? Do you think that being easy to get running and set up in a dev environment is valuable for adoption - even if it won't work in prod?

I am trying to understand - what would be a good way to prioritise.


> Do you think that being easy to get running and set up in a dev environment is valuable for adoption

Yes, it'll make a big difference to adoption. If step one of your setup instructions is "provision a Kafka cluster", then you are going to lose 90% of people right there.

Ideally, your dev install is super simple and has a built-in in-memory queue thing. The key here is to make it as simple as possible to get started. Once people have tried it out and become invested in you, then you can say "for production scale, use Kafka instead of FastQ/SimpleQue/Whatever".

The key to that second step is to have your product abstract the queue functionality it needs into an interface that it uses to talk to the queue - allowing people to swap out queue backends with a simple configuration change.

So, make it simple to get started - and simple to scale up when you decide to.
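
For illustration, a zero-setup in-memory backend along those lines can be tiny - a Go sketch, assuming a Message type and Queue interface like the one sketched upthread:

    package queue

    import (
        "context"
        "sync"
    )

    // MemQueue is a zero-setup, in-memory queue backed by Go channels -
    // fine for dev/demo installs, not for production durability.
    type MemQueue struct {
        mu     sync.Mutex
        topics map[string]chan Message
    }

    func NewMemQueue() *MemQueue {
        return &MemQueue{topics: make(map[string]chan Message)}
    }

    func (q *MemQueue) topic(name string) chan Message {
        q.mu.Lock()
        defer q.mu.Unlock()
        if _, ok := q.topics[name]; !ok {
            q.topics[name] = make(chan Message, 1024) // arbitrary buffer size
        }
        return q.topics[name]
    }

    func (q *MemQueue) Publish(ctx context.Context, topic string, msg Message) error {
        select {
        case q.topic(topic) <- msg:
            return nil
        case <-ctx.Done():
            return ctx.Err()
        }
    }

    func (q *MemQueue) Subscribe(ctx context.Context, topic string, handler func(Message) error) error {
        ch := q.topic(topic)
        for {
            select {
            case msg := <-ch:
                if err := handler(msg); err != nil {
                    return err
                }
            case <-ctx.Done():
                return ctx.Err()
            }
        }
    }

    func (q *MemQueue) Close() error { return nil }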


Hmm, got your point. We shall definitely look into integrating other queuing systems behind an interface. Trying to understand better - what does a super simple dev setup look like (to get the adoption)?

Right now, we can run SigNoz with all components, including Kafka and Druid, in 4GB of memory, supporting around 200 events/sec. Though we'll need to check whether this micro setup survives a run of a few days.


What you have now isn't too bad, but figure out if you can get it down to one single command. Have a look at what netdata does (https://learn.netdata.cloud/docs/agent/packaging/installer#a...) - this is a single command to install, works really well and is super quick to get started on a single node.


Redis is like this. It runs in-memory by default, but can be trivially configured to write to disk / be persistent between server sessions. Many credit this feature as a catalyst to its adoption.


AWS's Kinesis, GCP's PubSub, whatever Azure has.


Azure has Azure Service Bus (a "full" messaging system, with subscriptions, topics, routing, AMQP 1.0 etc), and also Azure Storage Queues (a very simple queueing system where a client polls a single queue for messages).


Thanks, but won't restricting to a specific cloud service reduce adoption?


I was just outlining what the messaging options on Azure were, for the parent poster.


Guess you were luckier? YMMV; I've found Kafka generally lives up to its terrible reputation -- and even when it doesn't, it's all somehow more difficult than its initial appeal. I certainly agree with others that the inclusion of Kafka in an open-source package like this would discourage me from using it.


I don't think it was luck, I just continued to learn about the mistakes I was making, until I resolved the problems people typically bail upon encountering.


Curious, is there an alternative to Kafka which would be easier for you to adopt?



We had checked out both these projects. Our view was that RedPanda was still an early project (~1.5K stars), Pulsar was very similar to Kafka, and Kafka was better known than Pulsar.


Some options:
- https://pulsar.apache.org/
- the many systems based on https://en.wikipedia.org/wiki/Advanced_Message_Queuing_Proto...
- cloud-specific queues (SNS, Kinesis, et al.)


Kafka does not fill the same use cases as these suggestions; this may be part of the trouble you were experiencing!

My use of Kafka was as a "system of record", and attaching connectors to create views into the data from there.

I could replay a Kafka topic into a MongoDB, run some analysis, and destroy the MongoDB instance.


TameAntelope - sorry, I should've phrased it differently: not "lucky" but more skillful, as you say.


I should say I had a team backing me up, and I lost many nights (and a handful of weekends) because Kafka didn't do what we expected.

The journey to feeling good with Kafka was difficult, but I was too stubborn to let us give up. :)


Nice thoughts. A few other users also pointed this out.

We observed that enterprises and observability SaaS vendors have scripts and controllers to keep these components running. We plan to open-source those too. As you rightly pointed out, running OSS takes man-hours, and we will try to remove those frictions.

Also, when working with Prometheus and Jaeger, we observed that people end up having to use Kafka to handle scale anyway; most OSS tools are good at the start but become pretty complicated at scale. E.g., Prometheus's long-term storage solution is Cortex, which is itself difficult to manage. In that case, Kafka should be an easier beast to handle than the many moving components inside Cortex. We built SigNoz as a scalable alternative inspired by stream processing architecture.

We will also be providing sampling strategies, including tail-based sampling, to retain important data without unnecessarily clogging disks.


I'm a committer on Apache Druid and generally a big fan of observability. I'm glad that you found Druid useful in building this!

A tip, if you aren't already doing it: with metric and trace data, it helps a ton to set up partitioning and sorting according to the query patterns you expect. Timeseries databases usually do this out of the box, because they can make assumptions about your query patterns, but general purpose databases like Druid usually need an extra step or two. Some references:

https://druid.apache.org/docs/latest/ingestion/index.html#pa...

https://twitter.com/gianmerlino/status/1287134114844270592


Thanks for the tips. Agreed, we need to fine-tune our Druid setup and make it more performant. If it's OK, can I reach out to you via Twitter DM to get some specific advice?


How is this different from Opstrace?

https://news.ycombinator.com/item?id=25991485


As we understand, both of us are taking a very different approach. Opstrace is removing the operational burden of running existing open source projects like Cortex & Loki, while we are building a new observability platform including the UI.

We are focusing on making the experience seamless, like existing SaaS tools, rather than stitching together disparate tools. We are more focused on observability with traces rather than only metrics and logs - and support things like custom aggregates on traces. We believe that going from metrics to traces to find the exact root cause will be increasingly important.


I guess YC is diversifying their bets :-)


I think PostHog is a more relevant comparison: https://posthog.com


Yes, PostHog is one of the projects we really like.

They are also taking a similar approach of providing great open source alternative to existing SaaS tools.

Though we are in very different domains - PostHog primarily deals with product analytics, while we focus more on application monitoring: finding the latency of your deployed applications, error rates in APIs, etc.

Our product will be useful for devops engineers, while PostHog is for product managers & digital marketing managers.


PostHog looks like an OSS Mixpanel


nit: public github repo != f/oss.


Actually we do have a fully FOSS version: https://github.com/PostHog/posthog-foss

Main repo contains some enterprise code.


This is awesome, can't wait to try it out. I run engineering for a healthcare startup, where HIPAA requirements prevent us from using many SaaS products since we cannot risk PII leakage to a third party. Assuming this works for us, if you set up GitHub sponsorship, it would take me 15 seconds to convince management to financially support your project.


Thanks!

This is exactly the sort of use case we had in mind. Would love to work closely with you to help in any way. If possible, can you drop me a note at pranay at signoz dot io


Any thoughts around the viability of open-core/self-hosted monitoring, when that also entails bringing the burden of scaling/monitoring your monitoring solutions in-house?

Do you feel that the market of people who need DataDog or Splunk or Lightstep for their scale but can't afford it is large enough to sustain this model? Or is this targeted at smaller shops where cost overrides other concerns?


Great question! Our aim is to make self-hosting monitoring/observability systems so simple that people would prefer it to sending everything to SaaS vendors. We think the current open source solutions are disparate systems (like Prom, Jaeger), and that's what makes them difficult to manage. Of course, scaling Kafka/Druid is also not trivial, but this can be managed by providing better scripts and controllers to handle the complexity.

The market we are primarily targeting is customers who see that they are paying a huge (storage) price to Datadog/Lightstep and would prefer to have things in-house. Self-hosting also becomes more important for users who prefer that data not leave their network boundaries - whether due to privacy or security concerns.


"We found that though there are many backend products - an open source product with UI custom-built for observability, which integrates metrics & traces, was missing."

Ummm... https://grafana.com/products/cloud/features/


+1 for Grafana together with Loki and Prometheus. This saved us so much trouble. We thought about using DD first, but the costs were so opaque.


Grafana Tempo for the traces looks interesting too! Especially now that it doesn't require Elastic anymore


Curious, do you self-host Prom+Loki?


I do it with my clients, but we're not using a lot of clusters.


Grafana has long been used to monitor time-series data, and has recently been moving towards observability (including traces and logs). We are different on quite a few fronts.

1. There are observability-specific UI widgets like service maps, SLOs, and error budgets; I don't know whether Grafana provides these now. Also, last I used Grafana, linking and moving from one dashboard to another was still a pain. You can get a better idea of how different an observability UI can be from Grafana by looking at the Lightstep demo.

2. We can run aggregates on filtered traces. E.g., I can get the 99th percentile response time for a tag, say payment_channel. I'm afraid this can't be extracted from traces by Grafana. (A sketch of such a query follows below.)

3. SigNoz is easily extensible - you can add your own stream processing application to slice and dice data your own way.
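
For instance, the kind of query in (2) can be issued against Druid's SQL endpoint roughly like this - a sketch only; the datasource and column names are hypothetical, and APPROX_QUANTILE_DS needs the druid-datasketches extension loaded:

    package main

    import (
        "bytes"
        "encoding/json"
        "fmt"
        "io"
        "log"
        "net/http"
    )

    func main() {
        // p99 latency for one value of a span tag, e.g. payment_channel.
        q := map[string]string{
            "query": `SELECT APPROX_QUANTILE_DS("durationNano", 0.99) AS p99
                      FROM "signoz_spans"
                      WHERE "payment_channel" = 'wallet'`,
        }
        body, _ := json.Marshal(q)

        // Druid brokers expose SQL at /druid/v2/sql (default port 8082).
        resp, err := http.Post("http://localhost:8082/druid/v2/sql",
            "application/json", bytes.NewReader(body))
        if err != nil {
            log.Fatal(err)
        }
        defer resp.Body.Close()

        out, _ := io.ReadAll(resp.Body)
        fmt.Println(string(out))
    }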


Looks great! I'll set up an instance and play with it this week to take a look :)

Only minor nitpick: your README first describes deploying on Kubernetes in the "getting started" section and then links to the Docker deployment guide in the documentation section. An overview saying "you can deploy on Docker or Kubernetes", with subsections/links for each, would be great, especially since it would immediately show that you don't need a full k8s cluster to get started.


Cool. Hit me up on our slack community if you face any issues.

Good point regarding README. We certainly need to do a better job at it. Will update it soon.


How do you compare to opstrace that has also launched recently?

https://news.ycombinator.com/item?id=25991485

Another comparison I'm interested in is Microsoft's Application Insights. What is your value prop over their offering?


Re: Opstrace - as I understand, they are taking a very different approach from us. I've answered this in an earlier comment - https://news.ycombinator.com/item?id=26079637

Regarding Application Insights, I have not used the product, so I don't have much idea about its detailed features. But generally, application monitoring tools provided by cloud vendors like MSFT, AMZN, etc. are very tied to that particular cloud - and are not as advanced as independent APM products like DataDog. Also, some users prefer to keep monitoring independent of cloud vendors so that it's easier to change cloud vendors or run a multi-cloud strategy.


Best of fortune to the SigNoz team! This seems like an area where many different solutions are being tried and maybe this will be the right choice for some.

Here is the CNCF Landscape for observability products like the compared DataDog. Many of the products listed are partial components that would go into an overall solution (e.g. Beats or Grafana) or are specific to a particular cloud (e.g. Amazon CloudWatch covers AWS or on-prem).

https://landscape.cncf.io/card-mode?category=monitoring&grou...


Thanks! We are aware of the CNCF landscape. As you mentioned, most of the products are point solutions which need to be combined to build an end-to-end solution. What we found was that combining different point-solution tools can be non-trivial, and sometimes they don't talk well with each other - making correlation across tools difficult.

For example: which traces were responsible when the p99 latency of a service crossed a threshold? This would be non-trivial to answer if traces and metrics are in different systems. And that's why solutions like DataDog are popular - they provide a single pane of view. Our motivation is to make such a 'single pane of view' tool in open source.


I think what is missing is a tool that suggests how to set hardware parameters (RAM, CPU) and configuration settings (number of workers, etc.) based on usage metrics, and tells you when you need to scale servers.


Great point. To start off, we shall provide different hardware configs - micro, small, medium, large, xlarge - with the scale each can handle.

We soon plan to emit metrics from different components of SigNoz and set up autoscaling for them. Druid has already put some thought into autoscaling. Check out https://druid.apache.org/docs/latest/configuration/index.htm... and https://www.adaltas.com/en/2019/07/16/auto-scaling-druid-wit...


That's what monitors are for - at least in Datadog that's how it would work. "Tell me when available workers drop below x."


Yeah, but it doesn’t tell you how to set Postgres parameters or Rails server workers or optimal thread size based on usage metrics. There are tons of parameters to configure.


Agree - as Ankit mentioned, there's lots more that can be done here.


Given recent events around licensing, such as Elastic moving to the SSPL, choosing an MIT licence is certainly bold! Do you require a CLA from contributors?


Thanks. No, we don't require a CLA from contributors. Though honestly speaking, we haven't given a lot of thought to CLAs, as we are still a pretty young project.


Off-topic but notice you’re a Loom user from the demo video you created. Just wanted to say thank you for recording with us! (co-founder)


Loom is awesome!


Maybe a basic question: I have hundreds of VMs in GCP that run Jupyter notebooks, and currently Google Cloud monitors memory and CPU for me. Can I use your application to monitor the Jupyter service and Docker inside my VMs? What is the benefit over custom Google Cloud monitoring metrics?


Not super familiar with Jupyter, but it seems like a web application on top of Python. Currently you are only getting infra metrics like memory & CPU. With SigNoz, you can also get application metrics - like p90 latency, which endpoints are slowest (if Jupyter exposes different endpoints), error rates, etc.


TIL what APM stands for.

I've been perusing this space for a while and landed on swagger-stats for monitoring my API but it leaves a lot to be desired. Looking forward to trying signoz.

Good luck!


Thanks, feel free to reach out to us if you have any concerns or want to discuss in general. We are always eager to learn and help.


What are your plans on supporting open telemetry?


We do support OpenTelemetry. Our current instrumentation instructions are based on OpenTelemetry, and our stack also uses the OTel Collector.


what about an ELK stack?


Hey, I am one of the maintainers of SigNoz. ELK is tightly coupled to Elasticsearch, which may not be the ideal database for handling OpenTelemetry data. We wanted to be more of a platform where we can provide different DBs as plugins. Users can also build their own use cases by adding more stream processing applications.

On the other hand, Druid powers analytical queries on data and is efficient in handling high-dimensional data. Many companies use Druid at scale (https://druid.apache.org/druid-powered).

Also, Jaeger, a distributed tracing tool, provides plugins for Cassandra, Elasticsearch, Badger, etc. Some users found limitations in running fast aggregations of filtered traces. With Druid we can now search by annotations (without needing the service name) and get aggregates on filtered traces, like p99 for a version=xyz filter.


Leaving a note here so I can come back and try this out. I was recently looking at some new monitoring services, so I'd like to try this and see how it goes.


Thanks. Let us know if you need any help setting things up - just ping us on our Slack community.



