Show HN: Zebrium – ML that catches software incidents and shows you root cause (zebrium.com)
76 points by stochastimus 21 days ago | 30 comments

Congrats on the launch. Having worked at a startup in the AIOps space, I can offer a few suggestions.

1. No matter how good your AI, it will make mistakes. Users need ways to provide feedback or filtering to avoid bad alert fatigue. Giving users a sense of control is critical.

2. Most larger shops will have dozens of monitoring tools already generating alerts. Consider ingesting existing alerts as another algorithmic signal.

3. The real root cause of an incident often won't show up in logs. Don't assume that the earliest event in a cluster is causal.

4. The more context you can provide an operator looking at a potential incident, the better. Modern SIEM tools do an OK job here. Consider pulling in topology or other enrichment sources and matching entity names/IDs to log data.

Good luck. Contact in profile if you'd like to chat further.

csears, founder here - could not agree more with your comments.

1. We learnt early to make user feedback easy (and immediately actionable) - users can quickly "like", "mute" and "spam", or go more granular if needed.

2. This is an insightful comment. A few of our early users gave us similar feedback, and we've been hard at work. We'll soon be releasing a mode that takes an incident signal from your incident management tool, such as PagerDuty or even Slack (often people create a Slack workspace per incident), and constructs a report around it.

3 & 4 are good points as well. We don't disagree about enrichment; we just need to stage things.

Hey folks,

Larry, Ajay and Rod here!

We're excited to share Zebrium's autonomous incident detection software. Zebrium uses unsupervised machine learning to detect software incidents and show you root cause. It's built to catch even "Unknown Unknowns" (problems you don't have alert rules built for), the FIRST time you hit them. We believe autonomous incident detection is an important tool for defeating complexity and crushing resolution time.

* Get Started *

1) Go to our website and click "Get Started Free". Enter your name and email, and set a password.

2) Install our collectors from a list of supported platforms. For K8s it's a one-command install. Join the newly created private Slack channel for alerts (or add a webhook for your own).

3) That's it. Automatic incident detection starts within an hour and quickly gets good. You can drill down into logs & metrics for more context if needed.

Getting started takes less than 2 minutes. It's free for 30 days with larger limits and then free forever for up to 500MB/day.

* Here's what you WON'T have to do *

Manual training, code changes, connectors, parsers, configuration, waiting, hunting, searching, alert rules, etc! It works with any app or stack.

* How It Works *

We structure all the logs and metrics we collect in-line at ingest, leverage this structure to find normal and anomalous patterns of activity, then use a point process model to identify unusually correlated anomalies (across different data streams) to auto-detect incidents and find the relevant root-cause indicators. Experience with over a thousand real-world incidents across over a hundred stacks has confirmed that software behaves in certain fundamental ways when it breaks.

It turns out that we can detect important incidents automatically, with a root-cause indicator when the logs and metrics reflect it. Zebrium works well enough that our own team relies on it, and we believe you'll want to use it, too.

Do you already have data models and ML to handle these tasks today, or are you still building data models with the aid of the clients who sign up for the service?

If you have a functioning platform today, what percentage of events end up needing human escalation?

I'm asking these questions given the early stage of your product.

Hi, one of the founders here: The service is designed to not require specific data models, because that does not scale, nor does it keep up with changes in application behavior. Instead, the ML engine learns data structures, normal behavior of logs and metrics, and normal correlations between them for each app deployment on the fly. Then when things break it does a very good job of generating incidents. We make user feedback easy, so if we are "over-eager" in detecting a certain kind of incident, your response trains the ML quickly. We do improve the ML engine with experience of course (and have added some user controls), but now have dozens of applications using us, and cumulatively have over a thousand successfully detected incidents under our belts.

Hi eganist,

The tasks the SW handles are (1) detecting when it looks like an incident should be raised, (2) gathering up all the evidence around that incident, and (3) notifying the user via Slack or other webhook. The purposes are to (a) detect unknown unknowns for which you won't have an alert rule built, and (b) reduce MTTR by having pulled together evidence of impact and root-cause into an incident report.

Re: human involvement: a human still needs to review and potentially act on the incident report. The idea is that we've (1) alerted you to an incident and (2) given you a great summary / starting point. You can provide feedback on the report to let the system know which sorts of incidents were good / ok / lame, and the system will refine future incidents based on this simple feedback mechanism.

Re data/data models: there are no app-specific / stack-specific / user-specific rules built in, and each user's dataset is learned/structured independently. As an example, suppose you are running postgres. We will learn that logstream structure from scratch as it comes in; there is no code that looks for the word "postgres" or "replication", for example, nor is there a built-in understanding of the timestamp formats supported by postgres. The structuring is done de novo for each log stream, anomalies are detected in very generic ways, and the model that decides what rises to the level of an incident is abstract and works the same for any app, also without special rules.

This design is quite intentional: there won't be any pre-built rules or connectors for your application, for example, and autonomous incident detection only works if it can grok an arbitrary stack OOTB.

Here's a link to a blog that shows how the system works on a few sorts of incidents, although the UI is much prettier now with charts instead of just text in the incident report:


I hope this has answered your questions!


What does unsupervised learning mean in detail?

Is this a deep learning or a classical machine learning approach?

There are three components to Zebrium that involve what you'd think of as ML. We use a few different "classical" techniques together; our focus is on keeping costs down while providing useful results on day one, even on a newly-deployed custom application.

First is the structuring of logs: we have a four-stage pipeline for structuring, and each stage has greater importance depending on how many ground-truth instances there are of a given event type in the dataset (these are unlabeled, of course). These stages include heuristics, reachability clustering, a naive Bayes classifier with a global fitness function, and a modified LCS. When the data comes in it is laid down in tables with typed columns directly, without post-processing; later, table merges are considered asynchronously. This lets you start doing anomaly detection really well right away.
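As a rough illustration of what log structuring produces, here is a toy sketch that masks digit-bearing tokens to form event-type templates. The real four-stage pipeline (heuristics, clustering, naive Bayes, modified LCS) is far more involved; the log lines below are made up:

```python
import re
from collections import defaultdict

def template_of(line):
    # Replace tokens that look variable (anything containing a digit)
    # with a wildcard, leaving the fixed words as the event "shape".
    return " ".join("<*>" if re.search(r"\d", t) else t
                    for t in line.split())

def structure(lines):
    # Group raw log lines into event types keyed by their template,
    # like laying them down in typed tables.
    tables = defaultdict(list)
    for line in lines:
        tables[template_of(line)].append(line)
    return tables

logs = [
    "conn accepted from 10.0.0.1 port 5432",
    "conn accepted from 10.0.0.9 port 5432",
    "replication lag 14s on standby",
]
tables = structure(logs)
# two event types: both "conn accepted" lines collapse to one template
```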

Next is anomaly detection. There are lots of dimensions we consider for AD; at the end of the day most of them boil down to some reflection of either "badness" or "rareness". We'd rather catch too many anomalies than miss one. This AD is run on both logs and metrics.
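A minimal sketch of a "rareness" dimension, assuming events have already been structured into types. The negative-log-frequency scoring here is my illustration, not necessarily Zebrium's actual formula:

```python
import math
from collections import Counter

def rareness(events):
    # Score each event type by -log of its relative frequency:
    # rare event types score high, routine ones score near zero.
    counts = Counter(events)
    total = sum(counts.values())
    return {e: -math.log(c / total) for e, c in counts.items()}

events = ["heartbeat"] * 99 + ["disk_error"]
scores = rareness(events)
# "disk_error" (1 in 100) scores far above "heartbeat" (99 in 100)
```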

Finally, there's incident detection: here we look primarily at how much independence there is between different streams w.r.t. their anomalies, and when we see an unusually high correlation across channels, we raise an incident. Here, a naive Bayes model is used to set cutoffs, with streams of anomalies considered as point processes. You can provide feedback here for training, but it is optional.
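The cross-stream intuition can be sketched without the point-process machinery: count how many independent streams have an anomaly near the same instant, and treat many streams firing together as an incident signal. Stream names and timestamps below are made up:

```python
def correlated_anomalies(streams, t, window=60.0):
    # Count how many streams have at least one anomaly within
    # `window` seconds of time t. If streams were independent,
    # simultaneous anomalies across many of them would be unlikely.
    return sum(any(abs(a - t) <= window for a in anomalies)
               for anomalies in streams.values())

streams = {
    "api.log":  [100.0, 5000.0],   # anomaly timestamps (seconds)
    "db.log":   [103.0],
    "node.cpu": [98.0, 7200.0],
}
burst = correlated_anomalies(streams, t=100.0)   # all three streams fire
lone  = correlated_anomalies(streams, t=5000.0)  # only one stream fires
```

In the real system the cutoff between "burst" and "lone" would come from the learned model rather than a fixed count.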

Thanks for the answer, I really prefer clear statements about the methods being used over marketing-like loose labels like "ML".

One question:

Where is the systematic evidence that this product actually works? What's the general false positive and false negative rates in standard setup? Did you construct various failed environments and measure the quality of the reports? For this sort of thing I would expect a simulation of at least 10-20 failure environments with detailed false positive/false negative rate measurements. Right now you have a lot of cherry picked examples without any sort of systematic setup (in particular, you don't seem to talk about false positives anywhere).

Yeah, it's an interesting problem, bootstrapping such a thing.

When we started out, we took a stab at a model, and then collected about 50 incidents from about a dozen stacks. These were actually not of our creation, but from real-world application stacks, where the owners gave us the data and permission to use it. It was painstaking, but we were able to gather from them what constituted a valid root-cause indicator from their perspective, for each incident, and what did not.

So we collected these datasets and put them in what we call "the dojo". Then, we ran our software against it. We achieved about a 2/3 recall rate on detection+root-cause... meaning, a detection did not count as a true positive unless we also caught a root-cause indicator and put it into the incident report.
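That strict scoring rule - a detection only counts if the report also contains the root-cause indicator - can be made concrete. The tallies below are hypothetical, not Zebrium's actual dojo numbers:

```python
# Each entry: did we raise an incident, and did the report
# include a valid root-cause indicator? (hypothetical outcomes)
results = [
    {"detected": True,  "root_cause_in_report": True},   # true positive
    {"detected": True,  "root_cause_in_report": False},  # counts as a miss
    {"detected": False, "root_cause_in_report": False},  # miss
]

true_positives = sum(r["detected"] and r["root_cause_in_report"]
                     for r in results)
recall = true_positives / len(results)
# only 1 of 3 incidents counts, even though 2 were detected
```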

At that point, we reviewed what appeared to be systematic ways we were missing things, and improved the software substantially. This improvement process took us well north of 90% recall on this original dataset. But of course we had not set aside a validation set that we refused to look at, so I don't know what the real recall should have been at that point - and I don't care. The data set was small enough that it was more important to learn everything possible and get it implemented, so we could get happy users, so we could get MORE data.

Getting such a well-curated dataset again is prohibitive in terms of effort at our size, and the kind of feedback we now collect automatically has huge noise built in. Some users hit "Like" when they're happy with an incident, and some complain when they're not, but most times no feedback is given at the incident level. Sometimes the feedback you get is they sign up, or they keep using you, or they pay you, or they leave. I think at scale we will have enough Like/Mute/Spam feedback at the incident level to get meaningful systematic answers from it; but we're not there yet, because this sort of feedback is generally sparse.

Regarding false positives - here, I think, what matters is this: each user has a finite amount of bandwidth / tolerance for noise. The important thing is that we not exceed that, while missing as little as possible. If they have a small environment and get three false positives and one true positive in a week, they may be happy. If they have a larger environment and get 30 false positives and 10 true positives in a week, they may be less happy; they might have been able to tolerate only 10 false positives before exceeding their bandwidth.
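The point can be put sharply: both shops in the example above see identical precision, yet only one stays within its noise budget. A tiny sketch (the "budget" framing and numbers are illustrative, not a Zebrium metric):

```python
def within_noise_budget(tp, fp, max_fp_per_week):
    # A user tolerates at most max_fp_per_week false positives,
    # regardless of how many true positives they also receive.
    return fp <= max_fp_per_week

small_shop = within_noise_budget(tp=1,  fp=3,  max_fp_per_week=10)  # happy
large_shop = within_noise_budget(tp=10, fp=30, max_fp_per_week=10)  # not happy

precision_small = 1 / (1 + 3)     # 0.25
precision_large = 10 / (10 + 30)  # 0.25 - same precision, different outcome
```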

Thanks for the question, I could go on all day, and you've made me realize that this might make for a really interesting blog post, where I go into more detail still. I think there's a continuum in this sort of bootstrapping where you start with an art and end with a science, and do a little bit of each as you go through this awkward transition. Hit me up by email if you'd like to continue the conversation, and maybe let me pick your brain: larry@zebrium.com

As a founder in the monitoring space, and now heading up the core monitoring team at Netflix, I had a chance to work with Zebrium and have to say the technology is impressive. Unlike other anomaly detection services, they've done a lot of work to get decent incidents without too much noise completely unsupervised - this is definitely the next generation of observability and Zebrium has a clear head start in this space!

Just ran through your intro video. If it really does what it says, this is a great product. I'll have my team test this tomorrow. Good luck on your launch.

thanks Samdung. Appreciate it, and any feedback once you try it.

This is a game changer. I've met the team and they've got something special here.

You can see one of their talks and a great discussion at a BayLISA.org meeting.


Cool! How does your software compare to other similar tools like BigPanda, Moogsoft, Splunk ITSI?

Hi paridso, we tackle the problem at a more foundational level. AIOps tools are designed to speed up resolution and reduce noise, but they act on a feed of alerts/incidents. So in a sense they are dependent on the quality of alerts generated by monitoring tools. And typically a human will end up drilling down into data like logs and metrics to determine root cause. Our ML acts on the raw data, and has better coverage than typical log/metrics monitoring tools (including previously unknown failure modes). It also cuts time to root cause by generating complete incident summaries.

Very cool, would love something like this. Your video gives a fairly straightforward incident response which traditional tools would work equally well on. Can you describe a situation that Zebrium does better than legacy tools? Perhaps a hypothetical unknown unknown.

Hi zumachase,

There are a few testimonials on the website, but there are plenty of other proof points we can't attribute. Off the top of my head, here are a few that stand out:

1.) A latent LDAP server issue that would have taken down a mission-critical SaaS app at a Fortune 500 enterprise SW company. Detected and showed root-cause indicators.

2.) Two production bugs that were degrading service for a subset of users for weeks in a multi-billion-$ B2B SaaS company's production deployment. Detected and showed root-cause indicators.

3.) Multiple backend bugs degrading service in a $1B e-commerce company's production deployment. Detected and showed root-cause indicators.

4.) All OpenEBS issues that had been observed YTD in real customer deployments, replicated using Litmus by MayaData. Detected and showed root-cause indicators.

5.) Here's an unsolicited quote a devops consultant from the UK posted in our community 4 months ago:

"The data has started coming through and has picked up all the incidents I deliberately caused and a couple of other that I didn't know about. This setup so cuts through the noise of logs to the heart of the matter that it would not be over stating the case to say that this is the future of Observability. Brilliant!"

Hi, these two blogs list some scenarios. The first has more detail because we replicated these in-house and can share full details: https://www.zebrium.com/blog/is-autonomous-monitoring-the-an.... https://www.zebrium.com/blog/beyond-anomaly-detection-how-in...

And although this isn't quite answering your question about unknown unknowns, this open source project lets you inject failure modes using a chaos tool (Litmus) on your own app. We had really good results catching application incidents created by these chaos tests. https://github.com/zebrium/zebrium-kubernetes-demo

Nice website, folks. The 2-minute intro video does a great job presenting the value-prop. It looks like the solution detects events with a high probability of being a problem automatically via ML. Can I define my own events using custom condition criteria as well?

founder here: you certainly can. Our goal is to minimize this need for you, but any team with experience already has some problem signatures/alerts for known issues, and we've tried to make it easy to capture those. Our ML helps with even this chore in one way: if you're building a signature that relies on a log event, normally this is done with regexes, but you're at the mercy of a developer not changing syntax. Our ML will track these and ensure the signatures don't break if the log format changes in a future rev.
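The regex fragility being described is easy to demonstrate. A minimal sketch (the log wording and regex are hypothetical, not taken from Zebrium):

```python
import re

# A brittle regex signature, tied to one exact log syntax.
sig = re.compile(r"replication lag (\d+)s")

old = "replication lag 14s"
new = "replication lag: 14 seconds"  # a later release reworded the message

fires_on_old = sig.search(old) is not None
fires_on_new = sig.search(new) is not None
# the regex fires on the old wording but silently stops on the new one
```

Binding the signature to the learned event type, rather than to the raw text, is what keeps it firing across wording changes.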

Congrats on the launch, and good luck! Looks fascinating. Looking forward to the future release that fixes the incidents as well, and just notifies us afterwards as a courtesy. =)

:) founder here: well, if there is a runbook for a known failure, we can trigger it via webhook. But auto-remediation for a previously unknown failure is of course a much different beast. Ambition for the future...

Looks really promising. Congrats on the launch. Any plans to integrate with Datadog? (or just make the transition / co-existence easier)

hi gingerlime, thank you. One thing we're working on is taking an incident signal from other tools and augmenting it with an incident report. We're starting with PagerDuty and Slack integrations as sources, but could see extending that to other APM/monitoring tools (Datadog could fit here). Of course we do have some overlap in the latter case (for logs & metrics). If you have something more specific in mind, let's connect. There's more detail on the above-mentioned plan here: https://www.zebrium.com/blog/youve-nailed-incident-detection...

Thanks for sharing the details. Looks interesting. I guess if we get Datadog alerts into Slack and Zebrium listening on the same channel, then we can achieve something similar to what you described for PagerDuty, right? Definitely sounds interesting. I didn't think about that aspect.

My question was more on the integration side. We already send logs and metrics to Datadog, so if we want to add Zebrium into the equation then we need to also send those there. I was wondering if some kind of integration would allow Zebrium to consume logs/metrics from Datadog, or just to make the integration easier. Just a thought.

In any case, I'm definitely curious to take Zebrium for a spin :)

Right - we will consume alerts from other tools. I see your point about consolidating collection. We'll look for opportunities like this if we can. For now, we do try to make it easy and lightweight to set up our collectors. And many of our users do have multiple collectors/agents on the same clusters. Please contact us if you'd like to give it a try.

Had them at my meetup yesterday and they presented. Super interesting tool. Zero config!
