1. No matter how good your AI, it will make mistakes. Users need ways to provide feedback or filtering so they can avoid fatigue from bad alerts. Giving users a sense of control is critical.
2. Most larger shops will have dozens of monitoring tools already generating alerts. Consider ingesting existing alerts as another algorithmic signal.
3. The real root cause of an incident often won't show up in logs. Don't assume that the earliest event in a cluster is causal.
4. The more context you can provide an operator looking at a potential incident, the better. Modern SIEM tools do an ok job here. Consider pulling in topology or other enrichment sources and matching entity names/IDs to log data.
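For point 4, here's a minimal sketch of the kind of enrichment join meant there, assuming a hypothetical topology map keyed by pod name (the entities and field names are made up for illustration, not any particular tool's schema):

    # Illustrative only: enrich a parsed log record with topology metadata
    # by matching the pod/entity name that appears in the log line.
    topology = {
        # hypothetical inventory, e.g. pulled from the K8s API or a CMDB
        "checkout-7f9c": {"node": "node-3", "service": "checkout", "team": "payments"},
        "postgres-0":    {"node": "node-1", "service": "orders-db", "team": "platform"},
    }

    def enrich(log_record: dict) -> dict:
        """Attach node/service/team context so the operator sees more than a bare line."""
        meta = topology.get(log_record.get("pod"), {})
        return {**log_record, **meta}

    print(enrich({"pod": "postgres-0", "msg": "replication lag 42s"}))
    # {'pod': 'postgres-0', 'msg': 'replication lag 42s', 'node': 'node-1', ...}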
Good luck. Contact in profile if you'd like to chat further.
Larry, Ajay and Rod here!
We're excited to share Zebrium's autonomous incident detection software. Zebrium uses unsupervised machine learning to detect software incidents and show you the root cause. It's built to catch even "Unknown Unknowns" (problems you don't have alert rules built for), the FIRST time you hit them. We believe autonomous incident detection is an important tool for defeating complexity and crushing resolution time.
* Get Started *
1) Go to our website and click "Get Started Free". Enter your name, email and set a password.
2) Install our collectors from a list of supported platforms. For K8s it's a one-command install. Join the newly created private Slack channel for alerts (or add a webhook of your own).
3) That's it. Automatic incident detection starts within an hour and quickly gets good. You can drill down into logs & metrics for more context if needed.
Getting started takes less than 2 minutes. It's free for 30 days with larger limits and then free forever for up to 500MB/day.
* Here's what you WON'T have to do *
Manual training, code changes, connectors, parsers, configuration, waiting, hunting, searching, alert rules, etc! It works with any app or stack.
* How It Works *
We structure all the logs and metrics we collect in-line at ingest and leverage this structure to find normal and anomalous patterns of activity. We then use a point process model to identify unusually correlated anomalies across different data streams, auto-detect incidents, and surface the relevant root-cause indicators. Experience with over a thousand real-world incidents across over a hundred stacks has confirmed that software behaves in certain fundamental ways when it breaks.
It turns out that we can detect important incidents automatically, with a root-cause indicator when the logs and metrics reflect it. Zebrium works well enough that our own team relies on it, and we believe you'll want to use it, too.
If you have a functioning platform today, what percentage of events end up needing human escalation?
I'm asking these questions with the early stage of your product in mind.
The tasks the SW handles are (1) detecting when it looks like an incident should be raised, (2) gathering up all the evidence around that incident, and (3) notifying the user via Slack or other webhook. The purposes are to (a) detect unknown unknowns for which you won't have an alert rule built, and (b) reduce MTTR by having pulled together evidence of impact and root-cause into an incident report.
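To make step (3) concrete, here's a minimal sketch of pushing an incident summary to a Slack incoming webhook; the incident fields are made up for the example, and only Slack's standard "text" payload key is assumed:

    import json
    import urllib.request

    # Illustrative only: post an incident summary to a Slack incoming webhook.
    SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/XXX/YYY/ZZZ"  # placeholder

    def notify(incident: dict) -> None:
        text = (f"Incident detected at {incident['start']}\n"
                f"Impacted streams: {', '.join(incident['streams'])}\n"
                f"Likely root-cause indicator: {incident['root_cause_hint']}")
        req = urllib.request.Request(
            SLACK_WEBHOOK_URL,
            data=json.dumps({"text": text}).encode(),
            headers={"Content-Type": "application/json"},
        )
        urllib.request.urlopen(req)  # fire-and-forget for the sketch

    if "XXX" not in SLACK_WEBHOOK_URL:  # only send once a real webhook URL is set
        notify({
            "start": "2020-06-08T14:02Z",
            "streams": ["postgres", "api-gateway"],
            "root_cause_hint": "FATAL: could not extend file: No space left on device",
        })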
Re: human involvement: a human still needs to review and potentially act on the incident report. The idea is that we've (a) alerted on an incident and (b) given you a great summary / starting point.
You can provide feedback on the report to let the system know which sorts of incidents were good / ok / lame, and the system will refine future incidents based on this simple feedback mechanism.
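As a toy illustration (not our actual model), simple feedback like this can be folded in by nudging the score a candidate incident must reach before it gets reported:

    # Illustrative only: tune reporting sensitivity from Like/Mute-style feedback.
    class FeedbackTunedThreshold:
        def __init__(self, threshold: float = 0.5, step: float = 0.05):
            self.threshold = threshold   # minimum incident score that gets reported
            self.step = step

        def record_feedback(self, verdict: str) -> None:
            if verdict == "like":              # good catch: be a bit more sensitive
                self.threshold = max(0.05, self.threshold - self.step)
            elif verdict in ("mute", "spam"):  # noisy: require stronger evidence
                self.threshold = min(0.95, self.threshold + self.step)

        def should_report(self, incident_score: float) -> bool:
            return incident_score >= self.threshold

    tuner = FeedbackTunedThreshold()
    tuner.record_feedback("mute")
    print(tuner.should_report(0.52))  # False: one "mute" has raised the bar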
Re data/data models: there are no app-specific / stack-specific / user-specific rules built-in, and each user's dataset is learned/structured independently. As an example, suppose you are running postgres. We will learn that logstream structure from scratch as it comes in; there is no code that looks for the word "postgres" or "replication", for example, nor is there a built-in understanding of the timestamp formats supported by postgres.
The structuring is done de novo for each log stream, anomalies are detected in very generic ways, and the model that decides what rises to the level of an incident is abstract and works the same for any app, also without special rules.
This design is quite intentional: there won't be any pre-built rules or connectors for your application, for example, and autonomous incident detection only works if it can grok an arbitrary stack OOTB.
Here's a link to a blog that shows how the system works on a few sorts of incidents, although the UI is much prettier now with charts instead of just text in the incident report:
I hope this has answered your questions!
What does unsupervised learning mean in detail?
Is this a deep learning or a classical machine learning approach?
First is the structuring of logs: we have a four-stage pipeline for structuring, and each stage carries more or less weight depending on how many ground-truth instances there are of a given event type in the dataset (these are unlabeled, of course). The stages include heuristics, reachability clustering, a naive Bayes classifier with a global fitness function, and a modified LCS. When the data comes in, it is laid down directly in tables with typed columns, without post-processing; table merges are considered later, asynchronously. This lets you start doing anomaly detection really well right away.
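To give a feel for what "tables with typed columns" means (this is a toy sketch, not the four-stage pipeline itself), imagine collapsing each line into a constant skeleton plus typed variables:

    import re
    from collections import defaultdict

    # Toy sketch only: split each log line into a constant skeleton (the event
    # type) and typed variable columns. The real pipeline described above is
    # far more involved; this just shows the shape of the output.
    TYPED_TOKENS = [
        ("ts",  re.compile(r"^\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}")),
        ("ip",  re.compile(r"^\d{1,3}(\.\d{1,3}){3}$")),
        ("num", re.compile(r"^\d+(\.\d+)?$")),
    ]

    def structure(line: str):
        skeleton, values = [], []
        for token in line.split():
            for name, pattern in TYPED_TOKENS:
                if pattern.match(token):
                    skeleton.append(f"<{name}>")
                    values.append((name, token))
                    break
            else:
                skeleton.append(token)   # constant word: part of the event type
        return " ".join(skeleton), values

    tables = defaultdict(list)           # event type -> rows of typed columns
    for line in [
        "2020-06-08T14:02:11 replication lag 42 seconds on 10.0.0.7",
        "2020-06-08T14:02:13 replication lag 97 seconds on 10.0.0.9",
    ]:
        etype, row = structure(line)
        tables[etype].append(row)

    print(list(tables))  # one learned event type; two rows of ts/num/ip values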
Next is anomaly detection. There are lots of dimensions we consider for AD; at the end of the day most of them boil down to some reflection of either "badness" or "rareness". We'd rather catch too many anomalies than miss one. This AD is run on both logs and metrics.
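As a rough illustration of those two axes (with arbitrary weights, not our production scoring), you can think of it like this:

    import math
    from collections import Counter

    # Toy sketch only: score an event type by "rareness" (how seldom it occurs)
    # plus a crude "badness" cue from severity-like words.
    BAD_WORDS = {"error", "fatal", "panic", "timeout", "refused"}

    def anomaly_score(event_type: str, counts: Counter, total: int) -> float:
        rareness = -math.log((counts[event_type] + 1) / (total + 1))  # rarer -> larger
        badness = sum(w in event_type.lower() for w in BAD_WORDS)
        return rareness + 2.0 * badness

    counts = Counter({
        "<ts> replication lag <num> seconds on <ip>": 9800,
        "<ts> FATAL: could not extend file <path>": 2,
    })
    total = sum(counts.values())
    for etype in counts:
        print(f"{anomaly_score(etype, counts, total):6.2f}  {etype}")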
Finally, there's incident detection: here we look primarily at how much independence there is between different streams w.r.t. their anomalies, and when we see unusually high correlation across channels, we raise an incident. Here, a naive Bayes model is used to set cutoffs, with streams of anomalies treated as point processes. You can provide feedback here for training, but it is optional.
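A toy version of the cross-stream idea (a simple Poisson coincidence test standing in for the actual naive Bayes / point process model) looks like this:

    import math

    # Toy sketch only: if each stream is anomalous independently at a low rate,
    # many streams being anomalous in the same short window is improbable, so
    # we flag an incident. The real model is naive Bayes over point processes.
    def poisson_tail(k: int, lam: float) -> float:
        """P[X >= k] for X ~ Poisson(lam)."""
        return 1.0 - sum(math.exp(-lam) * lam**i / math.factorial(i) for i in range(k))

    def incident_in_window(anomalous_streams: int, per_stream_rate: float,
                           n_streams: int, alpha: float = 1e-4) -> bool:
        expected = per_stream_rate * n_streams   # mean anomalous streams per window
        return poisson_tail(anomalous_streams, expected) < alpha

    # 40 streams, each anomalous in ~1% of windows; 7 at once is very unlikely.
    print(incident_in_window(anomalous_streams=7, per_stream_rate=0.01, n_streams=40))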
Where is the systematic evidence that this product actually works? What are the general false positive and false negative rates in a standard setup? Did you construct various failed environments and measure the quality of the reports? For this sort of thing I would expect a simulation of at least 10-20 failure environments with detailed false positive/false negative rate measurements. Right now you have a lot of cherry-picked examples without any sort of systematic setup (in particular, you don't seem to talk about false positives anywhere).
When we started out, we took a stab at a model, and then collected about 50 incidents from about a dozen stacks. These were actually not of our creation, but from real-world application stacks, where the owners gave us the data and permission to use it. It was painstaking, but we were able to gather from them what comprised a valid root-cause indicator from their perspective, for each incident, and what did not.
So we collected these datasets and put them in what we call "the dojo". Then, we ran our software against it. We achieved about a 2/3 recall rate on detection+root-cause... meaning, a detection did not count as a true positive unless we also caught a root-cause indicator and put it into the incident report.
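For clarity, the counting rule is the one sketched below (the dataset values are made up; our real dojo runs were against the curated incidents described above):

    # Illustrative only: recall where a detection counts as a true positive
    # only if the curated root-cause indicator also appears in the report.
    def recall(incidents, reports):
        hits = 0
        for incident_id, root_cause_line in incidents:
            report = reports.get(incident_id, "")
            if report and root_cause_line in report:
                hits += 1
        return hits / len(incidents)

    incidents = [("i1", "No space left on device"),
                 ("i2", "connection refused"),
                 ("i3", "OOMKilled")]
    reports = {"i1": "disk full: No space left on device", "i3": "pod restarted"}
    print(recall(incidents, reports))  # 1/3: i3 was detected but missed the root cause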
At that point, we reviewed what appeared to be systematic ways we were missing things, and improved the software substantially. This improvement process took us to well north of 90% recall on this original dataset. But of course we had not set aside a held-out validation set, so I don't know what the real recall would have been at that point - and I don't care. The data set was small enough that it was more important to learn everything possible and get it implemented, so we could get happy users, so we could get MORE data.
Getting such a well-curated dataset again is prohibitive in terms of effort at our size, and the kind of feedback we now collect automatically has huge noise built in. Some users hit "Like" when they're happy with an incident, and some complain when they're not, but most times no feedback is given at the incident level. Sometimes the feedback you get is they sign up, or they keep using you, or they pay you, or they leave. I think at scale we will have enough Like/Mute/Spam feedback at the incident level to get meaningful systematic answers from it; but we're not there yet, because this sort of feedback is generally sparse.
Regarding false positives - here, I think, what matters is this: each user has a finite amount of bandwidth / tolerance for noise. The important thing is that we not exceed that, while missing as little as possible. If they have a small environment and get three false positives and one true positive in a week, they may be happy. If they have a larger environment and get 30 false positives and 10 true positives in a week, they may be less happy; they might have been able to tolerate 10 false positives only before exceeding their bandwidth.
Thanks for the question, I could go on all day, and you've made me realize that this might make for a really interesting blog post, where I go into more detail still. I think there's a continuum in this sort of bootstrapping where you start with an art and end with a science, and do a little bit of each as you go through this awkward transition. Hit me up by email if you'd like to continue the conversation, and maybe let me pick your brain: firstname.lastname@example.org
You can see one of their talks and a great discussion at a BayLISA.org meeting.
There are a few testimonials on the website, but there are plenty of other proof points we can't attribute. Off the top of my head, here are a few that stand out:
1.) A latent LDAP server issue that would have taken down a mission-critical SaaS app at a Fortune 500 enterprise SW company. Detected and showed root-cause indicators.
2.) Two production bugs that were degrading service for a subset of users for weeks in a multi-billion-$ B2B SaaS company's production deployment. Detected and showed root-cause indicators.
3.) Multiple backend bugs degrading service in a $1B e-commerce company's production deployment. Detected and showed root-cause indicators.
4.) All OpenEBS issues that had been observed YTD in real customer deployments, replicated using Litmus by MayaData. Detected and showed root-cause indicators.
5.) Here's an unsolicited quote a devops consultant from the UK posted in our community 4 months ago:
"The data has started coming through and has picked up all the incidents I deliberately caused and a couple of other that I didn't know about.
This setup so cuts through the noise of logs to the heart of the matter that it would not be overstating the case to say that this is the future of Observability."
My question was more on the integration side. We already send logs and metrics to Datadog, so if we want to add Zebrium into the equation then we need to also send those there. I was wondering if some kind of integration would allow Zebrium to consume logs/metrics from Datadog, or just to make the integration easier. Just a thought.
In any case, I'm definitely curious to take Zebrium for a spin :)