
Show HN: Zebrium – ML that catches software incidents and shows you root cause - stochastimus
https://www.zebrium.com
======
csears
Congrats on the launch. Having worked at a startup in the AIOps space, I can
offer a few suggestions.

1\. No matter how good your AI, it will make mistakes. Users need ways to
provide feedback or filtering to avoid bad alert fatigue. Giving users a sense
of control is critical.

2\. Most larger shops will have dozens of monitoring tools already generating
alerts. Consider ingesting existing alerts as another algorithmic signal.

3\. The real root cause of an incident often won't show up in logs. Don't
assume that the earliest event in a cluster is causal.

4\. The more context you can provide an operator looking at a potential
incident, the better. Modern SIEM tools do an ok job here. Consider pulling in
topology or other enrichment sources and matching entity names/IDs to log
data.

Good luck. Contact in profile if you'd like to chat further.

~~~
Ajs1
csears, founder here - could not agree more with your comments.

1\. We learnt early to make user feedback easy (and immediately actionable) -
users can quickly "like", "mute" and "spam", or go more granular if needed.

2\. This is an insightful comment. A few of our early users gave us similar
feedback, and we've been hard at work. We'll soon be releasing a mode that
takes an incident signal from your incident management tool such as PagerDuty
or even Slack (often people create a Slack workspace per incident), and
constructs a report around it.

3 & 4 are good points as well. We don't disagree about enrichment, we just
need to stage things.

------
stochastimus
Hey folks,

Larry, Ajay and Rod here!

We're excited to share Zebrium's autonomous incident detection software.
Zebrium uses unsupervised machine learning to detect software incidents and
show you root cause. It's built to catch even "Unknown Unknowns" (problems you
don't have alert rules built for), the FIRST time you hit them. We believe
autonomous incident detection is an important tool for defeating complexity and
crushing resolution time.

** Get Started **

1) Go to our website and click "Get Started Free". Enter your name, email and
set a password.

2) Install our collectors from a list of supported platforms. For K8s it's a
one-command install. Join the newly created private Slack channel for alerts
(or add a webhook for your own).

3) That's it. Automatic incident detection starts within an hour and quickly
gets good. You can drill down into logs & metrics for more context if needed.

Getting started takes less than 2 minutes. It's free for 30 days with larger
limits and then free forever for up to 500MB/day.

** Here's what you WON'T have to do **

Manual training, code changes, connectors, parsers, configuration, waiting,
hunting, searching, alert rules, etc! It works with any app or stack.

** How It Works **

We structure all the logs and metrics we collect in-line at ingest, leverage
this structure to find normal and anomalous patterns of activity, then use a
point process model to identify unusually correlated anomalies (across
different data streams) to auto-detect incidents and find the relevant root-
cause indicators. Experience with over a thousand real-world incidents across
over a hundred stacks has confirmed that software behaves in certain
fundamental ways when it breaks.
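
To make that concrete, here's a toy Python sketch of the cross-stream
correlation idea - emphatically not our production code, and the windowing,
rate model and threshold below are made up purely for illustration:

    # Toy: flag windows where anomalies from *different* streams co-occur
    # far more often than independent arrivals would predict. All constants
    # here are illustrative, not Zebrium's real parameters.
    from collections import defaultdict
    from math import exp

    WINDOW = 60  # seconds

    def poisson_tail(k, lam):
        """P[X >= k] for X ~ Poisson(lam), computed naively."""
        term, cdf = exp(-lam), 0.0
        for i in range(k):
            cdf += term
            term *= lam / (i + 1)
        return max(0.0, 1.0 - cdf)

    def correlated_windows(anomalies, horizon_sec, alpha=1e-4):
        """anomalies: list of (timestamp_sec, stream_id) anomaly events.
        Returns windows where the count of distinct anomalous streams is
        improbable if the streams were independent."""
        rate = defaultdict(float)
        for _, stream in anomalies:
            rate[stream] += 1.0 / horizon_sec  # anomalies per second

        # expected number of streams showing an anomaly in any one window
        lam = sum(min(1.0, r * WINDOW) for r in rate.values())

        buckets = defaultdict(set)
        for ts, stream in anomalies:
            buckets[int(ts // WINDOW)].add(stream)

        hits = []
        for bucket, streams in sorted(buckets.items()):
            if len(streams) >= 2 and poisson_tail(len(streams), lam) < alpha:
                hits.append((bucket * WINDOW, sorted(streams)))
        return hits

    if __name__ == "__main__":
        events = [(10, "api.log"), (31, "db.log"), (45, "node_cpu"),
                  (52, "kubelet.log"), (4000, "api.log")]
        for start, streams in correlated_windows(events, horizon_sec=86400):
            print(f"candidate incident at t={start}s across {streams}")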

It turns out that we can detect important incidents automatically, with a
root-cause indicator when the logs and metrics reflect it. Zebrium works well
enough that our own team relies on it, and we believe you'll want to use it,
too.

~~~
eganist
Do you already have data models and ML to handle these tasks _today,_ or are
you still building data models with the aid of the clients who sign up for the
service?

If you have a functioning platform today, what percentage of events end up
needing human escalation?

Asking these questions considering the early stage of your product.

~~~
Ajs1
Hi, one of the founders here: The service is designed not to require specific
data models, because that approach does not scale, nor does it keep up with
changes in
application behavior. Instead, the ML engine learns data structures, normal
behavior of logs and metrics, and normal correlations between them for each
app deployment on the fly. Then when things break it does a very good job of
generating incidents. We make user feedback easy, so if we are "over-eager" in
detecting a certain kind of incident, your response trains the ML quickly. We
do improve the ML engine with experience of course (and have added some user
controls), but now have dozens of applications using us, and cumulatively have
over a thousand successfully detected incidents under our belts.

------
lalaland1125
One question:

Where is the systematic evidence that this product actually works? What are the
general false positive and false negative rates in a standard setup? Did you
construct various failed environments and measure the quality of the reports?
For this sort of thing I would expect a simulation of at least 10-20 failure
environments with detailed false positive/false negative rate measurements.
Right now you have a lot of cherry picked examples without any sort of
systematic setup (in particular, you don't seem to talk about false positives
anywhere).

~~~
stochastimus
Yeah, it's an interesting problem, bootstrapping such a thing.

When we started out, we took a stab at a model, and then collected about 50
incidents from about a dozen stacks. These were actually not of our creation,
but from real-world application stacks, where the owners gave us the data and
permission to use it. It was painstaking, but we were able to gather from them
what comprised a valid root-cause indicator from their perspective, for each
incident, and what did not.

So we collected these datasets and put them in what we call "the dojo". Then,
we ran our software against it. We achieved about a 2/3 recall rate on
detection+root-cause... meaning, a detection did not count as a true positive
unless we also caught a root-cause indicator and put it into the incident
report.
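
In code terms, the scoring rule looked roughly like this (a toy sketch; the
field names are invented for illustration, not our actual schema):

    # A detection only counts as a true positive if the generated report
    # also contains at least one event the stack owner labelled as a valid
    # root-cause indicator for that incident.
    def strict_recall(labelled_incidents, reports):
        """labelled_incidents: list of dicts with "id" and a set
        "root_cause_events"; reports: dict of incident id -> set of event
        ids we put into the auto-generated incident report."""
        hits = 0
        for incident in labelled_incidents:
            reported = reports.get(incident["id"], set())
            if reported & incident["root_cause_events"]:
                hits += 1
        return hits / len(labelled_incidents)

    # e.g. ~33 of 50 labelled incidents scored as hits -> recall ~ 2/3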

At that point, we reviewed what appeared to be systematic ways we were missing
things, and improved the software substantially. This improvement process took
us well north of 90% recall on this original dataset. But of course we had not
set aside a held-out validation set, so I don't know what the real recall would
have been at that point - and I don't care. The data
set was small enough that it was more important to learn everything possible
and get it implemented, so we could get happy users, so we could get MORE
data.

Getting such a well-curated dataset again is prohibitive in terms of effort at
our size, and the kind of feedback we now collect automatically has huge noise
built in. Some users hit "Like" when they're happy with an incident, and some
complain when they're not, but most times no feedback is given at the incident
level. Sometimes the feedback you get is that they sign up, or they keep using you,
or they pay you, or they leave. I think at scale we will have enough
Like/Mute/Spam feedback at the incident level to get meaningful systematic
answers from it; but we're not there yet, because this sort of feedback is
generally sparse.

Regarding false positives - here, I think, what matters is this: each user has
a finite amount of bandwidth / tolerance for noise. The important thing is
that we not exceed that, while missing as little as possible. If they have a
small environment and get three false positives and one true positive in a
week, they may be happy. If they have a larger environment and get 30 false
positives and 10 true positives in a week, they may be less happy; they might
have been able to tolerate only 10 false positives before exceeding their
bandwidth.

Thanks for the question, I could go on all day, and you've made me realize
that this might make for a really interesting blog post, where I go into more
detail still. I think there's a continuum in this sort of bootstrapping where
you start with an art and end with a science, and do a little bit of each as
you go through this awkward transition. Hit me up by email if you'd like to
continue the conversation, and maybe let me pick your brain: larry@zebrium.com

------
dgildeh
As a founder in the monitoring space, and now heading up the core monitoring
team at Netflix, I had a chance to work with Zebrium and have to say the
technology is impressive. Unlike other anomaly detection services, they've
done a lot of work to get decent incidents without too much noise completely
unsupervised - this is definitely the next generation of observability and
Zebrium has a clear head start in this space!

------
samdung
Just ran through your intro video. If it really does what it says, this is a
great product. I'll have my team test this tomorrow. Good luck on your launch.

~~~
Ajs1
thanks Samdung. Appreciate it, and any feedback once you try it.

------
robius
This is a game changer. I've met the team and they've got something special
here.

You can see one of their talks and a great discussion at a BayLISA.org
meeting.

[https://www.youtube.com/watch?v=gNiWtoxJ9iM](https://www.youtube.com/watch?v=gNiWtoxJ9iM)

------
paridiso
Cool! How does your software compare to other similar tools like BigPanda,
Moogsoft, Splunk ITSI?

~~~
Ajs1
Hi paridiso, we tackle the problem at a more foundational level. AIOps tools
are designed to speed up resolution and reduce noise, but they act on a feed
of alerts/incidents. So in a sense they are dependent on the quality of alerts
generated by monitoring tools. And typically a human will end up drilling down
into data like logs and metrics to determine root cause. Our ML acts on the
raw data, and has better coverage than typical log/metrics monitoring tools
(including previously unknown failure modes). It also cuts time to root cause
by generating complete incident summaries.

------
zumachase
Very cool, would love something like this. Your video gives a fairly
straightforward incident response which traditional tools would work equally
well on. Can you describe a situation that Zebrium does better than legacy
tools? Perhaps a hypothetical unknown unknown.

~~~
Ajs1
Hi, these 2 blogs list some scenarios. The first has more details because we
replicated these in house and can share full details:
[https://www.zebrium.com/blog/is-autonomous-monitoring-the-
an...](https://www.zebrium.com/blog/is-autonomous-monitoring-the-anomaly-
detection-you-actually-wanted). [https://www.zebrium.com/blog/beyond-anomaly-
detection-how-in...](https://www.zebrium.com/blog/beyond-anomaly-detection-
how-incident-recognition-drives-down-mttr)

~~~
Ajs1
and although this isn't quite answering your question about unknown unknowns,
this open source project lets you inject failure modes using a chaos tool
(litmus) on your own app. We had really good results catching application
incidents created by these chaos tests. [https://github.com/zebrium/zebrium-
kubernetes-demo](https://github.com/zebrium/zebrium-kubernetes-demo)

------
firefly77
Nice website, folks. The 2-minute intro video does a great job presenting the
value-prop. It looks like the solution detects events with a high probability
of being a problem automatically via ML. Can I define my own events using
custom condition criteria as well?

~~~
Ajs1
founder here: you certainly can. Our goal is to minimize this need for you,
but any team with experience already has some problem signatures/alerts for
known issues, and we've tried to make it easy to capture those. Our ML helps
even with this chore in one way: if you're building a signature that relies on
a log event, normally this is done with regexes, but you're at the mercy of a
developer not changing the syntax. Our ML will track these and ensure the
signatures don't break if the log format changes in a future rev.
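
To illustrate the brittleness (the log lines and regex below are invented, not
from any real stack):

    import re

    # A hand-written alert signature, pinned to today's exact log wording.
    SIGNATURE = re.compile(r"ERROR conn to db-(\d+) timed out after (\d+)ms")

    old_line = "ERROR conn to db-3 timed out after 5001ms"
    new_line = "ERROR db-3: connection timed out (elapsed=5001ms)"  # after a refactor

    print(bool(SIGNATURE.search(old_line)))  # True  -> alert fires
    print(bool(SIGNATURE.search(new_line)))  # False -> alert silently stops firing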

------
forgingahead
Congrats on the launch, and good luck! Looks fascinating. Looking forward to
the future release that _fixes_ the incidents as well, and just notifies us
afterwards as a courtesy. =)

~~~
Ajs1
:) founder here: well, if there is a runbook for a known failure, we can
trigger it via webhook. But auto-remediation for a previously unknown failure
is of course a much different beast. Ambition for the future...

------
gingerlime
Looks really promising. Congrats on the launch. Any plans to integrate with
Datadog? (or just make the transition / co-existence easier)

~~~
Ajs1
hi gingerlime, thank you. One thing we're working on is taking an incident
signal from other tools and augmenting it with an incident report. We're
starting with PagerDuty and Slack integrations as sources, but could see
extending that to other APM/monitoring tools (DataDog could fit here). Of
course we do have some overlap in the latter case (for logs & metrics). If you
have something more specific in mind - let's connect. There's more detail on
the above-mentioned plan here: [https://www.zebrium.com/blog/youve-nailed-
incident-detection...](https://www.zebrium.com/blog/youve-nailed-incident-
detection-what-about-incident-resolution)

~~~
gingerlime
Thanks for sharing the details. Looks interesting. I guess if we get Datadog
alerts into slack and Zebrium listening on the same channel, then we can
achieve something similar to what you described for PagerDuty. Right?
Definitely sounds interesting. I didn't think about that aspect.

My question was more on the integration side. We already send logs and metrics
to Datadog, so if we want to add Zebrium into the equation then we need to
_also_ send those there. I was wondering if some kind of integration would
allow Zebrium to consume logs/metrics from Datadog, or just to make the
integration easier. Just a thought.

In any case, I'm definitely curious to take Zebrium for a spin :)

~~~
Ajs1
Right - we will consume alerts from other tools. I see your point about
consolidating collection. We'll look for opportunities like this if we can.
For now, we do try to make it easy and lightweight to set up our collectors.
And many of our users do have multiple collectors/agents on the same clusters.
Please contact us if you'd like to give it a try.

------
sekka1
Had them at my meetup yesterday and they presented. Super interesting tool.
Zero config!

