Launch HN: Orbiter (YC W20) – Autonomous data monitoring for non-engineers
107 points by zhangwins on March 7, 2020 | 41 comments
Hello HN! We are Victor, Mark, and Winston, founders of Orbiter (https://www.getorbiter.com). We monitor data in real-time to detect abnormal drops in business and product metrics. When a problem is detected, we alert teams in Slack so they never miss an issue that could impact revenue or the user experience.

Before Orbiter, we were product managers and data scientists at Tesla, DoorDash, and Facebook. It often felt impossible to keep up with the different dashboards and metrics while also actually doing work and building things. Even with tools like Amplitude, Tableau, and Google Data Studio, we would still catch real issues days or weeks late. This led to lost revenue and bad customer experiences (i.e. angry customers who tweet Elon Musk). We couldn't stare at dashboards all day, and we needed to quickly understand which fluctuating metrics were concerning. We also saw that our engineering counterparts had plenty of tools for passive monitoring and alerting—PagerDuty, Sentry, DataDog, etc.—but the business and product side didn't have many. We built Orbiter to solve these problems.

Here’s an example: at a previous company, a number of backend endpoints were migrated, which unknowingly caused a connected product feature in the Android shopping flow to disappear. Typically, users in that part of the shopping flow progress to the next page at a 70% rate, but because of the missing feature, this rate dropped by 5% absolute. This was a serious issue but was hard to catch by looking at dashboards alone because: 1) this was just one number changing out of hundreds of metrics that change every hour, 2) this number naturally fluctuates daily and weekly, especially as the business grows, and 3) it would have taken hours of historical data analysis to ascertain that a 5% drop was highly abnormal for that day. It wasn’t until this metric stayed depressed for many days that someone found it suspicious enough to investigate. All in, including the time to implement and deploy the fix, conversion was depressed for seven days, costing more than $50K in reduced sales.

It can be especially challenging for the human eye to judge the severity of a changing metric; seasonality, macro trends, and sensitivity all muddy the conclusions. To solve this, we build machine learning models for your metrics that capture the normal/abnormal patterns in the data. We use a supervised learning approach for our alerting algorithm to identify real abnormalities: we forecast the expected “normal” metric value and also classify whether an abnormality should be labeled as an alert. Specifically, forecasting models identify macro trends and seasonality patterns (e.g. this particular metric is over-indexed on Mondays and Tuesdays relative to other days of the week). Classifier models determine the likelihood of true positives based on historical patterns. Each metric has an individual sensitivity threshold that we tune with our customers so the alerting conditions catch real issues without being overly noisy. Models are re-trained weekly, and we take user feedback on alerts to update the model and improve accuracy over time.
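As a minimal sketch of the forecast-then-classify idea (illustrative only; here a fixed z-score cut-off stands in for the trained classifier, and all names are made up):

```python
import statistics

def detect_anomaly(history, observed, period=7, z_threshold=3.0):
    """Seasonal forecast plus a simple classification rule, as a sketch.

    Forecast = mean of past values in the same weekly slot (capturing the
    "over-indexed on Mondays" effect); the deviation of `observed` is then
    scored in standard deviations of that slot.
    """
    # Pick the historical values that share the observed point's weekly slot.
    slot = [v for i, v in enumerate(history) if i % period == len(history) % period]
    expected = statistics.mean(slot)
    # Guard against a zero spread when the slot history is constant or tiny.
    spread = (statistics.stdev(slot) if len(slot) > 1 else 0.0) or 1.0
    z = (observed - expected) / spread
    return expected, z, abs(z) > z_threshold
```

In the real system, a learned classifier and a per-metric tuned threshold replace the fixed cut-off.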

Some of our customers are startups with sparse data. In these cases, it can be challenging to build a high-confidence model. What we do instead is work with our customers to define manual settings for “guardrails” that trigger alerts. For example, “Alert me if this metric falls below 70%!” or “Alert me if this metric drops more than 5% week over week”. As our customers grow and their datasets grow, we can apply greater intelligence to their monitoring by moving over to the automated modeling approach.
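Expressed in code, a guardrail check might look like this (a minimal sketch; the function and default settings are illustrative, not our actual implementation):

```python
def check_guardrails(current, week_ago, floor=0.70, max_wow_drop=0.05):
    """Evaluate the two example guardrails above for one metric value.

    `floor` and `max_wow_drop` are the per-metric manual settings.
    Returns a list of triggered alert messages (empty means all clear).
    """
    alerts = []
    if current < floor:
        alerts.append(f"metric fell below {floor:.0%} (now {current:.1%})")
    if week_ago and (week_ago - current) / week_ago > max_wow_drop:
        alerts.append(f"metric dropped more than {max_wow_drop:.0%} week over week")
    return alerts
```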

We made Orbiter so that it's easy for non-technical teams to set up and use. It's a web app, requires no engineering work, and connects to existing analytics databases the same way that dashboard tools like Looker or a SQL editor do. Teams connect their Slack to Orbiter so they get immediate notifications when a metric changes abnormally.

We anticipate that the HN community has members, teammates, or friends who are product managers, businesspeople, or data scientists that might have the problems we experienced. We’d love for you and them to give Orbiter a spin. Most importantly, we’d love to hear your feedback! Please let us know in the thread, and/or feel free to send us a note at hello@getorbiter.com. Thank you!

I work on power stations which normally have about 1000 monitored variables per turbine-generator and another 500 for the plant in general. So typically 2500 for a two unit plant.

Alarms are generated if a variable exceeds a threshold, or a binary variable is in the wrong state.

Is Orbiter something that would benefit power plants?

Hey generatorguy - this is a really interesting use case, so thanks for sharing. I imagine our modeling / monitoring / alerting capabilities can extend to power plants, but we'd need to understand the data better. The common types of business and product metrics that our customers look for include user growth, cancellation rates, call failure %s, all of the above by different geos, etc. Happy to chat more if you'd like to shoot me an email (I'm winston[at]getorbiter.com)

I think some sort of anomaly detection would be useful in your case. There are a bunch of libraries floating about; I remember at least Netflix[0], Yelp and Datadog talking about them. There appears to be a really good links page available too[1]. You can also learn a lot from Forecasting: Principles and Practice, which is free online[2].

I have previously pitched using a kind of SPC-for-metrics approach, with Nelson rules[3] to help surface metrics which are starting to move out of control. I think it would have the advantage over ML techniques that it's easy to understand.
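For a concrete flavour, two of the Nelson rules are only a few lines (a sketch; in practice the baseline mean and sigma would come from an in-control reference period):

```python
def nelson_violations(series, mean, sigma):
    """Check two of the Nelson rules on a metric series.

    Rule 1: a point more than 3 sigma from the mean.
    Rule 2: nine consecutive points on the same side of the mean.
    Returns (index, rule) pairs for every violation found.
    """
    hits = []
    for i, x in enumerate(series):
        if abs(x - mean) > 3 * sigma:
            hits.append((i, "rule1"))
    for i in range(len(series) - 8):
        window = series[i:i + 9]
        if all(x > mean for x in window) or all(x < mean for x in window):
            hits.append((i, "rule2"))
    return hits
```

Part of the appeal is exactly this transparency: anyone can read the rule that fired.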

My experience is that alerting thresholds are a very poor mechanism for managing systems. They just ossify past disasters and typically become noise. Alert fatigue renders them meaningless. If they're set by the manufacturer then the incentives are broken, they will favour false alerts in order to push legal responsibility onto the operator.

[0] https://github.com/Netflix/Surus

[1] https://github.com/yzhao062/anomaly-detection-resources

[2] https://otexts.com/fpp2/

[3] https://en.wikipedia.org/wiki/Nelson_rules

thanks for the links.

We only create an alert if there is a problem the operator can solve; otherwise there is no point in waking them up at 3 AM. So if anything, our thresholds are set as loose as possible rather than as tight as possible.

However there are many instances where the operator could be alerted earlier that the machine's operation is abnormal. For example, the stator windings are rated for operation up to 155 degrees C, but say the machine has been lightly loaded for a long time, the ambient temperature is normal, and the windings are at 140 degrees. No alert would be generated from the stator winding temperature, but something is amiss.

I think this is the case where some ML/AI/hypeword techniques might be applicable: the controller could learn from past operation what half a dozen variables imply about the expected values of other variables.
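Even a simple regression residual captures the stator example (a sketch with made-up numbers; a real model would use more variables like ambient temperature and run time):

```python
def fit_line(pairs):
    """Least-squares fit of winding_temp ~ load from historical (load, temp) pairs."""
    n = len(pairs)
    mx = sum(x for x, _ in pairs) / n
    my = sum(y for _, y in pairs) / n
    sxx = sum((x - mx) ** 2 for x, _ in pairs)
    sxy = sum((x - mx) * (y - my) for x, y in pairs)
    slope = sxy / sxx
    return slope, my - slope * mx

def temp_residual(model, load, observed_temp):
    """How far the observed temperature sits above what past operation predicts."""
    slope, intercept = model
    return observed_temp - (slope * load + intercept)
```

With history like (0% load, 50 C) through (100% load, 150 C), a lightly loaded machine reading 140 C shows a residual of about 70 C: well under the 155 C rating, but clearly abnormal.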

You should take a look at http://riemann.io

I agree with focusing on actionable alerts during on-call hours. You might be able to have some kind of scheduled change in sensitivity.

One thing I've wondered in the past year is whether fuzzy logic would be useful. Your example is a really good case of linguistic variables -- "lightly loaded", "a long time", "normal temperature" and so on. These can be assembled into rules or tables that should fire more sensibly than exact threshold values.
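A toy version of that rule using triangular membership functions (the breakpoints here are invented purely for illustration):

```python
def tri(x, a, b, c):
    """Triangular membership: rises from 0 at a to 1 at b, back to 0 at c."""
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x <= b else (c - x) / (c - b)

def lightly_loaded(load_pct):
    return tri(load_pct, -1, 0, 40)    # fully "light" at 0% load

def winding_hot(temp_c):
    return tri(temp_c, 110, 155, 200)  # anchored on the 155 C rating

def rule_abnormal(load_pct, temp_c):
    """IF lightly loaded AND winding hot THEN abnormal; AND as min(), the usual t-norm."""
    return min(lightly_loaded(load_pct), winding_hot(temp_c))
```

A 140-degree winding at 10% load fires the rule at about 0.67, while the same temperature at 90% load fires it at 0, which matches the intuition in the example.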

Not OP, but I researched scalable anomaly detection systems for power-generating assets. We collaborated with a large industrial engine manufacturer on this work. https://arxiv.org/abs/1701.07500. The key challenge customers encountered was the prevalence of false alarms that led to unnecessary service.

Woah this is awesome. How did you guys resolve the false alarm issue wrt power plants?

There is a small company in Lund, Sweden that specializes in this. It's run by a former professor of mine from uni. The basic idea is to build a model of the system and connect detector outputs to it; it will use that info to detect anomalies and filter errors to find the root cause. https://www.goalart.com/ (not affiliated in any way, except as already stated).

Out of curiosity, since I'm interested in industrial monitoring: would you mind telling a bit more about the monitoring infrastructure, esp. how often are those metrics collected and what data protocols are involved in the process?

I only know from my own experience and I'm essentially self-taught, so I don't know what the industry norms are, only what has worked for me and my customers.

The instruments and controlled devices are wired to a PLC such as an Allen-Bradley ControlLogix or Schneider Electric M580. The PLC generally reads the inputs, executes the program, and updates the outputs every 10 ms. HMI software running on a computer, such as Inductive Automation Ignition, VTScada, Wonderware, or Citect, reads data from the PLC to display to the operator and record for history. Protocols are often Modbus or Common Industrial Protocol (CIP), a flavor of which goes by the ridiculous name EtherNet/IP, but that's the kind of shit you get in industrial automation.

I generally set the HMI software to record my 2500 values once per second.

During testing it is common to use a data acquisition system that can sample even much faster than the PLC runs, eg 1 kHz.
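The once-per-second recording boils down to a fixed-rate loop. A sketch, where `read_values` and `store` are hypothetical stand-ins for the PLC read (Modbus/CIP) and the history-database write:

```python
import time

def record_loop(read_values, store, interval_s=1.0, max_samples=None):
    """Poll and record at a fixed rate without drift, like an HMI historian.

    `read_values` and `store` are hypothetical callables standing in for the
    PLC read and the history write; `max_samples=None` runs forever.
    """
    next_tick = time.monotonic()
    taken = 0
    while max_samples is None or taken < max_samples:
        store(time.time(), read_values())
        taken += 1
        # Schedule against the monotonic clock so slow iterations don't drift.
        next_tick += interval_s
        time.sleep(max(0.0, next_tick - time.monotonic()))
```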

Really excited about this.

We're very early into doing a PoC where we use DataDog/Cloudwatch for our business metrics for this specific use case. We're also looking at tracking data quality metrics. The standard BI reporting tools are very immature when it comes to alerting based on changes in data over time.

I hope at some point you consider ingesting metrics like the ops tools do. Giving you direct access to my database is going to be really challenging but I'm glad to send you what I want you to keep track of.

Ah very interesting, and agreed on the immaturity of alerting/time-series changes in current BI reporting tools. Would be great if you could send me more info about what you're thinking of tracking, and to hear more about the PoC you're planning. Would you mind sending a note to winston[at]getorbiter.com?

This is interesting. If you can deliver on this, I'm guessing you can deliver on a lot more. Figuring out what warrants an alert is a non-trivial problem, and it's in the same problem space as answering other business questions like "what is our true organic traffic".

Also, on metric drops I'm interested not just in the alerts but also in the narrowing down of what is causing the drop. For example, the first question we always ask is "could marketing blend be causing this". I imagine your ML can figure that out. You could also point out where to look, like "iOS 13 is fine, but there is a severe drop in conversion for iOS 12" or "Conversion dropped for app version 13.2 on Android".

Great stuff! I'd love to see if it works!

Funny, actually I know a startup in Edinburgh that has figured out “true organic traffic”, and they've used ML to fix the data for their marketing attribution model.

Was this ML attribution model output explainable / deterministic? I've seen some really complicated marketing attribution models in the past and heard it was something of a never-ending battle to understand and arrive at the "right" model.

I believe it is explainable as I didn’t hear anything fancy about the model being built. It’s been tested and proven to cut marketing spend quite a bit while delivering the same results. A patent has also been filed.

You are spot on that sometimes we just overcomplicate models, and sometimes it's best to go with something explainable and deterministic but less accurate, as opposed to something more accurate but complicated.

Thanks! We're rolling out slowly (kinda Superhuman-onboarding style, back in their old days) so definitely hope to get in touch with you soon :)

Also re: narrowing down what's causing the drop, that's definitely on the roadmap. We know teams have playbooks of things to check when something looks wrong, so we should be able to productize & automate this.

Congrats on the launch. This is a really interesting space that I think has a ton of potential - I'm watching pretty closely to see what comes out of it.

Have you heard of Outlier (https://outlier.ai)? Do you have any thoughts? How does Orbiter compare to Outlier?

(I haven't used Outlier but see it come up in anomaly detection discussion a lot recently).

Thank you! There's definitely a lot of growth and potential in this space and we're really excited too. We're focused on intelligent monitoring and alerting for metrics that the user cares about & defines. We also automate the diagnostic playbooks that teams use today after detecting an issue (eg check data, check user segments, check geographies, etc.) Outlier seems to focus on insights and less on monitoring/alerting. They comb through data to surface 4-5 "unexpected insights" about your customers or business every day in a FB feed-type product.

Hi! (Founder of Outlier.ai here) You are right, our platform is designed to produce the most important insights from massive amounts of data, without requiring human supervision/configuration. It is most useful in applications when there is too much data to set up guardrails, or the teams don't know what guardrails to create. Our typical customers are very large consumer businesses who have data spread across dozens of systems and need to ensure they never miss important emerging trends or problems.

We are not an alerting or monitoring system, so I don't think you'd use us for the same applications as Orbiter. The typical users of Outlier are business users, ranging from executives to business operations, who want to make sure they are asking the right questions about the business.

Orbiter looks like a great product; good luck in building your business!

Without looking into the details of your solution, what's the difference between your solution and CloudWatch anomaly detection? https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitori...

Orbiter anomaly detection is for any DB (e.g. Postgres, Snowflake) and metrics that business/product teams tend to track such as transaction conversion %, user growth, add item to basket %, etc.

Amazon CloudWatch anomaly detection is for AWS resources & apps, and covers infra metrics like resource utilization, app performance, and ops health.

In terms of anomaly detection capabilities, both use similar machine learning processes to detect metric issues automatically!

P.S. If you get curious about the details of our solution, we have a 2 minute video demo ;) Cheers! https://www.youtube.com/watch?v=R7P_M6j0P2A

Thanks for replying. That's a good demo. However, I don't necessarily agree CloudWatch is only for infra metrics. Theoretically, you could send any metrics to CW and leverage the anomaly detection feature. Given that it aggregates data over time and you could lose granularity in your data, that's probably not a good idea for business-centric data. Then I found AWS QuickSight (https://aws.amazon.com/quicksight/features-ml/?nc=sn&loc=2&d...) which seems to offer similar features?

Congrats on launching! Looks very helpful!

As a data scientist, I found that a drop in metrics was just as often due to a data pipeline issue as it was an actual business problem. This unfortunately causes business users to lose trust in the metrics quickly. How do you plan to differentiate between those two root causes of metric changes?

Ah I can empathize with you here (as a former DS) -- we had incidents in the past that were data pipeline / instrumentation changes causing bad data which then caused metric drops (versus a real product issue, but they nonetheless caused a loss of confidence in data).

We think there are a number of diagnostic features that could be helpful here (to be built!). Teams today run playbooks to root cause issues when metric drops happen. We should be able to take that playbook and automate it. Say, Orbiter identifies an abnormal change in Metric X. The team is then probably analyzing sub-funnel metrics Y and Z, or looking at various dimension cuts to isolate the issue. Maybe they're also checking data quality by comparing the count of event volume vs. count of user IDs vs. count of device IDs, etc. If we run all of these diagnostic checks when Metric X drops, we could give the team insight into what we know is OK vs. not OK.
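In sketch form, the automated playbook is just a named set of checks run on alert (illustrative names, not how it's actually built):

```python
def run_playbook(checks):
    """Run each named diagnostic check after an alert fires on Metric X.

    `checks` maps a human-readable step name to a zero-argument callable
    returning True when that check passes; the steps mirror the manual
    playbook (sub-funnel metrics, dimension cuts, data-quality counts).
    """
    return {name: ("OK" if check() else "NOT OK") for name, check in checks.items()}

# Hypothetical usage: each lambda would wrap a real query.
report = run_playbook({
    "sub-funnel metric Y within normal range": lambda: True,
    "event volume matches user-ID count": lambda: False,
})
```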

That's really cool! Besides identifying abrupt changes in metric X, for me the most difficult part is trying to understand what caused the change in X. Great to know that you have this on the roadmap, but do you think it's possible to develop a model/automation that is generic enough to be used across different businesses? Maybe analysing the correlation between different time series could be a way to go?

It’s definitely possible if you have the underlying data definitions, so you’re not having to compare time series across industries (it’ll be hard because every single business’s metrics could be so different based on the way the metrics themselves are set up).
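Correlation screening is a reasonable first cut. A sketch that ranks candidate driver series against the dropped metric (names are illustrative):

```python
import statistics

def pearson(xs, ys):
    """Pearson correlation of two equal-length series."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = (sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys)) ** 0.5
    return num / den

def rank_suspects(target, candidates):
    """Rank candidate driver series by |correlation| with the dropped metric."""
    scores = {name: pearson(target, series) for name, series in candidates.items()}
    return sorted(scores.items(), key=lambda kv: -abs(kv[1]))
```

Correlation only surfaces candidates; confirming causation still needs the domain playbook.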

Avora (https://avora.com/product/) and ThoughtSpot (https://Thoughtspot.com) both have root-cause capability.

This is really cool! Our search-engine-based impressions dropped substantially in early Feb, and because we didn't have that in our main dashboards, it took us almost 2 weeks to discover it. Orbiter would've been pretty useful for that - got in touch!

Thanks! Looking forward to getting connected. We've heard SEO-specific use cases come up with some of the other companies we've worked with too: you basically need to find the exact time your SEO ranking saw a material change, because it's usually driven by something that shipped then. Otherwise it takes a long time to win back the traffic from GOOG.

This sounds really cool! I've wished for something like this many times. I am mostly attracted by the fact that it would be mostly automatic. I am hoping it lives up to the hype.

Signed up for the beta. All the best!

I've been talking with AppDynamics, and much of what you have said in this thread could have been said by AppD. Are you hoping to get some of their market?

I really like the manual configurability. In our startup, we work a lot with influencers, and it's very usual for us to have strong spikes in signups (and also high/low CVR for different quality of influencers and "strength of promotion"). This nature would cause a purely ML-based model to constantly shout alerts.

I saw y’all on PH and immediately submitted to get early access. Super excited to try it out, and congrats on the launch!

Woohoo! Thanks so much. Looking forward to getting in touch soon :D

Congrats on the launch! This is definitely a real, big problem.

This is what Maxly.com used to do. They learned several valuable lessons, and it would be worth talking to one of the founders.

Thanks for the tip - haven't heard of Maxly before but will do some research

Congrats on the launch. We’re in an adjacent/overlapping space with drift detection/model monitoring so it’s always very exciting to see automatic data monitoring tools come into place. We’re hoping that as more and more startups come onboard the better it is for all of us. Cheers and best wishes!

Thank you! We're actually a team of Canadians too (but have been living/working in the SF Bay Area) :D Always great to see more applications for data science - best wishes to you too!

Go Winston! So much better than trying to do this in Tableau
