Launch HN: Orbiter (YC W20) – Autonomous data monitoring for non-engineers

generatorguy · on March 7, 2020

I work on power stations which normally have about 1000 monitored variables per turbine-generator and another 500 for the plant in general. So typically 2500 for a two unit plant.

Alarms are generated if a variable exceeds a threshold, or a binary variable is in the wrong state.

Is Orbiter something that would benefit power plants?

zhangwins · on March 7, 2020

Hey generatorguy - this is a really interesting use case so thanks for sharing. I imagine our modeling / monitoring / alerting capabilities can extend to power plants but will need to understand the data better. The common types of business and product metrics that our customers look for include user growth, cancellation rates, call failure %s, all of the above by different geos, etc. Happy to chat more if you'd like to shoot me an email (I'm winston[at]getorbiter.com)

jacques_chester · on March 8, 2020

I think some sort of anomaly detection would be useful in your case. There are a bunch of libraries floating about, I remember at least Netflix[0], Yelp and Datadog talking about them. There appears to be a really good links page available too[1]. You can also learn a lot from Forecasting: Principles and Practice, which is free online[2].

I have previously pitched using a kind of SPC-for-metrics approach, with Nelson rules[3] to help surface metrics which are starting to move out of control. I think it would have the advantage over ML techniques that it's easy to understand.
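To make the SPC idea concrete, here is a minimal sketch of the first three Nelson rules in plain Python. The function name and the split into a "baseline" window (used to estimate the centre line and sigma) and a "recent" window are my own assumptions for illustration, not something from a particular library.

```python
from statistics import mean, stdev

def nelson_violations(baseline, recent):
    """Check `recent` points against limits estimated from `baseline`.

    Returns a list of (rule_number, index_in_recent) tuples.
    """
    centre = mean(baseline)
    sigma = stdev(baseline)
    hits = []

    # Rule 1: a single point more than 3 sigma from the centre line.
    for i, x in enumerate(recent):
        if abs(x - centre) > 3 * sigma:
            hits.append((1, i))

    # Rule 2: nine or more consecutive points on the same side of the centre line.
    run, side = 0, 0
    for i, x in enumerate(recent):
        s = (x > centre) - (x < centre)  # +1 above, -1 below, 0 on the line
        if s != 0 and s == side:
            run += 1
        else:
            run = 1 if s != 0 else 0
        side = s
        if run >= 9:
            hits.append((2, i))

    # Rule 3: six or more consecutive points steadily rising or steadily falling.
    run, direction = 1, 0
    for i in range(1, len(recent)):
        d = (recent[i] > recent[i - 1]) - (recent[i] < recent[i - 1])
        if d != 0 and d == direction:
            run += 1
        else:
            run = 2 if d != 0 else 1
        direction = d
        if run >= 6:
            hits.append((3, i))

    return hits
```

The appeal is that every hit maps back to a rule a person can read, e.g. "nine points in a row above the mean", rather than a model score.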

My experience is that alerting thresholds are a very poor mechanism for managing systems. They just ossify past disasters and typically become noise. Alert fatigue renders them meaningless. If they're set by the manufacturer then the incentives are broken: they will favour false alerts in order to push legal responsibility onto the operator.

[0] https://github.com/Netflix/Surus

[1] https://github.com/yzhao062/anomaly-detection-resources

[2] https://otexts.com/fpp2/

[3] https://en.wikipedia.org/wiki/Nelson_rules

generatorguy · on March 9, 2020

thanks for the links.

We only create an alert if there is a problem the operator can solve, otherwise there is no point in waking them up at 3 AM, so if anything our thresholds are set as loose as possible instead of as tight as possible.

However there are many instances where the operator could be alerted earlier that the machine operation is abnormal. For example the stator windings are rated for operation up to 155 degrees C but the machine is lightly loaded for a long time, the ambient temperature is normal, and the windings are 140 degrees. No alert would be generated from the stator winding temperature but something is amiss.

I think this is the case where some ML/AI/hypeword techniques might be applicable: given half a dozen variables, the controller could learn from past operation what values to expect for the other variables.
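A rough sketch of what I mean, in Python. It uses a plain nearest-neighbour lookup over past normal operation rather than anything fancy; the function names, the choice of k, and all the numbers in the usage note are made up for illustration.

```python
from math import dist
from statistics import mean

def expected_value(history, conditions, k=3):
    """Average the target over the k past samples with the closest conditions.

    history: list of (conditions_tuple, target_value) from normal operation.
    In practice the condition variables should be scaled to comparable
    ranges first; that step is omitted here.
    """
    nearest = sorted(history, key=lambda row: dist(row[0], conditions))[:k]
    return mean(target for _, target in nearest)

def is_abnormal(history, conditions, actual, tolerance, k=3):
    """True when the actual value is further than `tolerance` from expectation."""
    return abs(actual - expected_value(history, conditions, k)) > tolerance
```

With a hypothetical history of ((load %, ambient C), winding C) samples, a 140 degree reading at 21% load would be flagged even though it is under the 155 degree rating, while the same reading at 92% load would not.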

vladsanchez · on March 22, 2020

You should take a look at http://riemann.io

jacques_chester · on March 9, 2020

I agree with focusing on actionable alerts during on-call hours. You might be able to have some kind of scheduled change in sensitivity.

One thing I've wondered in the past year is whether fuzzy logic would be useful. Your example is a really good case of linguistic variables -- "lightly loaded", "a long time", "normal temperature" and so on. These can be assembled into rules or tables that should fire more sensibly than exact threshold values.
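Something like this, as a minimal Python sketch. The membership shapes and every breakpoint (30-60% load, 1-6 hours, and so on) are invented for illustration, and AND is taken as the minimum of the memberships, which is the usual simple choice.

```python
def ramp_up(x, low, high):
    """Membership rising linearly from 0 at `low` to 1 at `high`."""
    if x <= low:
        return 0.0
    if x >= high:
        return 1.0
    return (x - low) / (high - low)

def ramp_down(x, low, high):
    """Membership falling linearly from 1 at `low` to 0 at `high`."""
    return 1.0 - ramp_up(x, low, high)

def triangle(x, left, peak, right):
    """Membership that is 1 at `peak` and 0 outside (left, right)."""
    if x <= left or x >= right:
        return 0.0
    if x <= peak:
        return (x - left) / (peak - left)
    return (right - x) / (right - peak)

def abnormality(load_pct, hours_at_load, ambient_c, winding_c):
    """Degree (0..1) to which the rule
    'lightly loaded AND a long time AND normal ambient AND hot windings' fires.
    All breakpoints below are hypothetical."""
    lightly_loaded = ramp_down(load_pct, 30, 60)
    long_time = ramp_up(hours_at_load, 1, 6)
    normal_ambient = triangle(ambient_c, 10, 25, 40)
    hot_windings = ramp_up(winding_c, 100, 130)
    return min(lightly_loaded, long_time, normal_ambient, hot_windings)
```

The output is a degree rather than a yes/no, so a borderline case such as 45% load fires the rule only partially instead of flipping an alarm at an exact threshold.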

parasj · on March 7, 2020

Not OP, but I researched scalable anomaly detection systems for power-generating assets. We collaborated with a large industrial engine manufacturer on this work. https://arxiv.org/abs/1701.07500. The key challenge customers encountered was the prevalence of false alarms that led to unnecessary service.

zhangwins · on March 7, 2020

Woah this is awesome. How did you guys resolve the false alarm issue wrt power plants?

hohloma · on March 9, 2020

There is a small company in Lund, Sweden that specializes in this. It's run by a former professor of mine at uni. The basic idea is to build a model of the system and connect the detectors' outputs to it; it then uses that info to detect anomalies and filter errors to find the root cause. https://www.goalart.com/ Not affiliated in any way, except as already stated.

rixed · on March 9, 2020

Out of curiosity, since I'm interested in industrial monitoring: would you mind telling a bit more about the monitoring infrastructure, esp. how often are those metrics collected and what data protocols are involved in the process?

generatorguy · on March 10, 2020

I only know from my own experience and I’m essentially self-taught, so I don’t know what industry norms are, only what has worked for me and my customers.

The instruments and controlled devices are wired to a PLC such as an Allen-Bradley ControlLogix or Schneider Electric M580. The PLC generally reads the inputs, executes the program, and updates the outputs every 10 ms. HMI software running on a computer, such as Inductive Automation Ignition, VTScada, Wonderware, Citect, etc., reads data from the PLC to display to the operator and record for history. Protocols are often Modbus or Common Industrial Protocol (CIP), which (or some flavor of it) also goes by the ridiculous name of EtherNet/IP, but that’s the kind of shit you get in industrial automation.

I generally set the HMI software to record my 2500 values once per second.

During testing it is common to use a data acquisition system that can sample much faster than the PLC runs, e.g. 1 kHz.
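For a rough sense of the volumes, here is the arithmetic from the figures above as a small Python sketch. It only uses the numbers already stated (10 ms scan, 2500 values, one record per second); the variable names are just for illustration.

```python
# Figures taken from the description above; names are illustrative only.
PLC_SCAN_MS = 10        # PLC reads inputs and updates outputs every 10 ms
LOGGED_VALUES = 2500    # values recorded by the HMI software
LOG_RATE_HZ = 1         # each value is recorded once per second

# How many PLC scans happen per second, and per recorded sample.
scans_per_second = 1000 // PLC_SCAN_MS
scans_per_logged_sample = scans_per_second // LOG_RATE_HZ

# Total recorded samples per day across all values.
samples_per_day = LOGGED_VALUES * LOG_RATE_HZ * 24 * 60 * 60
```

So the history only sees one in every hundred PLC scans, and still adds up to a couple of hundred million samples a day.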

dataminded · on March 7, 2020

Really excited about this.

We're very early into doing a PoC where we use Datadog/CloudWatch for our business metrics for this specific use case. We're also looking at tracking data quality metrics. The standard BI reporting tools are very immature when it comes to alerting based on changes in data over time.

I hope at some point you consider ingesting metrics like the ops tools do. Giving you direct access to my database is going to be really challenging but I'm glad to send you what I want you to keep track of.

zhangwins · on March 7, 2020

Ah very interesting, and agree on the immaturity of alerting/time-series changes for current BI reporting tools. Would be great if you could send me more info about what you're thinking about tracking & also hear more about the PoC you guys are thinking of. Would you mind sending me a note to winston[at]getorbiter.com?

idrism · on March 8, 2020

This is interesting. If you can deliver on this, I'm guessing you can deliver on a lot more. Figuring out what warrants an alert is a non-trivial problem, and it's in the same problem space as answering other business questions like "what is our true organic traffic".

Also, on metric drops I'm interested not just in the alerts but also in the narrowing down of what is causing the drop. For example, the first question we always ask is "could marketing blend be causing this". I imagine your ML can figure that out. You could also point out where to look, like "iOS 13 is fine, but there is a severe drop in conversion for iOS 12" or "Conversion dropped for app version 13.2 on Android".
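To illustrate the kind of narrowing-down I mean, here is a minimal Python sketch that ranks segments by how much their conversion rate fell between two periods. The function name and the input shape (segment -> (conversions, visits)) are assumptions for the example, not anything Orbiter exposes.

```python
def rank_segments(before, after):
    """Rank segments by change in conversion rate, biggest drop first.

    before / after: dict mapping segment name -> (conversions, visits).
    Segments missing from either period, or with zero visits, are skipped.
    """
    changes = []
    for segment in before.keys() & after.keys():
        b_conv, b_visits = before[segment]
        a_conv, a_visits = after[segment]
        if b_visits == 0 or a_visits == 0:
            continue
        delta = a_conv / a_visits - b_conv / b_visits
        changes.append((segment, delta))
    # Most negative change (largest drop) first.
    return sorted(changes, key=lambda item: item[1])
```

Fed per-OS-version counts, the top entry would be the segment to look at first, which is the "iOS 12 dropped, iOS 13 is fine" type of pointer.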

Great stuff! I'd love to see if it works!

tixocloud · on March 8, 2020

Funny, actually I know a startup in Edinburgh that has figured out the “true organic traffic” and they’ve used ML to fix the data for their marketing attribution model.

zhangwins · on March 8, 2020

Was this ML attribution model output explainable / deterministic? I've seen some really complicated marketing attribution models in the past and hear it was something of a never-ending battle to understand and arrive at the "right" model.

tixocloud · on March 8, 2020

I believe it is explainable as I didn’t hear anything fancy about the model being built. It’s been tested and proven to cut marketing spend quite a bit while delivering the same results. A patent has also been filed.

You are spot on that sometimes we just overcomplicate models and sometimes it’s best to go with something explainable and deterministic but less accurate, as opposed to something more accurate but complicated.

zhangwins · on March 8, 2020

Thanks! We'll be rolling out slowly (kinda Superhuman onboarding style back in their old days) so definitely hope to get in touch with you soon :)

Also re: narrowing down what's causing the drop, that's definitely on the roadmap. We know teams have playbooks of things to check when they know something looks wrong, so we should be able to productize & automate this

slap_shot · on March 8, 2020

Congrats on the launch. This is a really interesting space that I think has a ton of potential - I'm watching pretty closely to see what comes out of it.

Have you heard of Outlier (https://outlier.ai)? Do you have any thoughts? How does Orbiter compare to Outlier?

(I haven't used Outlier but see it come up in anomaly detection discussion a lot recently).

zhangwins · on March 8, 2020

Thank you! There's definitely a lot of growth and potential in this space and we're really excited too. We're focused on intelligent monitoring and alerting for metrics that the user cares about & defines. We also automate the diagnostic playbooks that teams use today after detecting an issue (eg check data, check user segments, check geographies, etc.) Outlier seems to focus on insights and less on monitoring/alerting. They comb through data to surface 4-5 "unexpected insights" about your customers or business every day in a FB feed-type product.

ares2012 · on March 8, 2020

Hi! (Founder of Outlier.ai here) You are right, our platform is designed to produce the most important insights from massive amounts of data, without requiring human supervision/configuration. It is most useful in applications when there is too much data to set up guardrails, or the teams don't know what guardrails to create. Our typical customers are very large consumer businesses who have data spread across dozens of systems and need to ensure they never miss important emerging trends or problems.

We are not an alerting or monitoring system, so I don't think you'd use us for the same applications as Orbiter. The typical users of Outlier are the business users ranging from executives to business operations who want to make sure they are asking the right questions about the business.

Orbiter looks like a great product, good luck in building your business!

knightelvis · on March 8, 2020

Without looking into the details of your solution, what's the difference between your solution and CloudWatch anomaly detection? https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitori...

zhangwins · on March 8, 2020

Orbiter anomaly detection is for any DB (e.g. Postgres, Snowflake) and metrics that business/product teams tend to track such as transaction conversion %, user growth, add item to basket %, etc.

Amazon CloudWatch anomaly detection is for AWS resources & apps, and covers infra metrics like resource utilization, app performance, ops health.

In terms of the anomaly detection capabilities -- both are using similar machine learning processes to detect metric issues automatically!

P.S. If you get curious about the details of our solution, we have a 2 minute video demo ;) Cheers! https://www.youtube.com/watch?v=R7P_M6j0P2A

knightelvis · on March 9, 2020

Thanks for replying. That's a good demo. However, I don't necessarily agree CloudWatch is only for infra metrics. Theoretically, you could send any metrics to CW and leverage the anomaly detection feature. Given that it aggregates data over time and you could lose granularity in your data, that's probably not a good idea for business-centric data. Then I found AWS QuickSight (https://aws.amazon.com/quicksight/features-ml/?nc=sn&loc=2&d...), which seems to have similar features?

dodata · on March 7, 2020

Congrats on launching! Looks very helpful!

As a data scientist, I found that a drop in metrics was just as often due to a data pipeline issue as it was an actual business problem. This unfortunately causes business users to lose trust in the metrics quickly. How do you plan to differentiate between those two root causes of metric changes?

zhangwins · on March 7, 2020

Ah I can empathize with you here (as a former DS) -- we had incidents in the past that were data pipeline / instrumentation changes causing bad data which then caused metric drops (versus a real product issue, but they nonetheless caused a loss of confidence in data).

We think there are a number of diagnostic features that could be helpful here (to be built!). Teams today run playbooks to root cause issues when metric drops happen. We should be able to take that playbook and automate it. Say, Orbiter identifies an abnormal change in Metric X. The team is then probably analyzing sub-funnel metrics Y and Z, or looking at various dimension cuts to isolate the issue. Maybe they're also checking data quality by comparing the count of event volume vs. count of user IDs vs. count of device IDs, etc. If we run all of these diagnostic checks when Metric X drops, we could give the team insight into what we know is OK vs. not OK.
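As a purely illustrative sketch (this is not our implementation, and the names, count keys and the 20% tolerance are all made up), an automated data-quality step in such a playbook could look like this in Python:

```python
def ratio_shift(baseline, current, numerator, denominator):
    """Relative change in the ratio numerator/denominator versus the baseline."""
    base = baseline[numerator] / baseline[denominator]
    now = current[numerator] / current[denominator]
    return now / base - 1.0

def run_playbook(baseline, current, max_shift=0.2):
    """Run each check and report 'OK' or 'NOT OK' per check.

    baseline / current: dicts of raw counts for a normal period and for the
    period in which Metric X dropped. max_shift is a hypothetical tolerance.
    """
    checks = {
        "events per user": ("events", "user_ids"),
        "users per device": ("user_ids", "device_ids"),
    }
    report = {}
    for name, (numerator, denominator) in checks.items():
        shift = ratio_shift(baseline, current, numerator, denominator)
        report[name] = "OK" if abs(shift) <= max_shift else "NOT OK"
    return report
```

If events per user collapses while users per device stays flat, that points towards instrumentation or pipeline loss rather than users actually leaving.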

sanabriarenato · on March 7, 2020

That's really cool! Besides identifying abrupt changes in metric X, for me the most difficult part is trying to understand what caused this change in X. Great to know that you have this issue on the roadmap, but do you think it's possible to develop a model/automation that is generic enough to be used in different businesses? Maybe analysing the correlation between different time series could be a way to go?
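A rough Python sketch of the correlation idea, just to show the shape of it. The function names are my own, and it uses a plain Pearson correlation over equal-length series with no lagging or detrending.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation of two equal-length series (0.0 if either is constant)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)

def rank_related(target, candidates):
    """Rank candidate series by absolute correlation with the target series.

    candidates: dict mapping series name -> list of values.
    """
    scored = [(name, pearson(target, series)) for name, series in candidates.items()]
    return sorted(scored, key=lambda item: abs(item[1]), reverse=True)
```

This only says which series move together with X, not which one caused the change, so it would be a way to shortlist where to look rather than an answer.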

tixocloud · on March 8, 2020

It’s definitely possible if you have the underlying data definitions so you’re not having to compare time-series across industries (it’ll be hard because every single business’ metrics could be so different based on the way the metrics themselves are set up).

Avora (https://avora.com/product/) and ThoughtSpot (https://Thoughtspot.com) both have the root cause capability.

arciini · on March 7, 2020

This is really cool! Our search-engine-based impressions dropped substantially in early Feb. and because we didn't have that in our main dashboards, it took us almost 2 weeks to discover that. Orbiter would've been pretty useful for that - got in touch!

zhangwins · on March 7, 2020

Thanks! Looking forward to getting connected. We've heard SEO-specific use cases come up with some of the other companies we've worked with too -- you basically need to find out the exact time that your SEO ranking saw a material change, 'cause it's usually driven by something that shipped at that time. Otherwise it takes a long time to get back the traffic from GOOG.

photonios · on March 7, 2020

This sounds really cool! I've wished for something like this many times. I am mostly attracted by the fact that it would be mostly automatic. I am hoping it lives up to the hype.

Signed up for the beta. All the best!

DEADBEEFC0FFEE · on March 8, 2020

I've been talking with AppDynamics, and much of what you have said in this thread could have been said by AppD. Are you hoping to get some of their market?

longtermd · on March 9, 2020

I really like the manual configurability. In our startup, we work a lot with influencers and it's very common for us to have strong spikes in signups (and also high/low CVR for different quality of influencers and "strength of promotion"). Spikes of this nature would cause a purely ML model to constantly shout alerts.

arzel · on March 7, 2020

I saw y’all on PH and immediately submitted to get early access. Super excited to try it out, and congrats on the launch!

zhangwins · on March 7, 2020

Woohoo! Thanks so much. Looking forward to getting in touch soon :D

BlackJack · on March 8, 2020

Congrats on the launch! This is definitely a real, big problem.

thedrake · on March 8, 2020

This is what Maxly.com used to do. They learned several valuable lessons and it would be worth talking to one of the founders.

zhangwins · on March 8, 2020

Thanks for the tip - haven't heard of Maxly before but will do some research

tixocloud · on March 8, 2020

Congrats on the launch. We’re in an adjacent/overlapping space with drift detection/model monitoring so it’s always very exciting to see automatic data monitoring tools come into place. We’re hoping that as more and more startups come onboard the better it is for all of us. Cheers and best wishes!

zhangwins · on March 8, 2020

Thank you! We're actually a team of Canadians too (but have been living/working in SF Bay Area) :D Always great to see more applications for data science - best wishes to you too!

djiddish98 · on March 7, 2020

Go Winston! So much better than trying to do this in Tableau