It’s a fun problem to solve and one I’ve come across before when trying to alert on your monitoring tool being down, but slightly different when it’s your product.
Hopefully interesting if you’ve hit similar puzzles before.
Yeah, I found this really off-putting: it's not possible to have several goals that are all 'paramount', and the word 'seamless' adds nothing in any of the places it appears!
I wish it didn’t turn me off the content as much as it does but it’s very jarring.
Don't these people already have the option of running open-source tools like Netflix's Dispatch or Monzo's Response? If the open-source options exist, why would they pay for hosted services?
Thanks for posting this! I'm one of the engineers on the team who built On-call, so if you have any questions about how we did it (like how we handle redundancy, observability, etc) then I'd be well up for answering.
It's been an intense 9 months of work, but I think anyone on-call is going to really love this.
Not the author but a colleague of Milly's: it's been awesome watching the team build this and everything they've learned along the way.
I think these features are a great example of how AI can fit alongside expert human work without necessarily displacing it, just helping people with the tasks they'd otherwise be distracted by. Writing incident summaries is a great example: it's valuable work that should happen, but it should never have to be done by a human, and AI can remove the need for anyone to think about it at all.
There are loads of opportunities for other products to add features like these, and this post gives a bit of colour on how!
This sounds really awesome. I will note that I put this data stack together by myself in about a week, back when we were just ten people in the company.
Obviously very different resource constraints than Meta, so worth considering which situation you may be closer to when picking an implementation plan.
Data like this allows us to be extremely customer focused and helps direct investment for the business, such as which features we build and when.
We also use the same data pipeline to power a lot of our data product features which customers pay us for.
So it's extremely worthwhile as an investment for us. It's also why we have about five people hired into data-adjacent roles, as it's so key to running the business correctly.
Out of curiosity, would running DBT that outputs to a reporting schema on a Postgres read replica work? Or as a startup do you already have too much data for that?
That requires us to do some expensive cross-joining of every action ever taken in an incident, from messages sent to the channel to GitHub PRs being merged. We could make this incremental and optimise it for performance, but using BigQuery by default means we don't need to worry yet, and we can leave the optimisations for when we're bigger and the engineering resource we'd dedicate wouldn't detract as much from customer-focused work.
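For anyone curious what that kind of query looks like in practice, here's a rough sketch (not our actual pipeline, and all table/column names are hypothetical) of the "every action in an incident" rollup run against BigQuery from Python:

    # Rough sketch only: union every kind of incident activity into one
    # timeline, then aggregate per incident. Table and column names are
    # made up for illustration.
    from google.cloud import bigquery

    client = bigquery.Client()

    QUERY = """
    WITH actions AS (
      SELECT incident_id, sent_at AS occurred_at, 'slack_message' AS kind
      FROM analytics.slack_messages
      UNION ALL
      SELECT incident_id, merged_at AS occurred_at, 'github_pr_merged' AS kind
      FROM analytics.github_pull_requests
      WHERE merged_at IS NOT NULL
    )
    SELECT
      incident_id,
      COUNT(*)         AS total_actions,
      MIN(occurred_at) AS first_action_at,
      MAX(occurred_at) AS last_action_at
    FROM actions
    GROUP BY incident_id
    """

    # BigQuery happily rescans the full history on every run, which is what
    # lets you skip incremental models until data volume forces the issue.
    for row in client.query(QUERY).result():
        print(row.incident_id, row.total_actions, row.first_action_at)

On a Postgres read replica you'd likely want the same thing as an incremental dbt model fairly quickly, which is exactly the optimisation work we're deferring.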