
Facebook Workplace co-founder launches downtime fire alarm Kintaba - quartz
https://techcrunch.com/2020/02/10/kintaba-incident-alerts/
======
jedberg
At Netflix we built internal tools and systems that basically did all this
stuff, so this looks interesting as something that works out of the box.

One of the nicest things was that we used a single deployment tool for almost
all deployments, and it could insert the deployments into the timeline so we
could see every deploy both during and before the incident.
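A minimal sketch of merging deploys into an incident timeline, as described above. The `Event` type, the `build_timeline` helper, and the 2-hour lookback are hypothetical choices for illustration, not Netflix's actual tooling:

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Event:
    at: datetime
    source: str  # hypothetical labels such as "deploy" or "chat"
    text: str


def build_timeline(incident_events, deploys, incident_start,
                   lookback=timedelta(hours=2)):
    """Merge deploy events into an incident timeline, oldest first.

    Deploys from `lookback` before the incident start onward are kept, so the
    timeline shows deploys both before and during the incident. The 2-hour
    default is an arbitrary assumption.
    """
    window_start = incident_start - lookback
    relevant_deploys = [d for d in deploys if d.at >= window_start]
    return sorted([*incident_events, *relevant_deploys], key=lambda e: e.at)


t0 = datetime(2020, 2, 10, 12, 0)
deploys = [
    Event(t0 - timedelta(hours=5), "deploy", "api v41"),  # outside the lookback
    Event(t0 - timedelta(minutes=20), "deploy", "api v42"),
    Event(t0 + timedelta(minutes=10), "deploy", "rollback to v41"),
]
chat = [
    Event(t0, "chat", "errors spiking"),
    Event(t0 + timedelta(minutes=15), "chat", "recovering"),
]
for event in build_timeline(chat, deploys, t0):
    print(f"{event.at:%H:%M} [{event.source}] {event.text}")
```

The deploy from five hours earlier falls outside the lookback window, so the printed timeline starts with the 11:40 deploy of `api v42`.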

But the biggest issue was voice. We always had a conference call running for
incidents because some people would be in a place where they couldn't open
chat (driving to the office or similar) and it was always a pain to integrate
the voice data.

We got to a point where a recording of the incident call and a transcript
could be added to the incident, but during the call the Call Leader usually
had to quickly type the voice highlights into the chat.

I'd love to see a real time voice transcription as a feature. But it would
have to be pretty good to not just get in the way.

~~~
quartz
Conference line + transcription support is an awesome idea and something we
can likely add without too much trouble!

Netflix is definitely on the good end of companies with custom tooling in this
space. Would love to chat more about how you do things if you don't mind me
reaching out personally?

~~~
jedberg
Email is in my profile. Would love to chat with you!

~~~
startledmarmot
I'd also love to help out on this! We power exactly this kind of voice +
transcription + voice application platform stuff at SignalWire and it'd be rad
if there was a way we could help get this off the ground.

------
ryanjodonnell
What does it mean to be a co-founder of Facebook Workplace? Aren't you just an
employee of Facebook?

~~~
duxup
I suppose it implies a leadership role where that department was somewhat
autonomous and operated a bit 'like' a founder of a startup. That's what I
would assume.

~~~
AgloeDreams
Exactly, much like Jon Rubinstein being the inventor of the iPod or Panos
Panay being the leader of the Surface Group.

------
duxup
Man, in that screenshot it LOOKS a bit like Facebook... too much IMO.

Anyway it's an interesting idea. I worked in a support center for decades,
and handling P1-type situations was a constant hassle for folks who were
technical: fielding demands for updates, giving updates to the dozen interested
parties (each of whom was competent in some ways, less so in others), managing
misinformation, and navigating the varying politics.

There were times when "oh man, my phone just went down" (I unplugged it).

That's no small thing to try to deal with, with software.

~~~
quartz
Kintaba cofounder here-- thanks for the feedback!

We really wanted the interface to feel comfortable for non-technical folks who
needed to stay up to date on incidents, so we focused on bringing as much of a
human element as possible into the dashboard. Hopefully we can find the right
balance of friendliness and informativeness. We certainly don't intend to
become a social network (but we would love it if we were as easy to use as
something like Slack or Discord)...

Over our careers we experienced the same pain you're describing with keeping
others updated and with the politics (and subpolitics and subsubpolitics) that
come out of major incidents. Luckily we also saw companies that did it right,
usually with custom-built tooling. Our biggest discovery was that the more open
you make the whole incident process, the better everyone understands the work
being done, and the fewer of those insidious little subpolitical conversations
happen where facts are skewed and people are blamed for process challenges.

It's definitely a big hill to climb.

~~~
duxup
Yeah we used to do the political stuff via email updates at a place I was at.

It wasn't the worst way but also not a "system" for sure.

Half the battle with the political stuff was really defining the problem in a
way that everyone outside engineering understood, and keeping everyone updated
on what was factually going on (not rumors or panic): what was known, what was
not known, what was happening, and when they would get their next update.

And that's not even the technical stuff, like the recording of conference
calls that folks here are asking about ;)
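The four things duxup lists (what is known, what is not known, what is happening, and when the next update comes) map naturally onto a fixed update template. A minimal sketch, assuming a hypothetical `StatusUpdate` structure; the field names and wording are illustrative, not taken from Kintaba or any other tool:

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class StatusUpdate:
    summary: str            # the problem, phrased for people outside engineering
    known: list[str]
    unknown: list[str]
    in_progress: list[str]
    next_update_at: datetime

    def render(self) -> str:
        def section(title, items):
            # An empty section is shown explicitly rather than silently omitted.
            body = "\n".join(f"  - {item}" for item in items) or "  - (nothing yet)"
            return f"{title}:\n{body}"

        return "\n".join([
            f"Summary: {self.summary}",
            section("What we know", self.known),
            section("What we don't know yet", self.unknown),
            section("What we're doing", self.in_progress),
            f"Next update: {self.next_update_at:%H:%M}",
        ])


now = datetime(2020, 2, 10, 14, 0)
update = StatusUpdate(
    summary="Checkout is failing for some customers",
    known=["Errors started at 13:40"],
    unknown=["Root cause"],
    in_progress=["Rolling back the 13:35 deploy"],
    next_update_at=now + timedelta(minutes=30),
)
print(update.render())
```

Always stating the next update time, even when nothing new is known, is what keeps stakeholders from chasing engineers for news in between updates.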

------
mherdeg
Yeah feels like there's a lot of room for startups in this space (Blameless
has a great demo, too).

No one wants to rewrite PagerDuty internally -- why are people all writing
their bespoke incident management, response, review, and reporting toolchains
internally too?

~~~
jedberg
> why are people all writing their bespoke incident management, response,
> review, and reporting toolchains internally too?

Because the good ones are tightly integrated with the rest of the internal
tooling. Using a 3rd party incident management tool usually means you have to
run your operations the way they expect to get maximum value. A lot of times
it's easier to build it yourself based on how you operate.

However, as the way people operate becomes more standardized, the third party
tools will become more useful.

~~~
mherdeg
Yeah, agree, this is probably the key sticking point: the 3rd party toolchain
doesn't do [some key thing] the stakeholders in my process have found
valuable. So I end up running some scenario analysis and find that the 3rd party
toolchain doesn't shave off enough developer time & ongoing support cost from
the rest of the operational work I'm going to have to do anyway to manage an
incident. And I load the vendor's web page every couple of months and
wistfully wish my process were uncomplicated enough to use their tools.

Here's hoping that the calculus changes either as these tools grow more robust
or, like you say, as people begin to manage their software systems in less
bespoke ways.

For example, during an incident we'd like to know "what changed between time
[X] and [Y]" (deployments, configuration, experiments, other service outages)
while we're trying to determine a root cause and fix the problem. And much
later, after the incident, we'd like to auto-compute a metric like "what is
the change success rate of [Y service / services owned by team Z]?".
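Both queries can be sketched over a plain list of change records. This assumes a hypothetical `Change` record and assumes postmortems already record which changes caused an incident; the definition of success rate used here (1 minus the fraction of changes that caused an incident) is one reasonable choice, not a standard the comment specifies:

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)  # frozen makes records hashable, so they can live in a set
class Change:
    at: datetime
    kind: str      # e.g. "deployment", "config", "experiment"
    service: str
    team: str


def changes_between(changes, x, y):
    """Answer "what changed between time X and Y" (inclusive on both ends)."""
    return sorted((c for c in changes if x <= c.at <= y), key=lambda c: c.at)


def change_success_rate(changes, caused_incident, *, service=None, team=None):
    """Fraction of in-scope changes that did NOT cause an incident.

    `caused_incident` is a set of changes that postmortems linked to an
    incident. Returns None when no changes match the scope.
    """
    scoped = [c for c in changes
              if (service is None or c.service == service)
              and (team is None or c.team == team)]
    if not scoped:
        return None
    failures = sum(1 for c in scoped if c in caused_incident)
    return 1 - failures / len(scoped)


d = datetime(2020, 2, 10)
changes = [
    Change(d.replace(hour=9), "deployment", "search", "Z"),
    Change(d.replace(hour=10), "config", "search", "Z"),
    Change(d.replace(hour=11), "experiment", "ranking", "Z"),
    Change(d.replace(hour=12), "deployment", "ranking", "Z"),
    Change(d.replace(hour=13), "deployment", "billing", "W"),
]
bad = {changes[1]}  # a postmortem blamed the 10:00 config change

print(len(changes_between(changes, d.replace(hour=10), d.replace(hour=12))))
print(change_success_rate(changes, bad, team="Z"))
print(change_success_rate(changes, bad, service="search"))
```

The hard part in practice is the `caused_incident` link, which is exactly the integration between incident tooling and change tooling that the comment goes on to describe.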

These aren't complicated concepts -- it's not like we're trying to reason about
causality with machine learning to reduce time-to-resolve for incidents!
Still, our incident-management tool will really behave better if it's aware of
our what-changed-at-time-X tooling and our incidents-caused-by-changes
reporting. If this is an external tool, yikes, now our incident tool has
sprung awareness of changes and reporting?!

------
chadlavi
I know all TechCrunch articles are paid placements but this one felt the paid-
placementest in a long time.

~~~
echan00
Not true at all. My previous startup was written about and we did not pay a
dime.

------
mjayhn
Being able to take a snapshot of the chat/repo pushes/CI+CD jobs/grafanas and
everything else going on during an outage and separate it by an incident
type/tag/hostname/etc for later reference (to write RFOs, to more quickly
solve it if it occurs again, to write documentation on, etc) is something I've
wanted for a while, so this seems really interesting.

I'm sure that stuff exists (I think Datadog sort of does this), but I've yet
to work anywhere that doesn't just create some #shtf-$date Slack channel, which
eventually gets lost in the black hole, because anything better is either cost
prohibitive or takes too long to get going while a fire is burning.

~~~
quartz
100%. Finding ways to identify the relevant data related to an incident is
something that requires a lot of additional integrations that we're actively
working on.

One thing Kintaba is really good at right now is wiring your slack channel
directly into the activity log so it's properly attached to the incident and
ultimately the postmortem. This helps avoid that channel getting lost over
time, but there's still lots to do for sure in organizing all that data to
make it more useful after the fact (one thing we currently support is #tagging
for quick search within the log).
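One way #tag search over an incident log could work is a small inverted index from tag to log-entry positions. This is a sketch under assumptions: the tag syntax (`#` followed by word characters or hyphens, case-insensitive) and the `index_tags` / `search` helpers are hypothetical, not Kintaba's implementation:

```python
import re
from collections import defaultdict

# A '#' not preceded by a word character, then the tag body (e.g. "#db", "#shtf-2020").
TAG_RE = re.compile(r"(?<!\w)#(\w[\w-]*)")


def index_tags(log_lines):
    """Map each lowercase tag to the indices of the log lines that mention it."""
    index = defaultdict(list)
    for i, line in enumerate(log_lines):
        # A set, so a tag repeated within one line is only indexed once.
        for tag in {t.lower() for t in TAG_RE.findall(line)}:
            index[tag].append(i)
    return dict(index)


def search(log_lines, index, tag):
    """Return the log lines carrying `tag` (with or without the leading '#')."""
    return [log_lines[i] for i in index.get(tag.lower().lstrip("#"), [])]


log = [
    "14:02 seeing #db timeouts on checkout",
    "14:05 failover started #DB #failover",
    "14:09 error rate back to normal",
]
idx = index_tags(log)
print(idx)
print(search(log, idx, "#failover"))
```

Building the index once and updating it as entries arrive keeps lookups cheap even when the log of a long incident gets large.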

------
gooeyblob
Looks interesting!

A couple notes:

- the verification email went to my spam folder on Gmail
- "acknowledged" is misspelled on this image:
  [https://kintaba.com/images/collab_splash.png](https://kintaba.com/images/collab_splash.png)

~~~
quartz
Yikes-- fixed the spelling, thank you!

Thanks for the spamboxing report. Seems gmail isn't a fan of us today...
working on it now.

~~~
ThePowerOfFuet
Get your SPF, DKIM, and DMARC up to snuff. Dmarcian is a good tool for that.
(No connection, just a happy customer.)
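A DMARC policy is published as a DNS TXT record at `_dmarc.<domain>` and consists of semicolon-separated tag=value pairs that must begin with `v=DMARC1` (RFC 7489). A minimal parser for sanity-checking such a record might look like this. Fetching the record over DNS is left out, and `example.com` is a placeholder:

```python
def parse_dmarc(record: str) -> dict[str, str]:
    """Split a DMARC TXT record into its tag=value pairs.

    Raises ValueError if a segment has no '=' or the first tag is not v=DMARC1.
    """
    tags: dict[str, str] = {}
    for part in record.split(";"):
        part = part.strip()
        if not part:
            continue  # tolerate a trailing ';' or empty segments
        name, sep, value = part.partition("=")
        if not sep:
            raise ValueError(f"malformed tag: {part!r}")
        tags[name.strip()] = value.strip()
    # Dicts keep insertion order, so the first key is the record's first tag.
    if next(iter(tags), None) != "v" or tags["v"] != "DMARC1":
        raise ValueError("record must start with v=DMARC1")
    return tags


record = "v=DMARC1; p=none; rua=mailto:dmarc-reports@example.com"
tags = parse_dmarc(record)
if tags.get("p") == "none":
    print("p=none only monitors: receivers are not asked to quarantine or reject")
```

`p=none` is a common starting point because it only collects aggregate reports (sent to the `rua` address); domains usually tighten it to `quarantine` or `reject` once SPF and DKIM are confirmed to pass for legitimate mail.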

------
jamisteven
I can see someone like ServiceNow or Splunk wanting to acquire this. Decent
idea for organizations that don't have existing frameworks around these types
of things. Where I see it not working is in places where 3rd party apps are
prevalent. We have many of these at my place of work (finance, 70k employees).
One of the top apps the business uses to make money is a total black box: when
something goes wrong with it there isn't much we can do besides send the core
dumps off to the vendor and wait for analysis, and no two incidents ever have
the same root cause.

------
wackget
You should know that Techcrunch URLs are blocked for anyone using uMatrix, as
per the following discussion:
[https://news.ycombinator.com/item?id=22228159](https://news.ycombinator.com/item?id=22228159)

Just making you aware, as the site causes lots of problems and is not GDPR
compliant. Top comment on the thread above: "Yahoo/Verizon is cancer and
should die in a fire."

~~~
lostlogin
Every thread has a “this website breaks x” section and usually I bypass it.
Breaking the ability to go back (iOS) is so gross and many sites seem to do
it. Why?

------
sbryant8
How does this differentiate from something like Opsgenie? They already have a
built-in incident command center that can notify everyone in an org, Slack
integration and postmortems. Would be interesting to see how all of the
features stack up to one another.

------
motakuk
There is one more concept in this field: embrace the collaborative nature of
incidents and work on them directly in Slack (implemented in
[https://amixr.io](https://amixr.io))

------
lainga
With former employees from Kinja and Altaba?

------
Jestar342
> Verizon Media works with select partners that do not participate in the
> Interactive Advertising Bureau's Transparency and Consent Framework, or
> Google framework. All our foundational partners require you to manage your
> choices directly through their privacy policies. Click on each partner below
> to access their privacy policy

Yeah. <closes-tab/>

------
lucb1e
This site seems to have some GDPR-violating, we-want-it-all compulsory
tracking wall, but from the comments, do I understand correctly that a
"downtime fire alarm" is a monitoring service that alerts sysadmins at home
when a service is down?

