
Slack was down - wgjordan
https://status.slack.com/
======
dpix
I hope they change the status page from "Incident" to "Outage". Too often
lately have I seen status pages (mainly Github) reporting incidents when a
feature or site is down completely. Seems like their SLAs are influencing the
reporting of their uptime...

~~~
pthomas551
SLAs and maybe even company politics. Incidents are politicized at a lot of
companies. Even if the official rhetoric says "when it comes to incidents we
don't blame people and there are no politics" the reality is often the
complete opposite.

~~~
birdyrooster
Perhaps an "outage" label causes the incident to trigger a more widely
attended post-mortem meeting and so they are careful not to over-use it.

~~~
blaser-waffle
"Incident" could mean several things.

Outages mean down, badly, and require RFOs. In my experience working at an ISP
we only did any sort of "blame-game" if customers demanded an RFO. We still
did post-mortems but usually in a more gentle, roundtable way.

------
Jedd
Heh. Slack isn't _down_ per se -- on their status page they do say there are
only problems with 2 / 10 components.

Messaging and Connections.

~~~
quickthrower2
That means down. What the f' else do they do?

~~~
Jedd
Quite.

Perhaps I should have been less subtle than opening with a 'Heh', and made it
clear the claim that _only_ the messaging & connection parts were failing was
tongue-in-cheek.

This page's title is "Slack is down" but Slack's status page at the time was
modestly conceding that "Something's not quite right".

~~~
interestica
It's interesting because the sarcasm came through for me and only because you
said "they do say" rather than "they say" - a subtle difference that is
probably odd to convey to non-English speakers.

~~~
labster
Oh, that _is_ interesting. I understood the sarcasm there as well but it's
difficult to say why it is present. I guess because "do" implies a second,
unsaid category "do not". I wonder if there's a Language Log entry on this.

------
fabianhjr
Since no one can connect there are no messages to deliver and therefore no
message errors, therefore everything but connections is up and working. /s (The
Slack Status Page)

------
mepiethree
how do people at Slack communicate to fix an outage when Slack is down?

~~~
manquer
Ideally your DR and emergency response systems should not be hosted on your
own infra or have circular dependencies, e.g. Amazon should not be hosting
status page assets on S3 as they were.

At the very least they should be hosted in isolation from your production
service components at all layers, so both don't go down at the same time.

For Slack my guess is their internal Slack is not hosted on Slack prod and
hopefully also does not share components below L4 as well (DNS, IP, domain,
routing etc.)

~~~
jeffbee
If the disaster involves your connectivity from within your infra to without,
it could be advantageous to have backup communications hosted internally, for
example an internal IRC for panic situations. The advantages of this might vary
depending on whether your response happens from the office, home, or wherever.
Pandemic might have changed the equation.

------
jshevek
From what I have seen on [https://downdetector.com](https://downdetector.com)
today, I suspect this will prove to at least partly be on the AWS end.

~~~
ipsum2
Xbox Live and YouTube probably don't run on AWS, but also experienced a peak.
This is a neat site, thanks for sharing it.

------
samcheng
Looks like it's back up, at least partially. They're making progress...

It's crazy how easy it is to depend on Slack. Email collaboration feels
downright archaic. What a great product!

~~~
burgerzzz
Seriously, I just realized how much it's a part of my workflow as an engineer
with the different integrations we've set up.

------
samrohn
If anyone is curious about their deployment process
[https://slack.engineering/deploys-at-slack-
cd0d28c61701?gi=e...](https://slack.engineering/deploys-at-slack-
cd0d28c61701?gi=efefc4866bb1) Seems hard for a new deployment to cause a
global outage since they do a phased deploy. Could be some network glitch
causing this issue. We never know until they come up with an update.

~~~
dangwu
Everywhere I’ve worked has done phased deployments. Shit still goes wrong all
the time. Also, I would wager that most outages are not due to code
deployments, but live config changes.

~~~
samrohn
Makes sense. I recollect some of the postmortem reports I read where live
config changes triggered cascading outages.

------
juancampa
Reminder: have a secondary form of real-time communication for your team (e.g.
Telegram).

What is everyone using?

~~~
olyjohn
Disagree. You will spread out the communication and cause problems. People
will use their preferred client instead of the one that everybody should be
using as the primary. You also have to manage accounts and permissions for
both. Unless you like people sending information around to their personal
accounts, which I'm sure a lot of places do.

E-Mail is good enough to let people know that something like Slack is down.
It's close enough to real-time to be used as a backup for something like this,
and everybody already has an e-mail address.

My company already has Slack, HipChat and Teams, and it's a communications
nightmare. People are constantly confused and asking for which chat client one
manager uses, while others don't know where to go for the support channels.

Maybe this sounds like old-school IT, but the new-school "set everything up
ad-hoc using your personal shit" just sets you up for disaster down the road.

~~~
BryantD
You couldn’t do this with Telegram, but if you were using Hipchat or Teams or
Mattermost or Discord as your backup, you could just keep the instance locked
down most of the time. Open up permissions when Slack is down.

~~~
zenexer
You can do that with Telegram if you have a bunch of centrally-managed groups;
simply mute them until you need them. Admins can unmute them as necessary.

That being said, I don’t think I’d want Telegram to be my failover. They have
a tendency to experience regional issues with great frequency, and if you’re
in a lot of active chats, things start to get buggy.

------
wgjordan
slack.com returning 503 Service Unavailable.

See
[https://status.slack.com/2020-05/87ad1d12e36fcf0c](https://status.slack.com/2020-05/87ad1d12e36fcf0c)
for incident details:

> [May 12, 4:53 PM PDT] Users have reported general performance issues such
> message sending failures and timeouts. We’re working to get things back to
> normal as quickly as possible and will provide an update shortly.

------
_eht
Freenode is up. FYI.

~~~
kabacha
so is matrix.org, wait a second — you could have your own matrix server or
even multiple ones! Centralized connections are destined to fail.

------
franciser
Just a reminder, SLA doesn't apply to you if you are a free user. Folks, stop
complaining about their downtime if you didn't even pay for the service. It's
just like free food, don't complain about the taste if you get it for free.

------
manquer
This is a big problem for teams using Slack for their own DevOps monitoring
and ChatOps. DR plans have to start considering how to get all the integrations
working with multiple communication providers to handle this type of scenario.
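
A minimal sketch of that multi-provider idea, assuming each integration can be
wrapped as a plain callable that raises when delivery fails. The names
`notify_with_failover` and `Sender` are hypothetical and not part of any real
chat provider's API; a real setup would plug in its own webhook or email
clients.

```python
from typing import Callable, Iterable, Optional, Tuple

# Assumption: a "sender" is any callable that delivers one message and
# raises an exception if the provider is unreachable.
Sender = Callable[[str], None]


def notify_with_failover(
    message: str, senders: Iterable[Tuple[str, Sender]]
) -> Optional[str]:
    """Try each (name, sender) pair in order.

    Returns the name of the first provider that accepted the message,
    or None if every provider failed.
    """
    for name, send in senders:
        try:
            send(message)
            return name
        except Exception:
            # This provider is down or rejected the message; try the next one.
            continue
    return None
```

Ordering the list puts the primary chat first and the fallback (email, a
second chat service) after it, so alerts keep flowing when the primary
returns errors.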

------
benatkin
Now might be a good time to add new custom emoji, because that service doesn't
appear to be down.

[https://cultofthepartyparrot.com/](https://cultofthepartyparrot.com/)

~~~
EdwardDiego
I love how one horny kakapo became a legend, never change Internet. (His
lovelorn antics also raised a bunch of donations to kakapo conservation when
it went viral, ka pai, Sirocco)

------
pmccarren
I'd appreciate a thorough postmortem on this one to satisfy my curiosity.

------
quickthrower2
Had to email the team to use Zoom. Ugh! Will anyone turn up I wonder :-)

------
foobarbecue
When I first got a 503, the status site was all green. I checked back for at
least 4 minutes and it was still green. Finally it started to change. Is that
sort of lag normal?

~~~
dmitrygr
It is manually updated, by humans. Humans have high latency. It is not "agile"
or "modern" or "hip" to have automated status testing/reporting. A human had
to go and change that status page.

~~~
taurath
When you're a public company the risk of your automated system saying
something that invokes more fear than is necessary gets a lot higher.

~~~
_jal
Yep. I would be surprised to learn of non-infrastructure companies that wired
their public outage notice directly to automated monitors with no human
confirmation.
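
A rough sketch of the "automated monitor plus human confirmation" arrangement
described in this subthread. Everything here (the `StatusPage` class, the
threshold of 3 failed probes, the status strings) is an assumption for
illustration and not how Slack or any other company actually runs its status
page.

```python
from dataclasses import dataclass, field


@dataclass
class StatusPage:
    """Public status only changes after a human confirms a proposal."""

    failure_threshold: int = 3          # assumed number of failed probes
    public_status: str = "operational"  # what visitors see
    pending_proposal: bool = False      # raised by the monitor, not public
    _consecutive_failures: int = field(default=0, repr=False)

    def record_probe(self, ok: bool) -> None:
        """Feed in one automated health-check result."""
        if ok:
            # A success resets the failure streak; an already-raised
            # proposal stays pending until a human handles it.
            self._consecutive_failures = 0
            return
        self._consecutive_failures += 1
        if self._consecutive_failures >= self.failure_threshold:
            self.pending_proposal = True

    def confirm(self, status: str = "outage") -> None:
        """Called by a human; publishes the status if a proposal is pending."""
        if self.pending_proposal:
            self.public_status = status
            self.pending_proposal = False
```

The gap between `pending_proposal` becoming true and someone calling
`confirm` is the lag foobarbecue noticed: the monitor can know within seconds,
but the public page waits for a person.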

------
slajax
Does IRC still exist?

~~~
slajax
Freenode, here I come.

~~~
lordraj
You should try matrix, on riot.im

------
faCeti0us
It's a Cogent network issue...

~~~
jpxw
Hm, I don’t see anything on [https://ecogent.cogentco.com/network-
status](https://ecogent.cogentco.com/network-status)

~~~
faCeti0us
Our Cogent support rep said there's an outage.

"We are currently experiencing an outage on our network. We are working to
resolve this as quickly as possible. Please check
[http://status.cogentco.com](http://status.cogentco.com) for updates on this
event."

------
MeetingsBrowser
No one on my team can send messages at all. Anyone else having any success?

------
koverda
having trouble sending messages, some aren't making it through

------
zenexer
Any chance we’ll get a post mortem on this? I’ve got a bet to win.

------
dmitrygr
api.slack.com is returning 503s

503 Service Unavailable No server is available to handle this request.

And thanks to the Electron-ness of the app, on the Mac you cannot even drag
the window - the title bar is drawn by JavaScript and without it, it cannot be
dragged.

~~~
tonyaiken
I can drag it just fine with 503 screen. I’m on macOS Catalina slack 4.5.0

------
jorblumesea
Oh dear: [https://imgur.com/a/5aQkRvv](https://imgur.com/a/5aQkRvv)

