
Slack – Degraded service affecting multiple features - kimi
https://status.slack.com/2019-06/9f63d8e30ee85f46
======
biztos
Just yesterday I was musing that if I were King of the (World|Company) I'd
want an open-source Slack-alike that I could just drop into the Cloud of my
choice and operate entirely within my private network, subject to my own
access control just like other internal services, and with full access to all
message histories in whatever database-like thing it uses in its Cloud. Sure,
I'd still have a SPOF but it's game over anyway if my Cloud goes dark.

Is there such a project, and if so does it have any traction in the real
world?

(Oh and ditto for every damn thing Atlassian owns. Yikes.)

~~~
thg
[https://mattermost.com/](https://mattermost.com/)

~~~
asah
We use this at my company - perfectly reasonable UI, don't know about the
APIs/integrations, which I assume are way behind Slack...

~~~
lwhalen
The Mattermost API is a superset of Slack's. "Most" integrations "just work",
the rest need a minimum of tweaking to DTRT.

~~~
zild3d
Mattermost API being a superset of Slack's seems to be mostly true, but
looking at the features our Slack app uses I don't see Block Kit
([https://api.slack.com/block-kit](https://api.slack.com/block-kit)) or
message scheduling. If we were to integrate with Mattermost it doesn't look
like it'd be a ton of work, though.

Have to say Mattermost's API docs are so much easier to read than Slack's.

~~~
icelancer
Slack's API documentation and handling of accounts/bots is so obscure. I've
never seen anything that bad from a major company.

------
nielsole
Best thing is that some messages go through despite returning 500s. This leads
to our integrations spamming our company's channels with redundant messages
while retrying. :/

~~~
cabaalis
Ouch, that one hurts. I don't know the Slack API, but the contract expectation
in the call should absolutely be "500 means the message failed to send".

Tells me there's some subsystem throwing the error, not the actual messaging
(or, if intermittent, something wrong with the distribution.) What do you want
to bet it's some kind of new analytics?

~~~
bdonlan
It's not possible to always ensure that you report whether an operation
succeeded. Consider a timeout - you waited 5 seconds for the backend to
respond and heard nothing - was it, or will it be successful? You don't know.

Typically a 500 should be considered as a "may or may not have succeeded", and
steps taken to deduplicate operations if necessary.
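
A minimal sketch of that deduplication idea, assuming the server accepts a client-chosen idempotency key. `FlakyChatServer`, `post`, and `send_with_retry` are hypothetical names for illustration only; nothing here is Slack's actual API. The client reuses one key for every retry of the same logical message, so a 500 that arrives after the message was already stored does not produce a duplicate.

```python
import uuid


class FlakyChatServer:
    """Hypothetical backend that may store a message and still answer 500."""

    def __init__(self, fail_first_n):
        self.delivered = []        # messages visible in the channel
        self.seen_keys = set()     # idempotency keys already processed
        self._failures_left = fail_first_n

    def post(self, text, idempotency_key):
        # Deduplicate: a key seen before is acknowledged, not delivered again.
        if idempotency_key not in self.seen_keys:
            self.seen_keys.add(idempotency_key)
            self.delivered.append(text)
        # Simulate an error raised *after* the message was stored.
        if self._failures_left > 0:
            self._failures_left -= 1
            return 500
        return 200


def send_with_retry(server, text, max_attempts=5):
    """Retry on non-200 responses; return the number of attempts used."""
    key = str(uuid.uuid4())  # one key per logical message, reused on retries
    for attempt in range(1, max_attempts + 1):
        if server.post(text, key) == 200:
            return attempt
    raise RuntimeError(f"gave up after {max_attempts} attempts")
```

With a server that fails the first two calls, the client makes three attempts but the channel shows the message once. Without the key, the same retry loop would post it three times, which is the spam described upthread.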

------
yyyk
Is it just me, or do those outages seem far more common lately? Or maybe this
is happening just as often as it used to, but now we're growing more dependent
on cloud-based software so this is reported more prominently?

~~~
rnicholus
both - have a look at Slack's status history. They seem to have an outage of
some sort every month

------
lqet
Can someone explain to me why any company should outsource something as
critical as internal communication to a company that is 5 years old? The video
messaging left aside, is it really _that_ much better than time-tested solutions
like email or just running an IRC server?

~~~
Thaxll
IRC is pretty much useless since you don't have history. Comparing Slack and
IRC is like comparing TCP and HTML5. Why IRC is pretty bad in enterprise:

- no audio / video

- no history

- no SSO / AD integration

- no rich features like links, pictures, document uploads, etc.

- no easy bot integration with APIs; with Slack I can have Datadog pushing
things (alerts, graphs...) to a channel

- no archiving channels on the go; Slack is very useful when you have an
incident: you create a temp channel and then archive it when the incident
is closed.

~~~
gerdesj
Matrix does all that and can be self-hosted - ours is.

~~~
steve19
Does the open-source Matrix server scale? I may be wrong but I got the
impression it was more an example implementation and was not used by Matrix in
production?

~~~
Aaronn
There is no closed source Matrix server implementation that I'm aware of.
Synapse is the server used in production for matrix.org, the French government
deployments, and tons of other servers including my own. It is somewhat
resource intensive but it can scale using workers.

There are of course several other server implementations in progress but none
are fully ready yet.

------
exabrial
One problem with Slack we have in the finance world is that _literally
everything_ employees say on company assets needs to be recorded and readily
accessible for audit (No this is not an invasion of privacy, you're on company
time on company systems, we don't peer into your personal phone or anything).

Unfortunately, Slack does not do this very well. I'd really like to know if
there's another service that _does_ do this well.

~~~
fancy_pantser
Zulip and a very big disk.

~~~
exabrial
Looked at this, appears to make it difficult by default:
[https://zulip.readthedocs.io/en/latest/production/security-m...](https://zulip.readthedocs.io/en/latest/production/security-model.html)

~~~
rishig
My guess is most of the major chat players handle this pretty easily -- it's a
non-starter for many potential users if you don't. Zulip stores all data
indefinitely (including message edits, etc) in its default configuration.

------
pmlnr
There are reasons why email is async and supposed to be decentralized, with
priority-based fallback options (the priority in MX records). Most email
servers try to deliver to fallbacks, and even if that fails, will keep trying
to deliver for _days_.

For all people who think Slack can replace email, think about these
safeguards.

~~~
thejohnconway
Isn't it better in nearly every situation to know something has failed
immediately, rather than get a delivery failure notice a couple of days later?

~~~
pmlnr
If it's a temporary outage, e.g. network loss of a datacenter, email is not
lost. It will be delivered with a delay, tried over and over again, until the
timeout limit is reached. Due to the async nature the delay is not a problem.
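
A rough sketch of the fallback-and-retry behaviour described in this subthread, assuming a caller-supplied `try_host` callback that stands in for an actual SMTP delivery attempt. The function names and hostnames are hypothetical. Hosts are tried in ascending MX preference (lowest number first), and if every host fails the whole round is repeated later. A real mail server would wait with increasing backoff between rounds and give up after a configured timeout; this sketch only counts rounds.

```python
def attempt_delivery(message, mx_records, try_host):
    """Try each MX host in ascending preference order.

    mx_records is a list of (preference, hostname) tuples.
    Returns the hostname that accepted the message, or None.
    """
    for _preference, host in sorted(mx_records):
        if try_host(host, message):
            return host
    return None


def deliver_with_retries(message, mx_records, try_host, max_rounds):
    """Repeat full delivery rounds until one succeeds or rounds run out.

    Returns (accepting_host_or_None, rounds_used).
    """
    for round_number in range(1, max_rounds + 1):
        host = attempt_delivery(message, mx_records, try_host)
        if host is not None:
            return host, round_number
        # A real MTA would sleep here with growing delays before the next round.
    return None, max_rounds
```

If only the primary is down, the first round already lands on the backup host. If everything is unreachable for a while, the message stays queued and is delivered in a later round, which is the delay-instead-of-loss property being argued for.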

------
tamalsaha001
It is Friday evening somewhere!

~~~
miyuru
it is Friday evening in South Asia

------
kevcampb
If it's any consolation, their status board has actually been reporting errors
(hi cloudflare)

~~~
jgrahamc
You're implying we don't update the status board when there are errors?

~~~
kevcampb
The recent outage was very very delayed. I stand by my comment.

Also don't get me wrong, you do a great service. It's just a pet peeve that it
seems invariably status pages are a lie.

~~~
kevcampb
Let me quote from the previous HN discussion

"""sauldcosta 4 days ago [-]

We use downdetector.com because status pages tend to take up to an hour or so
to update, if they ever do.

jgrahamc 4 days ago [-]

1042 UTC First alert of global traffic problem
1057 UTC Internal group chat room up and running
1102 UTC Status page updated

So, first alert to status page was 20 minutes."""

In those 20m we had repeatedly checked your status page, realised it was our
issue and started pulling engineers to deal with it as per procedure. People
are on call, it's highly disruptive.

Surely you knew within those 20m that something was up?

In the end we realised it wasn't our issue because we checked Twitter.

Edit - added quotes

~~~
jgrahamc
Yep. I tend to agree that we could have gone faster. The difficulty is that
getting clear information out fast when you are dealing with a difficult
situation is hard.

But, I guess we could have put some status up quicker.

~~~
kevcampb
Thanks for the acknowledgement. Appreciate it's hard, false positives etc.
Glad you see our PoV.

~~~
jgrahamc
I do.

I'm irritated, however, that you picked on Cloudflare as a bad actor here when
we strive to be transparent and quick to get out full information whereas
others (e.g. Amazon) are slow as mud.

But I get that being on the other side of this is difficult when you don't
have information.

~~~
kevcampb
Amazon, Google, seem far far worse. I wasn't intending for my post to be
representative, just that the CF outage is recent and affects me more
personally than the others atm. Also, I had zero expectation my comment would
actually be read, this is quite a surprise.

If it's any incentive, if the status page had any inkling that something was
wrong, I'd be first on here posting "omg their status page is real".

~~~
jgrahamc
Fair enough. Thanks for making the comment. Made me think about how fast we
act in terms of public status pages. And sorry that route leak affected you.

------
jumbopapa
No outage for me, but I'm seeing some messages appearing twice. Is that
related?

~~~
koenigdavidmj
My org is having a few people see that, or messages getting silently dropped,
or failing to edit an existing message.

~~~
jumbopapa
Okay, yeah. I'm seeing some of these issues now.

It's odd because the message said it failed to edit, but my client still shows
the message as edited.

------
sidcool
If a Slack outage brings it to the top of Hacker News, I think that's a compliment!

------
TelmoMenezes
In other news: an unexplained productivity spike has been recorded today
across the tech industry.

~~~
darkwater
Or not. I'm remote and I cannot ask questions nor coordinate action to solve
live production problems due to this outage. Slack is becoming a SPOF for many
organizations, especially distributed ones.

~~~
majewsky
Same here. I'm on support duty and have my own outage to attend to, and that's
really really hard when you cannot communicate with your experts.

~~~
jt2190
Forgive me, but just to clarify... There is no “fallback” communication
channel for your team to use?

~~~
wil421
There’s email, phone, WhatsApp, group SMS, some form of WebEx or Zoom, or
even a shared Google Docs spreadsheet with tasks for each outage.

If Slack is your only form of communication with your team it’s time to
rethink your support structure and DR plans.

Edit: It’s very unhelpful if I point out issues and don’t provide solutions.
PagerDuty has a great article about Incident Response[1].

[1][https://response.pagerduty.com/](https://response.pagerduty.com/)

------
psnosignaluk
You mean I have to open OpsGenie and Grafana now instead of waiting for a ping
in the ops channel???

Also, how many IT service desk ticket systems just lit up with people asking
them to fix Slack?

------
oxfordmale
Have the CTO and key technical staff retired after the Slack IPO? :-) In the
UK there is no real outage, although Slack struggles sending some messages and
this can result in some messages appearing twice. This problem seems to be
worse when communicating with offsite staff, onsite messaging seems to be
fine.

------
gonzo41
They should call this sort of event a 'Slackening'

------
dillonmckay
You must not be on the $12/user/month plan.

------
jzig
By accident I noticed that you can add as many characters as you want to the
end of the url and it will still load that page. Neat.

------
notacoward
Came here for the "Slack shouldn't exist anyway" comments and was not
disappointed. Or rather I was, but in a different way.

Best wishes to all of the folks at Slack and those who depend on it.

~~~
mstg
Honestly, the same. It's tiring that people just have to suggest self-hosting
IRC or any other software all the time when Slack is mentioned. It's a good
way to communicate and it's easy. Hope they resolve it, overall a bad
situation.

------
jbverschoor
So now everybody can get back to $WORK

------
jbverschoor
Wooo productivity!!

------
lokimedes
A resurgence of IRC is nigh!

------
valerij
reminds me of that one time a genius at IBM created an
`ibm-global-announcements` channel and force-invited 200'000 people in it, and
then some guy `@channel`d and all IBM's slack workspaces were down for 30 minutes

[https://status.slack.com/ibm/2018-03/f01d4c22cd953dd7](https://status.slack.com/ibm/2018-03/f01d4c22cd953dd7)

[https://i.imgur.com/Rk6Kdgp.png](https://i.imgur.com/Rk6Kdgp.png)

EDIT:

also that channel made using slack impossible for MacBook Air users, they had
around 80% cpu usage for slack. so basically the entire marketing and PM part was
unable to work that day. developers' machines were wasting around 20% on slack.

after people started complaining in that channel, posting in it was limited to
admins only, but they didn't lock commenting. so, for approx 6 hours all of
IBM was posting memes in ＴＨＥＴＨＲＥＡＤ as we dubbed it, and @mentioning the genius
who created that channel. next day the channel was nuked, not even an archive
preserved.

some guy calculated that the entire affair, considering electricity prices, a
20% decrease in developer productivity, and so on, resulted in IBM losing
several million with that stunt

fun times

p.s. shout out to Martinj for that spicy jeff-coffee-mug meme

~~~
freehunter
There should be a way to turn off @here and @channel. It should be a user-
level setting, a channel-level setting, and an organization-level setting.
Even if my org or channel doesn't opt out, I should be able to.

~~~
twoquestions
I know you can in Discord, I'd be shocked if Slack doesn't let you do that.
_checks Settings_

Yes, you can mute a channel and suppress @everyone and @here if you want.
Still friggin nuts that a chat service can bring a modern machine to its
knees.

~~~
freehunter
Ah I knew you could mute it but muting a channel does not suppress @here. I
did more exploring and found it's a multi-step process to suppress @here vs
muting, but it is indeed possible.

Learn something new every day.

------
rooam-dev
Did they enable "replace email" feature?

------
QuickToBan
Why don't people use free software instead, e.g. a Matrix homeserver run via
[https://github.com/matrix-org/synapse/](https://github.com/matrix-org/synapse/)?
IRC without a BNC doesn't really work for a corporation due to disconnects.

~~~
pmlnr
Because it's way harder to get it running right.

------
mbesto
Meta - can we not post outages to HN?

I understand that post-mortems are interesting learning lessons, but if
you're curious whether S3 or whatever is down or not right at this moment,
then use the respective status pages of those services.

~~~
majewsky
HN is a more reliable status page than most status pages.

~~~
triplee
Unless Hacker News is down, in which case you need to give up and go outside.

------
buboard
Well, I'm not affected. But then again, I don't use it.

