Slack is offline (slack.com)
303 points by soroso on Oct 31, 2017 | hide | past | web | favorite | 122 comments

This is probably preaching to the choir, but hosting your own FOSS chat is nowadays a very viable way to avoid being dependent on a centralised service like Slack. Your options include:

* Riot.im / Matrix.org (decentralised global network; e2e encryption; open protocol)

* Rocket.Chat (Meteor-based; focus on UX and features)

* MatterMost.com (clone of Slack UI; open core license)

* Zulip.org (all about threads!)

* ...or indeed IRC or XMPP.

(disclaimer: I work on Matrix).

Is there reason to believe self hosting will have better uptime?

Uptime aside, you also have to consider the effect of a single point of failure when you self-host. If your in-house communication hub goes down when your site does, it's going to make firefighting that much worse and you'll pay for it in a longer outage.

Here at FB, a lot of day to day coordination takes place via FB products. But production and release engineering communication happens over IRC, especially during major outages. The fallback factor is critical to keeping the plane in the air.

People like to jump on one bandwagon or the other, but the real answer is: it depends.

With Slack, the application itself is probably pretty tough, but for a lot of businesses their infrastructure and connectivity TO Slack (ie internet/WAN) is probably not very resilient. So for a lot of smaller outfits I'd say that Slack is better.

But if you're a large org and your infrastructure is very resilient and diverse, then you're probably better off self-hosting - assuming you can leverage your existing infrastructure to do so.

yes. small scale is almost always simpler and less error prone than massive scale.

slack also has pretty dubious quality standards. e.g. their desktop app and their atrocious replacement for screenhero.

The biggest benefit of slack for me is their search. All messages are indexed ready to be searched. Code snippets, images, giphy, attachments, bots. It’s a whole ecosystem, not easily replicable with irc.

1. Self hosting doesn't have to operate at the scale of slack, so there's a whole slew of issues avoided. Pushing text messages around really isn't that difficult when you aren't serving millions of customers.

2. You can perform maintenance outside of office hours, with SaaS you don't get to decide when an upgrade (and potential outage) happens. I don't care about 99% uptime, I care about having 99% uptime while I'm working.

3. You can have backup services.

There's also a whole slew of issues you drive right into.

Such as? If you've got less than 1000 users then you need an extremely basic server, a raspberry pi should more than suffice. Then you've just got a little bit of manual (or automated) administration, software updates and backups mostly.

I really didn't expect my post to be so controversial. Is the HN crowd really so terrified of running their own hardware?

I'm guessing that you're being downvoted because there's a lot more to consider. I agree that it doesn't take much hardware these days (most single-board computers would work perfectly well) to service <1k simultaneous chat users with efficient server-side software (e.g. UnrealIRCd or ejabberd). However, to make it as reliable as Slack (99.99% monthly uptime is their SLA) for the price they offer it ( https://www.slack.com/plans ) would likely take considerable engineering effort. Sure, you could set it up, toss it in a closet, and it might have 100% uptime for a year...until it doesn't. If chat is business-critical, there are chat companies that have profit motive to deliver a good service. If chat is a nice-to-have at a company (and you e.g. don't have to worry about data retention laws / compliance stuff), maybe it's fine to run it on an rPi / t2.micro (free) AWS instance.

Luckily, there are a ton of great free and paid options out there these days!

For $6670 a month (price for 1000 users), I’m pretty sure most people here can spin up two VMs in two different colos, and setup IRC servers or whatever.

99.99% uptime means it can be down for a few minutes a month, so all it needs to do is fail over properly. In practice, it will probably have many more than 4 9’s.
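For a sense of scale, the "few minutes a month" claim is easy to check. A quick sketch of the arithmetic behind uptime percentages (assuming a 30-day month):

```python
# Quick check of what a monthly uptime SLA actually allows, in minutes.
def allowed_downtime_minutes(uptime_fraction: float, days: int = 30) -> float:
    """Minutes of downtime permitted in a month at a given uptime level."""
    total_minutes = days * 24 * 60
    return total_minutes * (1 - uptime_fraction)

for level in (0.99, 0.999, 0.9999):
    print(f"{level:.2%} uptime -> {allowed_downtime_minutes(level):.1f} min/month")
```

99.99% works out to about 4.3 minutes of allowed downtime per month, which is why fast, automatic failover matters more than raw hardware reliability at that level.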

I think the real reason slack does well is ease of client + service setup, the brain-dead UI, lots of feature creep that a few people care about, mobile clients, etc, etc.

I’m not a huge fan, but it could be worse. At least they didn’t leak everyone’s password like hipchat did.

$80,000 a year. For that sort of money you could hire an IRC developer full time and have them spend a day or two managing the company server.

How can anyone find a decent developer for $80k? Even before factoring in overhead and benefits.

In most of the world that will buy you at least a decent mid-level developer, and a great senior or two in many places. Even if it's below market, if this were my pet OSS project I'd happily take a pay cut to get more job satisfaction.

Generally yes, given the reasons others have said. Other than that, at the very least, outages can be dealt with more proactively when you have your own setup. Third parties won't have the same priorities that your company does.

Since Slack's main business is chat, they have a pretty good incentive to get everything working again ASAP. Here's their SLA for "plus plan" and Enterprise plan:

  Our Plus plan Service Level Agreement (SLA) guarantees a 99.99% monthly uptime.
  We’ve designed our SLA to be simple and transparent — based directly on the information we make publicly available on Slack’s System Status page.
  If we fall short of our 99.99% uptime guarantee, we’ll refund customers on the Plus plan 100 times the amount your workspace paid during the period Slack was down.
Source: https://get.slack.help/hc/en-us/articles/204113126-Plus-plan... + https://get.slack.help/hc/en-us/articles/115003205446-Plans-...

Chat is a commodity these days. For most businesses, it probably makes more sense to just let the companies in the business of offering paid chat services do their thing.

Don't see why you were voted down on this, since it's true. Slack working to get things running again doesn't mean they're prioritising your company's particular instance or region. They're likely to be making sure their own region and their own stuff is up and fixed first, so anyone away from the east coast of America is likely to get seen to after that. It would be stupid to do it any other way, since Slack employees are likely affected as well and they're the ones trying to fix it. Downvoting someone pointing that out is pretty fanboy-esque, or really naive.

Pretty much, if you don't own the service, you don't get to decide where in the queue you are for a fix.

Do these solutions have good mobile frontends?

I run an XMPP server for my friends. We use Conversations [0,1] on Android and BBOS, and Zom [2] on iOS. We use OMEMO [3] for encrypting most of our conversations, and while it isn't perfect, it usually stays out of the way.

Generally, the experience with the mobile clients has been quite good. Conversations and Zom are stable, attractive, and featureful. The biggest issues are some interoperability problems with desktop clients (displaying messages that should be hidden) and some things which I believe are server-side configuration issues.

Zom hides a some useful configuration features (in the name of being dead-simple to use), so I'm trying to convince one of my iPhone-owning friends to try ChatSecure [4].

[0] https://conversations.im/

[1] https://f-droid.org/packages/eu.siacs.conversations/

[2] https://zom.im/

[3] https://conversations.im/omemo/

[4] https://chatsecure.org/

Matrix via Riot has quite a great mobile client

I run my own Mattermost server and the mobile version looks very much like the Slack one (I only use chat, so I don't know if there are other functionalities in Slack that are missing from Mattermost).

So I guess yes :)

love MatterMost as it is ITAR compliant

"International Traffic in Arms Regulations"?

How is that relevant to a chat app?

If you work on military or DoD projects, you can't use most cloud platforms like Slack. Mattermost solves that compliance requirement.

Yo, people who like to complain that Slack just re-implements IRC... This Is Your Moment

At least with IRC I could connect to another server.

Or host your own

Clears throat


> Anyone who doesn't know, Slack actually runs off the IRC protocol underneath the hood.

[citation needed]

Pretty sure they provide an IRC interface, but almost certainly don't use IRC internally. There's almost no way they could support any of their fancier features using IRC. Reactions etc would be horrible to implement.
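The "reactions would be horrible" point comes down to the wire format. A classic IRC message (RFC 1459 style) is just a prefix, a command, and parameters; a rough parser sketch makes the gap obvious, because there is no message ID anywhere for a reaction or a thread reply to point at:

```python
# Minimal parser for a classic IRC (RFC 1459-style) message line.
# Note what's *missing*: no message ID field, so a client has nothing
# stable to attach a reaction, edit, or threaded reply to.
def parse_irc_line(line: str):
    prefix, trailing = None, None
    if line.startswith(":"):                      # optional source prefix
        prefix, _, line = line[1:].partition(" ")
    if " :" in line:                              # trailing free-text param
        line, _, trailing = line.partition(" :")
    parts = line.split()
    command, params = parts[0], parts[1:]
    if trailing is not None:
        params.append(trailing)
    return prefix, command, params

print(parse_irc_line(":alice!alice@example.org PRIVMSG #ops :deploy is rolling back"))
```

(Modern IRCv3 extensions do add message tags that could carry IDs, but the plain protocol that gateways speak does not.)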

Yeah and the IRC interface has been getting worse recently. When they added the shared channels across teams they completely broke being able to '@' a user from the IRC gateway. Support said something along the lines of 'yup and we're not planning on fixing it'.

I'm expecting them to completely turn off the IRC gateway in the next year or two.

Embrace, extend, extinguish. It's the SaaS business model.

I've switched to using TwistApp (https://www.twistapp.com) with my team. Unlike Slack where you have channels where everyone talks about everything, TwistApp bases conversations around threads. Every problem that's being worked on has its own thread. Once it's completed, I close and archive the threads. Very effective for getting things done as every task is isolated in a separate thread and discussions don't overlap.

Also read this post by Amir, the founder of TwistApp, "Why we're betting against real time messaging" - https://blog.doist.com/why-were-betting-against-real-time-te...

Twist also has a native Mac app.

When the status page is returning a 500 error... not a good sign.

On the other hand, makes it sound more likely to be a routing/reverse proxy issue instead of (say) a database issue. Those sound easier to deal with via a rollback vs something like "oops we dropped a critical index on the `messages` table".

With the length of the outage, it seems more like a DB issue. And if this is still correct, there look to be some fragile dependencies.


The API isn't even returning valid error codes.

  logging error: {"subtype":"api_call_error","message":"{\"ok\":false,\"error\":\"_http_error\",\"status\":0,\"retry_after\":null}","stack":"Error\n  
Since you can hit their API servers, I'm betting on some replication error in MySQL, but that's just a guess based on that case study.
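The log snippet above is awkward to read because the `message` field is itself a JSON-encoded string, so it has to be decoded twice to get at the actual API error. A small sketch, using the payload quoted above:

```python
import json

# The "message" field in the client log is double-encoded: its value is
# itself a JSON document serialized into a string.
log_entry = (
    '{"subtype":"api_call_error",'
    '"message":"{\\"ok\\":false,\\"error\\":\\"_http_error\\",'
    '\\"status\\":0,\\"retry_after\\":null}"}'
)

outer = json.loads(log_entry)            # first pass: the log record
inner = json.loads(outer["message"])     # second pass: the API error itself
print(inner["error"], inner["status"])
```

The `"status":0` is the interesting bit: an HTTP status of 0 usually means the request never got a response at all, consistent with the client reporting a generic connectivity error.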

I am betting they are saying 'connectivity' because that is the error the client logs.

It is back up for me, but now I am annoyed they can't follow specs at all.


They ignore:


And kitchen sink everything under the xdg config dir...

  ~$ ls .config/Slack/
  Cache/          Cookies-journal  GPUCache/     local-settings.json  Preferences
  Cookies         databases/       installation  Local Storage/       QuotaManager
  dictionaries/   logs/            QuotaManager-journal
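For context, the XDG Base Directory spec splits these into three categories, each with its own environment variable and fallback default; the complaint is that the listing above piles cache, logs, and databases into the config dir. A sketch of the lookup (the env-var defaults come from the spec; the per-category comments mapping Slack's files are my guess):

```python
import os

# Resolve an XDG base directory, falling back to the spec's default
# when the environment variable is unset or empty.
def xdg_dir(var: str) -> str:
    defaults = {
        "XDG_CONFIG_HOME": "~/.config",       # settings, e.g. local-settings.json
        "XDG_CACHE_HOME": "~/.cache",         # Cache/, GPUCache/
        "XDG_DATA_HOME": "~/.local/share",    # databases/, dictionaries/
    }
    return os.environ.get(var) or os.path.expanduser(defaults[var])

for var in ("XDG_CONFIG_HOME", "XDG_CACHE_HOME", "XDG_DATA_HOME"):
    print(var, "->", xdg_dir(var))
```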

It's an even worse sign when the page half-loads with some stylesheets missing.

Maybe they should not serve the static content with "Cache-Control: max-age=1". That's rarely a good idea.

Yeah, it alternates loading telling me everything is fine, and just giving me a nginx 500 error. Seems that the status page should be hosted differently so it can be up even when other things are having issues.

Returned for me when I was looking but took ~30 seconds to do so. They should host their status page on a separate domain and use something like CloudFlare in front of it to help with sudden spikes in traffic. Another alternative is to use Twitter / Facebook as the status page and let them deal with the traffic spikes, or just serve static HTML.

It's a sign that we need to appease the beer gods.

You could use this time to read The Slack threat https://carlchenet.com/the-slack-threat/

I'm hoping they publish a public post-mortem. Learning from this kind of outage is the best kind of experience for engineering - though it's far better when only staging goes down and not prod.

Looks like they won't: https://twitter.com/SlackHQ/status/925586114152411137

We've no solid plans right now as we're focused on tidying things up internally, but will consider it. Thanks again for holding tight

Hasn't Slack learned yet that you're supposed to host your status page on a different infrastructure?

I think they did.

The slack.com IPs are owned by AWS, while status.slack.com resolves to some DigitalOcean IPs.

Then why did the Slack Status page have so many problems at the same time? Half the time loading it would give a 500 Internal Server Error, 45% of the time you'd get broken resources (images and/or CSS), and only 5% of loads would give you the full working page.

Maybe they underestimated how many resources their status page needs during an actual outage.

Maybe because it's under a lot more load during an outage and they haven't upsized the status page infrastructure to handle their ever increasing user base.

DNS issues?

And of course today's the first day we're using Slack for audience Q&A at a conference. 360 folks in a room now have to...raise their hands! So barbaric.

Slack is currently down and I've realized, for better or worse, what Slack has really done.. It's created an expectation for immediacy. I thought about sending my question to someone via email but then just thought, "I'll wait for Slack to be back up, it'll be faster anyhow".

My first thought:

"Slack is down? Better post to Slack and let the team know."

You're not alone. I had the same thought.

I worked for BlackBerry when outages started to become a thing.

When your business motto is 'always on' - it's really, really bad to be 'off' - it's a deep transgression of the brand promise.

BB was structured poorly for this - they didn't grasp the concept of multiple nodes of redundancy very well. (Easy in hindsight).

But - we should all be impressed at how highly available Google, FB and some other brands are. That's impressive.

Team: if you're reading this, get out the radios.

I just realized that, being on a remote team, I can't reach any of my teammates, nor see live changes to the infrastructure and the repositories.

The things we take for granted.

If slack being down means you lose all insight into your build process and code management, you seriously need to introduce a secondary option immediately.

OP didn't say "all insight". There's a difference between being unable to see a stream of change events and not being able to see the current state of the system. The latter is completely unacceptable, whereas the former is just annoying.

If you're a fully remote team it'd probably be wise to figure out some fallbacks.

I'm sure they have fallbacks, but when their ecosystem (apparently) evolved around Slack, the fallbacks are less effective. Polling Jenkins to see when your job is done is more time consuming than receiving a Slack message.
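When push notifications (e.g. a Jenkins-to-Slack webhook) are down, the fallback is exactly that polling. A minimal sketch of the pattern; `check_build` here is a hypothetical callable standing in for a request to Jenkins' JSON API, not a real client:

```python
import time

# Fallback when push notifications are down: poll the CI server instead.
# `check_build` is a hypothetical stand-in for querying something like
# Jenkins' /job/<name>/lastBuild/api/json endpoint.
def wait_for_build(check_build, interval_s: float = 30.0, timeout_s: float = 3600.0):
    """Poll until the build reports a result, or give up after timeout_s."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        result = check_build()   # e.g. "SUCCESS", "FAILURE", or None if running
        if result is not None:
            return result
        time.sleep(interval_s)
    raise TimeoutError("build did not finish in time")
```

The cost is exactly the latency the comment describes: you learn about the result up to `interval_s` seconds late, instead of the instant a push notification would arrive.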

I'd like to message members of my team about this issue, but slack is down...

Email lives on.

Maybe Slack decided to give us all a break for Halloween :)

Also curious: why is this so low on the front page? 250 points, posted an hour ago...

Reminder why your status page should be hosted in a very different way than your regular infrastructure... so you're way less likely to end up with issues on both at the same time.

I like how statuspage.io even has metastatuspage.com in case their primary domain/DNS/TLD has issues.

Reminder that you should check things first before commenting. Slack's status page is on different infrastructure: it's hosted on DigitalOcean, while slack.com uses AWS.

Nice! I wondered why my productivity suddenly doubled up!

My RocketChat server is running just fine ;P

So is my jabber server!

This page seems to not be throwing 500s: https://status.slack.com/2017-10/8b0d4d44ea53726f

This is certainly scary for the Slack devs... Happy Halloween!

Spoopy day at Slack

Causing chaos at my workplace. We have Slack integrated into our incident management solution... very, very unfortunate.

Slack DOWN! Productivity UP!


Wow, talk about realising how much we rely on Slack.. suddenly I feel so disconnected and alone.


That was the most annoying scrolling experience I've ever had.

Links that hijack my scroll wheel earn an immediate downvote.

don't push code at 4pm! and on Halloween, oof..

  We are aware of connectivity issues and are actively lnvestigating.
  3:58 PM PDT・See in your timezone
They spelled investigating with a lower-case 'l' :\ Does that bug anyone other than me?

When your hands are shaking from adrenaline because your pre-IPO company is suffering from a global outage, you might hit an l instead of an i.

No worse feeling than typing out a status update without any idea of what's going on.

Just a simple typo. The keys are pretty close together if your finger slips, and I imagine they have enough problems distracting them from proper spellchecking at the moment. :-)

It doesn't bug me. But I really want to understand how that happens.

Why does the CPU go crazy every time Slack loses connectivity?

You can easily test this by disconnecting from wifi. As soon as you're offline the fan starts spinning until you get your connection back.

I'm pretty sure that "refresh" tries to bootstrap the whole world, e.g. it'll reload all JS assets in addition to just restarting the WebSockets.

That's why there's StatusPage.io :)

I think you meant to say statuspage.io, not statuspage.com

FYI, slack seems to be up again now (at least for me.)

not anymore? - EDIT: the status page was fine and then it wasn't.

Seems like every other request is throwing a 500 error. Maybe one server in a load-balanced cluster is erroring out?
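"Every other request failing" is exactly the symptom you'd expect from round-robin balancing over two backends with one unhealthy. A toy simulation of that guess (the backends here are made-up stand-ins, not Slack's actual setup):

```python
import itertools

# Toy model: a round-robin load balancer over two backends, one of which
# always returns 500. Requests alternate between them.
def simulate(requests: int, backends):
    ring = itertools.cycle(backends)
    return [next(ring)() for _ in range(requests)]

healthy = lambda: 200
broken = lambda: 500

statuses = simulate(6, [healthy, broken])
print(statuses)  # -> [200, 500, 200, 500, 200, 500]
```

In practice a real balancer with health checks would eject the failing backend after a few probes, which is why persistent alternation suggests the health check itself wasn't catching the failure.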

I'm not really a Slack user, but isn't Keybase always an option?

If by "always" you mean "as of a couple weeks ago, and with a small subset of the features".

Just a spooky coincidence it's on Halloween?

Is this getting flagged off of HN?

Looks like it. When I first loaded HN 2 minutes after this outage started, the story was #2. Then I refreshed after 3 minutes and it wasn't on the front page at all. Used the search tool to find it and then upvoted.

Looks like somebody wants to keep this quiet. 65 points in 12 minutes is good enough to be #1.

Outages without any information are not intellectually interesting, and typically neither are the discussions that follow.

It is infeasible to keep a global outage quiet.

High comment rates can also trigger the "overheated discussion detector", which will downweight a submission.

Perhaps the HN mods should do something about that then.

If you see something like this and you think it's in error, you can let them know and they'll likely be able to respond more quickly. There's a contact link in the footer.

You're joking, right?

No, I'm not. In my experience the mods are quite responsive, and have explained site behavior on more than a few occasions. They've also adjusted flags and weights of submissions if they identify an issue.

Granted, there’s not much constructive discussion that can happen as an outage is happening.

Discussion? No, I agree there. But it _is_ relevant "hacker" news right?

Probably just the ratio of comments v. votes. Too many comments relative to the number of votes will lower a post's ranking, IIRC.

I'm probably not making this any better by commenting, of course.

How is there nothing on the front page of HN, slack being out for almost an hour now?

Given timeframe and upvotes, how is this not the top of HN?

It is now

How can I tell my team that Slack is down? :-)


the loss of revenue.. is unquantifiable.

O.o god help us

Slack has a nice market share, but also many competitors, many of them 100% ripoffs with the same features (to name a few, Atlassian HipChat and MS Teams... not to mention open source products).

Slack has been experiencing service degradation often lately, so I would not be surprised if people start switching.

In our team we already started looking for an alternative.

HipChat came out long before Slack.

Yeah and they've had service degradation and availability issues long before Slack also. ;-P

Did not know that, thanks for pointing it out.
