Hacker News
Slack outage: Connectivity issues affecting all workspaces (slack.com)
468 points by abdullahdiaa on June 27, 2018 | hide | past | favorite | 263 comments

In light of how Slack and other companies haven't been able to maintain a decent level of uptime, I have to say, the one company known for making huge web applications that don't embarrassingly go down every couple of months is probably Google. I can't remember the last time Gmail was down. It just works! If Google is down, your internet is probably down.

Their expertise and discipline in distributed applications are unrivaled. I'm guessing it's because they have datacenters everywhere with huge fat pipes in between, and their SREs are probably top-notch people who don't take shortcuts.

Google gets a whole bunch of things wrong at times, but sometimes, I gotta say, they've nailed it.

Google is expert at designing services so that you won't notice when there is downtime.

Take Google Search for example. When there is downtime, results might be slightly less accurate, or the parcel-tracking box might not appear, or the page won't show the "last visited" time beside search results.

The SREs are running around fixing whatever subsystem is down or broken, but you, the user, probably don't notice.

The reality is this is how you design highly available systems, and it is also imo one of the reasons microservices have gained so much popularity.

Driving features with microservices makes it easier to isolate their failure and just fall back to not having that feature. The trade off is that monoliths are generally easier to work with when the product and team are small, and failure scenarios with distributed systems are often much more complex.

An analogy to your Google failure examples for slack might be something like the "somebody is typing" feature failing for some reason. In an SoA you would expect it to just stop working without breaking anything else, but one could easily imagine a monolith where it causes a cascading failure and takes the whole app down. Most services have countless dependencies like this.
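The isolation-and-fallback pattern described above can be sketched in a few lines. This is a hypothetical example, not Slack's actual architecture: the service URL, endpoint, and function names are all invented for illustration.

```python
import urllib.request
import urllib.error

# Hypothetical sketch: a "somebody is typing" widget backed by its own
# microservice. If that service is slow or down, we degrade to showing
# nothing instead of failing the whole page render.
TYPING_SERVICE_URL = "http://typing.internal/status"  # assumed endpoint

def typing_indicator(channel_id: str) -> str:
    """Return indicator text, or an empty string if the feature is down."""
    try:
        with urllib.request.urlopen(
            f"{TYPING_SERVICE_URL}?channel={channel_id}", timeout=0.2
        ) as resp:
            return resp.read().decode()
    except (urllib.error.URLError, TimeoutError, OSError):
        # Feature failed: fall back to "no feature", not a crashed app.
        return ""

def render_channel(channel_id: str, messages: list) -> str:
    # The core product (the messages) renders regardless of whether the
    # typing indicator's backing service is reachable.
    return "\n".join(messages) + "\n" + typing_indicator(channel_id)
```

The key property is that the `except` clause turns a dependency failure into a missing feature rather than an error page; a monolith that calls the same logic in-process has to be written very carefully to get the same isolation.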

While their mail service does have a remarkable track record for uptime, that same record is not shared by many of their other services.

I admin a number of GSuite accounts, and we experience fairly frequent (~monthly) periods of strange behavior with Hangouts/Meet, and Google Drive.

Fortunately Google is very good about providing updates via email to administrators as they're working through an issue.

Funny you should mention Google, as something is down over there right now. Lots of reports of Chromecasts being dead right now; I'm assuming something at Google is down that's causing this.

Oh, interesting. Thanks for pointing this out. I was having Chromecast trouble this morning and didn't even think to check if it was a widespread issue.

Gmail is one use case, and it was one of Google's original services, so it has had one of the longest "bake" times with regard to knowing how to keep it online.

Every service/team has to go through a period of growing pains as they learn, improve, and fix the code to be more stable. You can't just take the learnings from one service and apply them to another; it has to be architected and written into the code, and most teams start each new project/service with fresh code.

Facebook springs to mind as well.

Facebook quite often breaks their stuff and/or goes down, however, their outages usually last for just a few minutes.

They recently pushed out an iOS update for Messenger that crashed to springboard any time you tried to resume it from background. It took a couple of hours to get a new build up, plus however long for affected users to all install the new version.

I'd love to hear the story of how that made it through testing.

What does "crashed to springboard" mean?

Sorry, should have just said "home screen" for clarity, but SpringBoard is the iOS application that makes the home screen. It's akin to Finder.

A fresh launch of Messenger worked until you switched out and put it in the background. When you tried to resume it (either from home icon or task switcher) it would immediately die and could be launched fresh on the second try.

Basically every time you wanted to use it you either had to kill it in the app switcher and then launch it, or launch it twice.


My favorite part is that since Facebook doesn't do useful release notes (my best guess is that they're testing different features on different users and changes never actually land for everyone in a specific version), all the App Store said for the busted version was "We update the app regularly to make it better for you!" Oooops.

Though that's an interesting thought, I wonder if a feature had rolled out to a subset of users and it was crashing because it tried to pull some piece of account info that doesn't exist on accounts without it? Testing still should have caught that, but if the test accounts were all testing the new feature I could see it sneaking through. From my end it looked like a 100% reproducible crash on resume which is pretty sad to release.

springboard is essentially the Finder application on the iPhone - so crashed to springboard means crashed to home screen, basically.

Facebook breaks features very often. Sometimes things go missing and come back a week later. Dropbox does this a lot too.

It's the same for all sites beyond a certain size. It's never fully up. It's very rarely fully down. It's gradually degraded in ways that you hopefully don't see, but sometimes do. Or maybe you don't see it, but others do. etc etc etc. Availability isn't boolean once you have users.

And it makes headlines when they are down even partially. Same with iCloud (although their track record isn’t the greatest)

And those SRE's?

They use IRC.

From the Google SRE book:

> Google has found IRC to be a huge boon in incident response. IRC is very reliable and can be used as a log of communications about this event, and such a record is invaluable in keeping detailed state changes in mind. We’ve also written bots that log incident-related traffic (which is helpful for postmortem analysis), and other bots that log events such as alerts to the channel. IRC is also a convenient medium over which geographically distributed teams can coordinate.
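The incident-logging bots the SRE book mentions don't need much machinery. Here is a minimal sketch of the parsing half of such a bot; the channel name, message samples, and log format are invented for illustration, and this is not Google's actual tooling.

```python
import re
from datetime import datetime, timezone
from typing import Optional

# Minimal sketch of an incident-channel logger: parse raw IRC PRIVMSG
# lines and keep a timestamped record for the postmortem. Server
# housekeeping traffic (PING, JOIN, etc.) is ignored.
PRIVMSG_RE = re.compile(
    r"^:(?P<nick>[^!]+)![^ ]+ PRIVMSG (?P<chan>[^ ]+) :(?P<text>.*)$"
)

def log_incident_line(raw: str, log: list, now: Optional[datetime] = None) -> None:
    """Append '<UTC time> <nick>: <text>' to `log` if raw is a channel message."""
    m = PRIVMSG_RE.match(raw)
    if not m:
        return  # not a channel message; skip it
    ts = (now or datetime.now(timezone.utc)).strftime("%H:%M:%S")
    log.append(f"{ts} {m['nick']}: {m['text']}")
```

Because IRC is a plain line-oriented text protocol, the whole "bot" is little more than a regex and a file append, which is part of why it stays reliable when fancier systems are the thing that's on fire.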


How many SREs does Google have on said IRC system?

How many SREs are at Slack, working on keeping their systems up?

Finally, how many SREs could your company dedicate to keeping an internal IRC server up, and supporting it as an internal product?

I can throw ircd on a server, no problem, but there's a little bit more to six nines of uptime than `apt-get install`. The decision whether to use IRC or not should keep in mind Google's resources (in number of people, number of datacenters, and amount of money to throw at redundant hardware) for making sure it never goes down, especially when the datacenter is on fire around you.

They use IRC, and they have a previously communicated contact plan with redundant contact methods for when IRC is unavailable.

There is also this: https://status.slack.com/calendar, but they seem to grossly underreport the actual downtime...

[edit] note that including this outage, they are reporting to have missed their monthly uptime guarantee 3 months in a row.

Yeah, Stripe does the same thing with their status page. I get alerts that they have an outage at least once a week, and more often than not it never shows up as anything in their history. Honestly this is my only significant beef with the service, and I've been using it for years now with multiple integrations.

You know how much of the community uses one messaging system when, 15 minutes after it went down, it has over 40 points on the front page!

This says a lot about how it's a single point of failure in modern company comms.

It's even worrying to think about how some users probably have production-dependent (dare I postulate it) workflows in Slack that get crippled by its outage...

ITT: Chat about decentralisation that will ultimately lead to no action.*

*Because we've had this discussion so many times before...

Yes it's a single point of failure, but so what? I don't particularly care whether other organizations fail at the same time as I do, I just care whether I fail. Hosting my own chat system does not solve that problem. In fact, it may make it worse because then I have to worry about system administration, and Slack probably has more expertise on that. It's likely that they can fix this problem for all customers faster than I can fix my problem for myself. And it's not like I'm crippled when Slack is down. If it's urgent I can use the phone, and my todo list is stored outside Slack.

> In fact, it may make it worse because then I have to worry about system administration, and Slack probably has more expertise on that.

Although with outages like these, I doubt it!

If the software is architected this poorly, so that it can literally go down simultaneously for all clients, then why would I trust that it's secure?

> It's likely that they can fix this problem for all customers faster than I can fix my problem for myself. And it's not like I'm crippled when Slack is down.

Well, you can probably infer the former from the dependency on the latter. You use these tools because they can reduce the scrambling when shit does hit the fan, not because they are necessary.

In a way there's a second single point of failure though, right? So many people use Slack to integrate all kinds of things, and rely on their interaction with those platforms through Slack, that if Slack goes down then productivity halts and it's totally out of your hands while Slack themselves try to resolve the issue.

- You don't get GitHub notifications on pull requests and comments, so things don't get reviewed and merged if developers aren't in the habit of checking the PR tab on GitHub itself.

- You don't get CI notifications so you won't know how your latest test run or deploy is going without going straight into the CI service itself. Even worse when there's a failure and you're too used to having Slack warn you about that.

- Your team might depend on Slack so much that they don't know how else to efficiently communicate, and the most efficient channel to communicate a fallback is not available or rarely checked (e.g. email, face to face). So you get a lot of chaos as people come up with dozens of alternatives.

This is poor discipline more than anything, putting too many structural eggs into one basket, but it doesn't change the fact that Slack has created that dependency.

If your team can't check on that stuff manually for a few hours while Slack is down, then I think you may have bigger problems.

If anyone on my team came to me and cited Slack being down as a reason for their inability to do their job, then they wouldn't be on my team.

Is it less than ideal? Yes. Is it a little bit less efficient to pull info instead of having it pushed to you? Yes.

Is the sky falling? No.

I think it's inexcusable for a chat program to go down in 2018.

* your hdd failed? Use a raid

* your power went out? Use a UPS

* your DNS went down? Use a fallback (slack2)

* your whole datacenter flooded? Good thing you have multiple replicated cloud instances that seamlessly take over

See, these are the issues that "the cloud" was supposed to solve. Not give us the same problems as before, just with a recurring bill for "chat as a service".

And inb4 "chill Mike it's just a chat server not life support firmware": yeah, but Slack is the most trivial software you can think of: sending text from one computer to another. I see no reason this service can't be nearly as reliable as life-support firmware in 2018. We've had over 30 years to get this right. Raise the fricking bar.

>slack is the most trivial software you can think of

This is like saying that food service at 30k feet in a passenger airline is trivial because all the server has to do is walk up and down a narrow aisle handing out food from a cart.

Since "you see no reason this service can't be nearly as reliable as life support firmware", one of two things must be true:

1) You know something nobody else knows. In which case great, you've stumbled on a huge opportunity to go put your knowledge to work and get stupendously rich by outcompeting this "trivial" software company. Get to it, genius!


2) The reason you "see no reason..." is that you're unaware of one or more relevant facts.

Which of these do you think is more probable?

3) slack will get their "chat as a service" monthly fee whether the service actually works or not, so why commit to higher levels of service? We can get our users acclimated to outages and then sell them "slack Premium, for Serious Business", charge an even higher fee, and get stupendously rich all over again. This is the "growth" that investors demand, no?

The dark truth is I suspect we're moving in the opposite direction. Abstraction layers designed with that "chill, it's just a %s app" mindset are making their way into safety critical applications.

Eventually somebody is going to die because their pacemaker decided to throw cycles at mining monero.

Slack is text, channels, images, video, sound, search, audio calls, video calls, screen share (and interface share), bots, myriad integrations, and more. Calling it just "send text from one computer to another" is wrong.

I think maybe their point is that even if other pieces break, why shouldn't it be possible for the text communication to keep working?

If trying to provide all the other things besides text causes the system to be unstable, then maybe those things shouldn't have been added. We need text. We just want the other things.

Let me add more reasons:

1) Human software mistakes, where some error/exception snowballs into much larger issues that require a manual restore with service downtime.

2) Geo-distributed datacenters are a VERY expensive thing, so they're rarely fully implemented.

3) Bad system design, full of single points of failure.

> 2) Geo-distributed datacenters are a VERY expensive thing, so they're rarely fully implemented

You buy servers on aws-us-west and aws-us-east and sync them. How is that very expensive?

I imagine you've never actually had to solve any of these hard problems, which is why you think it's so easy to do.

That's bordering on (if not crossing into) ad-hominem.

There was no accusation of "so easy", only "so not expensive" and supposedly (and previously, demonstrably) solved in the last 30 years.

They may well be "hard" or even "expensive" for some definition of those two words, but if they weren't, it would defeat much of the (stated/advertised) purpose of outsourcing/cloud.

You propose just buying servers in two locations to keep Slack's services up? That doesn't work when you need to store gigabytes daily and keep a dozen thousand reqs/sec synchronized.

Geo-distributed datacenters require multiple direct low-latency multi-gigabit/sec links, special software to manage, test, and check it all, and skilled devops.


I know. There's totally not a command called rsync. And "replication" is just a word you hear on Star Trek, along with teleportation.

Although I agree with your premise, I think the delivery takes away from your point a bit.

Specifically, you risk people piling on that rsync isn't good enough in the modern world and referencing the comment criticizing Dropbox as being little more than an rsync replacement [1].

Of course, the specific tool one uses is irrelevant. The data synchronization problem may not be well solved, but it has been very well studied, with a remarkable number of good-enough options.

So, no, there isn't just one "sync" button, as the parent comment snarkily suggested, but there may be two, one where you might lose the last N seconds of chat (perhaps temporarily) and another where you lose the ability to chat entirely for those N seconds.
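Those two "sync buttons" correspond to asynchronous versus synchronous replication. A toy sketch of the tradeoff, with all names invented and both replicas reduced to in-memory lists:

```python
# Toy model of the two replication modes: async replication stays
# available but can lose recent writes; sync replication loses no
# acknowledged writes but blocks whenever a replica is down.
class Replica:
    def __init__(self):
        self.data = []
        self.up = True

    def append(self, msg: str) -> None:
        if not self.up:
            raise ConnectionError("replica unreachable")
        self.data.append(msg)

def write_async(primary: Replica, secondary: Replica, msg: str) -> bool:
    """Ack after the primary write; best-effort copy to the secondary.
    If the secondary is down, the message is missing there (a data-loss
    window) but the user keeps chatting."""
    primary.append(msg)
    try:
        secondary.append(msg)
    except ConnectionError:
        pass  # secondary will need catch-up later
    return True

def write_sync(primary: Replica, secondary: Replica, msg: str) -> bool:
    """Ack only after both copies exist; nothing acknowledged is ever
    lost, but any replica outage makes writes fail (chat is down)."""
    primary.append(msg)
    try:
        secondary.append(msg)
    except ConnectionError:
        primary.data.pop()  # roll back so the copies stay consistent
        return False
    return True
```

Real systems layer quorums, write-ahead logs, and catch-up protocols on top of this, which is where the expense the parent comments argue about actually lives; the fundamental availability-versus-durability choice is still the one sketched here.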

[1] Although it had other criticisms, such as monetization, which are, naturally, ignored.

Oh, yes, someone at Slack clicked the "Pause" button, and we are all waiting for Slack's hero to click "Resume" :)

They very likely have all of these protections in place, and more. Large-scale outages of mature systems are almost always a cascade of small human errors that, each on their own, would have caused negligible damage. It's only when they happen to align with each other that a large disaster is realized.

I worked at an open source company where they hosted their own IRC server. There are OSS alternatives to Slack and I wonder if that company has tried to adopt any of them.

This all goes back to one basic fact: The Cloud is Someone Else's Computer(tm).

If your hosted Confluence or Jira is down, you can go walk over to your IT team and they'll be like, "Yea we know. We broke something. We're working on it." If you're using a hosted (a.k.a "Cloud" solution), you're just kinda fucked. You can't even extract your data and try to run it locally if it's down (if that's even an option).

That's uptime-as-anecdote. Yes, you can throw your entire IT department at your outage instead of waiting on the vendor to fix it. How many of us work somewhere where the entire IT team is as large as the team that works on Slack's uptime?

How many self-hosted setups need the complexity and matching team size of a centralized service serving millions of users?

I remember the netsplits of IRC days...

Let's say the self hosted chat app does go down. Now someone has to fix it. Someone who probably has something better to do. In a cloud hosted solution, the person in charge of fixing your computer doesn't work for you.

My experience with self hosted solutions is that they go down way more often and take longer to fix than cloud solutions.

I'm not sure about production dependent, but I'd love to see how many other companies have longer/worse outages thanks to this. There are definitely a lot of people counting on Slack as a sole channel to push low-level error notifications, and I doubt most of them have an easy fallback option.

Reading this whole thread made me realize that at my company (~50 people) we have a couple of Slack bots that control a number of things, deploys being one of them. shrug

It's not so much decentralization as chaos engineering.

Building the program to withstand failure after failure, of things in and out of your control. Seems like Slack needs some chaos engineers...

In my company we use Cisco Jabber for official comms but Slack unofficially. So when Slack goes down, we fall back on Jabber.

My company has a customer support system that relies on Slack chatops.

It's an interesting morning, to say the least.

To me it raises a concern: chatops and Slack integrations are /very/ common; it's a form of vendor lock-in on their side, and it makes absolute sense.

However, if you become dependent on chatops to do your job (say, fallbacks for common things have eroded due to lack of use), then suddenly your company is crippled. And why? For a chat service? The value-add from Slack is grotesquely small in isolation.

What channel of communication did you pick to talk to your teammates while Slack was down? I've received messages by Facebook Messenger, Line, and good old rusty email :)

Here's a chat decentralisation platform: https://www.ratbox.org/ .

I have used Mattermost and been pleased with it. It is an open-source Slack clone you can run on a low-end VM or your own hardware.

Luckily most of my active communities are on Discord nowadays. It works much faster and even has a dark theme by default.

But it's not any less centralized, which I think was the complaint, not that it was popular, even though this was mentioned.

> which I think was the complaint

Spot on, not that I know anything about Discord's architecture...

I say this as someone who almost always prefers the dark theme wherever it is available: I wonder how much this desire for dark interfaces comes from almost every app interface having bright colors on white.

Somewhere along the shift to flat design, grays and non-bright colors have been ignored in the visual design of applications.

In civil engineering circles, it's known that a room which is too bright will cause eye strain and fatigue. There is an optimal level of light for the eyes to be most effective. But the computer makers and UI designers don't take this into account. Dark themes transmit less light to the eyes, causing less fatigue over time.

The trick is this: if you look into a bright light, you can't see the rest of the room anymore. It's a kind of forced feeding. Not that the designers of our world are guilty of some sinister CONSPIRACY. They simply see awful-white as the only choice based on tests, or they are just imitating what they know.

I just installed Dark Mode for Firefox [1], it makes all websites have a dark theme. My eyes are already thanking me.

[1] https://addons.mozilla.org/en-US/firefox/addon/dark-mode-web...

> even has a dark theme by default.

I love how having a dark theme is second only to "it works" in terms of how we pick services these days XD

It's lovely how in Slack.app, if you want a dark theme, you have to modify the internal JavaScript files...

Sure, it's worrying but worth it for me personally. I might go to jail due to this (seriously) but at least people won't die. For me that's the threshold.

You can't leave us hanging like that. How could a Slack failure possibly send you to jail?

Hope Slack considers doing a post-mortem similar to Gitlab[1]. Sharing what they learned and giving customers context is appreciated.

[1]: https://about.gitlab.com/2017/02/10/postmortem-of-database-o...

Yes, that way we can beat them up for years to come based on whatever mistake they made. It would be even better if they told us which employee made the mistake so we can incessantly mock that employee openly and publicly every time Slack is ever mentioned on HN. When GitHub was purchased by Microsoft, Gitlab came up quite a bit and we got to rehash that whole database outage over again many times over those few days. It was sad.

If it were my company, I would say as little as humanly possible.

It's not about assigning blame, it's about sharing lessons learned with the broader community and being transparent and honest with paying customers about issues that may have significant impact on downstream productivity.

It’s not about assigning blame for the company writing the post-mortem. But it’s definitely about assigning blame for most people reading the post-mortem. Very few people read post-mortems for the sake of learning how to be better at release engineering and ops.

If I pay for your service, and you are transparent about mistakes and flaws, I will be more forgiving about mistakes and flaws in the future, and appreciate the work you do to fix them.

If I pay for your service, and the only communication is, "We know there is a problem, and we'll let you know when it's fixed", I may assume you are not equipped to thoroughly explain the problem, and therefore not well equipped to solve it.

The blame is already assigned. The users already know there is a problem. A post-mortem likely has a positive effect for the readers attitude toward the handling of the issue.

It’s more the people who don’t pay for the service, but might, that are quickest to see post-mortems in a negative light. The only reason they have for reading them is looking for justifications for culling the product/service from the list of contenders for when they ever have to evaluate solutions in that category.

In other words: post-mortems are good PR, but incredibly bad advertising.

And a world-wide outage followed by "we fixed it and trust us it won't happen again" is going to filter any service off of my list more so than "we had a single point of failure running in our CTO's basement and his cleaning lady pulled the plug. Trust us it won't happen again."

I entirely understand what you are saying, believe me I do. But that is not the way some communities take it. We still see messages like "You could move to Gitlab but... you know they dropped their production database a couple of years back? Use them at your own risk!"

We learned a lot from the Gitlab outage. It was a simple mistake and not one they will have again, yet people still beat them up for it. I'm not sure the value is there for the company to be super open about their outages and issues.

On the contrary, I would trust them quite a bit less, not more, if they had an hours long outage without any explanation.

Perhaps - but would you even remember it, without the juicy details of what happened? I probably would forget if some service had a few hours downtime a year or two ago, if I didn't know any details to make it stand out from other outages.

Wouldn't they have gotten beaten up over the outage even more had they not offered an explanation?

In my experience, customers are often seeking an explanation/post-mortem because their customers are seeking an explanation. If an upstream service goes down for an extended period of time and all you can do is go back to your customers and say, "Your system was down because our provider's system went down for 4 hours. But they won't tell us why," that does not go over well.

Gitlab's response to the database mistake was a large contributing factor in my decision to move all of my repositories onto their service.

Anecdotal, sure, but people like me exist. I don't know if we're in the majority. You'd have to measure somehow and do a cost-benefit analysis I guess.

I hope you don't work in aviation with that attitude!

As usual people are taking a comment and twisting it any old way they'd like. Which is fine, that's why we have these communications. To start off, no I am not in aviation. I have run quite a few companies and development departments.

I am not suggesting Slack or anyone else should not communicate at all when they have an outage. A public postmortem, which many people are asking for, is one method. Is it the most effective method? I doubt it. Many people are suggesting that as paying customers they would like to know what happened. Does a public postmortem tell the paying customer what happened in an effective way? Maybe, but maybe not.

When I am running a company I care very much what my paying customers think and are feeling about my service. I will communicate issues directly to them. Do I need to explain to the rest of the world in some great technical detail what happened during an incident? Absolutely not. Do I need to have the first post in Google about my company be an outage postmortem? Of course not. I need my PAYING customers to be pleased with the service I offer and to understand how I will mitigate the damage I have done to them. To me, that's a basic principle of business. I don't have to explain to everyone. I owe everything to my paying customers. Gitlab did a postmortem almost immediately after a major outage and some people tried to slaughter them with the information they shared. It was sad and unfortunate. Their openness was met with some horrible results from the community.

Also, I use Slack. My company uses it for everything, including ChatOps for my production environment deployment. We have a hundred or so active users. The outage this morning harmed us. But you know what? I don't pay for Slack. I owe a lot to Slack, but they don't owe me anything. I can't blame them for my problems this morning. They are a free service to me. I appreciate that their absolutely free service serves my company so well almost all of the time.

My company does pay for slack, pays a lot, and I expect an RFO

Excellent! If you somehow read my entire message and concluded that Slack shouldn't give you detail about the outage this morning, then I somehow did not portray how important it is to explain issues and resolutions to paying customers. I hope you get a full breakdown and understand exactly how they will keep you from having this sort of outage again. If they don't, then it becomes a value issue to decide whether you should move to another system.

My point is only that it does not have to be a large public explanation. You, or the decision maker at your company, who pays a substantial sum of money to slack for their service, should have an explanation until you are satisfied.

Maybe unrelated, but my AWS-hosted websockets-using app had an outage starting at the same time. Also a third-party API provider we use for handling inbound phone calls. So this smells like a wider outage than just Slack.

When I was in Moscow a few weeks back, Slack wouldn't work. Exact same behaviour - it loaded up the gui, loaded up previous conversations, but then wouldn't work past there.

Russia blocks a lot of AWS IPs; when I did a full VPN out to a server in Germany, Slack came good.

That's interesting. More speculation: they haven't given any detail in 2 hours, perhaps if it's an upstream/3rd-party problem, they haven't been given any info.

I know it's not exactly scientific, but the front page of https://downdetector.com shows a number of services that have problem spikes starting anywhere from 3am US/Eastern to 9am US/Eastern and continuing through now (11:24 US/Eastern): Google Home, Fortnite, Exede, Level 3, New York Times, AWS. Maybe totally unrelated to each other, who knows.

That certainly does look suspicious there - especially level3.

I'm wondering the same thing. I chose this morning to soft-launch my side-project/startup and sent out the sign-up link to my e-mail list. Of course, it's AWS Cognito-based, was working yesterday, but failed for the new users. Great timing! Phone support said they are looking into some outages (even though the status page is all green).

Telegram was down too, just half an hour before Slack. Dunno if they run on AWS?

They do, and GCP.

I recall AWS/GCP public IPs getting banned in Russia when they were trying to block telegram.

Maybe I'm reading too much into it, but "We've received word that all workspaces are having troubles connecting to Slack." makes it sound like their internal monitoring didn't catch whatever is causing this. I was personally experiencing issues for about 20-30 minutes before the status update was posted.

Pretty much every time there's a Slack outage, it takes them a solid 20 minutes to update their status page. Several times I've emailed them 10 minutes into an outage (following "nobody at the office can reach Slack, but their status page says smooth sailing, we should do more diagnostics in case it's office internet or something..."), then gotten a response 10 minutes later to the tune of "we're aware, we just updated our status page, go look at that". I think they consider updating their status page a PR problem, so they avoid it if the issue can be fixed in under X minutes.

Which also makes their uptime totals completely bogus.

We started having issues connecting to Slack 4 hours before they reported a status.

Their subsequent update makes it sound like they still don't have a clue.

> Our team is still looking into the cause of the connectivity issues, and we'll continue to update you on our progress.

I think that is just the tongue in cheek language they like to use.

Yeah, it's really funny and ironic when you cost customers money!

It's interesting to me that the update messages are posted every 30 minutes from first notification until resolution. Judging by this and every other outage, I assume this is automatic, probably implemented to appease the people frustrated by the outage.


We have a similar policy at $WORK (but manual). In our experience customers go mental if you say absolutely nothing.

There is also zero information in those statuses, which kinda defeats the purpose. Might as well just have the status landing page with no details.

Good catch...definitely updating every 30 minutes exactly.
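The cadence the commenters are observing is easy to model: one post at the first notification, then one every 30 minutes until resolution. A sketch of that schedule (purely illustrative of the observed pattern, not Slack's actual tooling):

```python
from datetime import datetime, timedelta

# Generate the observed status-post schedule: a post at the first
# notification, then every `interval` until the incident is resolved.
def status_post_times(first: datetime, resolved: datetime,
                      interval: timedelta = timedelta(minutes=30)) -> list:
    times = []
    t = first
    while t < resolved:
        times.append(t)
        t += interval
    return times
```

For example, an outage first acknowledged at 14:30 and resolved at 16:00 would produce posts at 14:30, 15:00, and 15:30, exactly the kind of metronomic pattern that suggests a bot rather than a human on the other end.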

It's times like this I wish there was a solid decentralized standard to pick from, but there's no clear choice between XMPP and Matrix.

We use Slack for everyday company wide communication/ announcements and Riot for encrypted secure communications (you can host Riot yourself): https://about.riot.im/

It's not about the protocols, it's about having a client with a user experience that is acceptable to an entire company rather than just a team of engineers. Which decentralized protocol has such a client? (Speaking as someone who got burned trying to advocate for IRC at a company that eventually and inevitably switched to Slack.)

Curious why you got burned with IRC client UX given the multitude of clients available for it.

The multitude of clients is one of the problems! How do you find them? Which one do you use? What features matter? Nobody knows! They just want a product with chat rooms and don't understand why it seems so hard to do seemingly simple stuff like create an account or search for that link that someone posted a month ago.

Technical people who haven't used IRC can barely figure out IRC their first time using it. Trying to sell IRC to a company would be hilarious. Bob in Accounting getting on IRC and feeling comfortable with its UX?

Then you've never seen the 'hilariously' bad UX they already put up with, with things like Quickbooks.

mIRC is pretty straightforward compared to that.

A hypothesis I like is that when it's an application you use to communicate with other people, people are a lot less tolerant of a confusing UX.

The reason is that when you sit there clicking through a bunch of menus to find something in QuickBooks (or a typical atrocious enterprise app), nobody sees you; and if you screw something up there, you spend some more time fixing it and nobody sees the screwup. Frustrating maybe, as you waste time, but almost everyone has some frustrating wastes of time at work.

If you're on IRC and people are talking at you and you sit there fumbling to figure out how to respond, it's like you're in a conversation and tongue-tied and everyone's looking at you. And if you screw something up, like send a message to the wrong channel... now you've done it in front of all your coworkers, in real time. Humans hate looking stupid in front of the group.

And if you screw something up on IRC in front of your coworkers, and you're someone with even a little anxiety about not being tech-savvy... that's going to flare right up.

Also, because now you're embarrassed, you're going to want something to blame. So you blame the tool.

Yes. Also, QuickBooks is accounting, which is supposed to be hard while "chatting" with people is supposed to be easy.

QuickBooks doesn't have to suffer in comparison to better UX performing similar tasks in people's personal lives while IRC can be compared (unfavorably) to texting apps, Facebook Messenger, Twitter, AIM once upon a time, etc.

mIRC offers a fairly good UX compared to all those.

If you're setting it up in a corporate environment, just change the ini files so it autoconnects to your server. It'll pop up a list of channels they can join. The server can SAJoin them to particular channels on connection too. The UI is very clean and lightweight: a channel scrolls messages and they appear, there's an input bar at the bottom, and there's a list of users on the side. It's written in MFC and Win32 APIs, so it's blazingly fast compared to most applications, and you can find a version that will run on every computer made in the past 25 years.
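For concreteness, the "change the ini files" step might look roughly like the fragment below. The key names are recalled from old mIRC deployments and differ between versions, and irc.internal.example.com is a placeholder, so treat this as a sketch rather than exact syntax:

```ini
; mirc.ini -- pre-seeded before handing the install to users
[mirc]
user=Bob Smith
email=bob@example.com
nick=bob
anick=bob_
; auto-connect entry: description, server host, port, group
host=CompanySERVER:irc.internal.example.com:6667GROUP:Company
```

Channel membership can then be handled server-side with SAJoin, as described above, so users never have to type a /join themselves.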

The United States military used mIRC extensively for battlefield coordination. I think it's up to the task of handling Bob from Accounting.

An image search for mIRC shows that it is ugly as shit. It has a sidebar to list channels but the current channel window is still an undifferentiated mess of handles, commands, and actual conversation. Stored communication is mainly a server-side problem but I don't know if mIRC has an interface to show DMs you missed while offline or to indicate which part of a channel's conversation happened since you last looked.

Even if mIRC would suffice for Windows, you've not handled Macs, phones, etc. Who gives a shit if it runs on a 25 year old computer?

The US military has produced some specific examples of good design but isn't known for highly valuing usability, let alone whether someone would enjoy using a tool. IRC is very functional and mIRC appears to add a little polish beyond a pure command-line interface, but those are bare minimums and not good enough.

> An image search for mIRC shows that it is ugly as shit.


> It has a sidebar to list channels but the current channel window is still an undifferentiated mess of handles, commands, and actual conversation.


Each channel and private message get their own MDI window you are free to minimize, maximize or layout however you want.

Notifications are turned on by default, but they can be disabled. You'll get a tray notification if mIRC is minimized, and the window's title bar will flash. Notifications happen when your nick is mentioned.

There's a horizontal line that goes across the dialog window that indicates the location of the conversation the last time it was focused.

>Even if mIRC would suffice for Windows, you've not handled Macs, phones, etc. Who gives a shit if it runs on a 25 year old computer?

Other clients work on other platforms. mIRC is just what I brought up, since it's a desktop Windows client and that's the most common case for an office environment.

> The US military has produced some specific examples of good design but isn't known highly valuing usability, let alone whether someone would enjoy using a tool. IRC is very functional and mIRC appears to add a little polish beyond a pure command-line interface, those are bare minimums and not good enough.

It's a simple, light-weight way for people to send short text messages in near real time with tens of thousands of people. I think that's good enough, and it works at a scale that far surpasses the SaaS chat options.

There are plenty of decent XMPP clients, like Spark (https://igniterealtime.org/projects/spark/), but they'd take an IT team to configure.

Matrix has Riot (https://riot.im/app) but personally I find it incredibly confusing.

I'll have to take a look at Spark.

I don't think it's a problem if something needs to be initially deployed and configured by an IT department (or otherwise tech savvy individual or group), as long as its onboarding and primary usage flows are straightforward. An arbitrary non-tech-savvy but internet-familiar employee needs to be able to create an account, browse and join rooms, and search through history without any hand-holding. Slack and its direct competitors pass this test. IRC doesn't. Does Spark?

It's certainly the closest of all XMPP clients I've used, since it has a very friendly interface. Their related Openfire XMPP server is also targeted at internal deployments and is very easy to configure with a web UI.

Zulip is amazing, if a self-hosted system without federation is an option for you.

Love the workflow with Zulip, but I hope they work out a way to join in with either Matrix or IRC3+ federation.

Matrix is my preference for sure. It's fresh and exciting, while XMPP is harder to talk people into trying.

Their site (https://matrix.org) reeks of hype-oriented engineering. From the most cursory overview of their home page, their decentralization looks a lot like IRC peering.

The federation elements actually work pretty well, more similar to XMPP than IRC.

If you are interested, we are building a communication platform for communities fully based on XMPP: https://movim.eu/ :) It can easily be deployed on a web server.

There's always IRC ;-)

IRC is actually viable, with https://riot.im for offline logging and mobile access.

Create a new one![0]

[0] https://xkcd.com/927/

Despite having a vote increment velocity far exceeding other items, a publish time of only 25 minutes ago, and more points, this item just dropped from #5 to #7 on the front page.

How’s that work exactly?

Edit: It’s now dropped to #14 even with comment count also rapidly increasing.

300 comments in one hour will definitely kill it. HN penalizes controversy, which it uses comment count as a proxy for. It works well most of the time

Comment count is a factor.

edit: it's a negative factor...

Thanks for the clarity, this makes more sense now!

Quoting myself from 8 months ago [1]:

> I really don't understand these types of questions. The possible answers range from "because the ranking works that way" to "someone with privileges wanted it that way". On either end of the spectrum, the real question remains: so what? What difference does it make why a particular post is in a particular position? If the title seems interesting, you click on it. If not, you move on.

> I don't mean to question you in particular. It just seems like such a trivial concern to me that I truly can't understand why someone might possibly care.

[1] https://news.ycombinator.com/item?id=15576036

Hmm, it's now off the front page entirely, which seems strange. I don't see much incendiary commenting or similar...

@dang, care to comment?

This might be the longest downtime I've ever seen for Slack.

IRC had uptime in the scale of decades. Why are our 2018 solutions so fragile?

Eh, IRC networks split and individual servers went down all the time. But yes, there rarely was a complete EFNet outage even if sometimes there were 2 versions of the same channel going at once.

That being said, although I like some of Slack's fancy features, I do wish a distributed alternative could catch on.

Native emoji support, aesthetically pleasing front-ends, and clear product direction are some of the main positives I see, even if the combination of php on the backend and electron on the frontend aren't the most sophisticated technical components in history.

I prefer decentralized and open things, but a cohesive vision can sometimes provide a better user experience across a more restricted set of functionality than an army of hackers, each solving their own problems.

Offline messaging, mobile clients, push notifications, history, search, rich text formatting, message editing, file transfer, etc etc etc

..., inline images, display names, deleting messages, editing messages, reactions, avatars, multi-user private messages, etc etc etc

Have you tried IRCCloud? Their web based front-end is as nice as Slack's but it still works with decentralized IRC servers. They also manage the client's state (unread messages) better than regular IRC bouncers.

Emoji seem to work just fine on IRC nowadays, what do you mean by "native" support? The shortcodes? The fact that there's official clients you can entirely rely on supporting it?

Native emoji support, pretty front-ends, and clear product direction are possibilities on top of IRC (or XMPP), since their absence isn't a core part of IRC (or XMPP) -- it's just not a good way to make a profit if you don't lock down the network and act as the gatekeeper of the interface. Slack's API is fairly open though, and it's not a huge hurdle to interact with it. I built an IRC<->Slack gateway that bridges the differences fairly well ( https://slack.tcl-lang.org/ , you know, if Slack were working).

IRC netsplits quite a bit, breaking ongoing conversations until it’s recovered, to be fair.

Small ircds that you would run for a single team don't split because it's a single server.

Large networks can have the servers go up and down, and it's still not a big deal because of redundancy. DNS round-robin entries mean you don't even have to know the other servers on the network.

In 2018 netsplits caused by down links are fairly rare. If you wait six months you might see one.

And if you run a small single-point ircd, at some point the server or its internet connection will fail, and you’re in the same position as when Slack fails.

There’s nothing that gets around technical failure. Either you have a single server that’s going to die at some point due to sheer entropy, or you have a somewhat complex distributed system with the tradeoffs you desire that might fail anyway.

The downtime would be for a network connection failure and not because your 'fearless' NoSQL container didn't work as expected. If a transient networking problem like this is a big deal for you, you can easily add either more nodes or move the node to a place with more reliable networking.

Or because the IRCd written in 90s-style C++ by some people who honestly don’t know what they’re doing segfaulted, or because you accidentally K-lined, or because you accidentally filled up the disk with logs because the server’s maintainer was fired and nobody remembers how the system works, or the latest system update borked something, or a failure to update the system allowed someone to attack your network, or the really hacky mechanism you use to enable auth against Active Directory broke or allowed a disabled user to log in, or...

There’s a lot more that can go wrong than that a database falls over. In my experience, IRC servers fall over all the time - it’s just that nobody really cares because their clients just connect to the next server in the list and people resume their conversations a minute later after figuring out what messages actually reached their destination.

Paying IRCCloud to manage an IRC server for you is a reasonable option, but I wouldn’t do it because I think it’s going to be more available, but because I like IRC and believe it provides the functionality I need.

> Or because the IRCd written in 90s-style C++ by some people who honestly don’t know what they’re doing segfaulted,

Don't use a 20 year old ircd then. Use something like ratbox or InspIRCd.

> or because you accidentally K-lined or because you accidentally filled up the disk with logs because the server’s maintainer was fired and nobody remembers how the system works, or the latest system update borked something

Don't let 14 year olds run your server.

Anybody know why the netsplit was written

  *.net *.split
Why those stars and dots?

I believe it's an effort to show a netsplit in the traditional form (server1, server2) without placing blame on a particular server.

Back when I IRCd regularly (and perhaps this is still the case today), certain servers would get a reputation for splitting more than others, and I think this network (and/or its ircd) decided to mask it without breaking the general format.

It's to avoid showing the server names to the public on some IRC networks for various reasons including security through (some) obscurity.

I recall netsplits happening all the time, so I'm not sure that IRC's uptime was a practical real-world "decades".

I mean, most IRC networks at least have netsplits from time to time.

The power of centralization! If I can't have it, you can't either! I wouldn't say it's fragile, though. Just like normal IT work, people only pay attention when it isn't working.

Really? It may be up, but netsplits occur all the time.

sdf43543t345 has quit the server (Net split)

Because they're single-source, single-point-of-failure (in the case of Slack), and very very very complex.

Major IRC networks had uptime on the scale of minutes, in my experience…

writing entire stacks in javascript, probably

Given how much more robust Slack is than IRC as far as features go, it's probably not fragile. The closer a piece of software is to the network layer, the more stable it tends to be, just due to the internet's robustness.

Spare a thought for the chatops crowd who may be blissfully unaware that their own infrastructure is down

I have little pity for the folks who decided that re-implementing a shell in a chat application was a well-conceived notion.

Hopefully they're learning this lesson.

I can just see it now. A company's app is dying from all the timeouts to a slack webhook, however they can't deploy because slack is down.

Oh god. Who would have a Slack webhook as a blocking part of their pipeline??
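The safe pattern here is to treat the webhook as strictly fire-and-forget: a short timeout and swallow every failure, so a Slack outage degrades to a missing notification instead of a pile of blocked requests. A minimal Python sketch, where notify_slack and the URL handling are illustrative rather than any official Slack client:

```python
import json
import urllib.error
import urllib.request


def notify_slack(webhook_url, text, timeout=2.0):
    """Best-effort Slack notification: bounded wait, never raises.

    If Slack is down, the caller just loses one message instead of
    accumulating threads blocked on a dead webhook.
    """
    payload = json.dumps({"text": text}).encode("utf-8")
    req = urllib.request.Request(
        webhook_url,
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except Exception:
        # Timeouts, DNS failures, connection refusals, 5xx responses --
        # none of them should take the calling application down too.
        return False
```

For anything latency-sensitive you would push even this onto a background queue, but the key property is that the happy path of your app never waits on Slack.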

It's just login that's not working it seems.

I keep getting automated push notifications from our bots (but I still can't connect to the app myself).

We use it for things like relaying user-flagged messages to our support team and reminding us when scheduled content has been automatically released.

Surely some of the Slack team are hiding on here? Any idea what's going on? ;)

If they are, I sure hope they're doing something more productive than surfing HN

Very much agreed. I meant it as a tongue in cheek comment. I am however deeply disappointed by the vague and useless updates on https://status.slack.com/2018-06/142edcb9e52c7663

They might as well have written:

  - nope but maybe at some point a yep
  - still nope
  - nope
  - nope
... I know many companies don't like to give details in the heat of the moment (and the engineers that understand are likely working on it), so I really do hope they give us a good retro after it's all over.

Downtimes happen, I get it, but this one lasts for 3+ hours already. Can't even remember the previous time when such a large service was down for so long.

Seems to be fixed... with zero info on their status page about what went wrong or otherwise.

>We're happy to report that workspaces should be able to connect again, as we've isolated the problem. Some folks may need to refresh (Ctrl + R or Cmd + R). If you're still experiencing issues, please drop us a line

Hilariously, their "uptime in the last 30 days" still shows 100%.

While I appreciate the timely status updates, it almost seems like Slack has built a random status update bot to post updates that don't say anything exactly every 30 minutes.


Even Slack Enterprise is affected. So much for "this runs on your own infrastructure".

Slack Enterprise has never run on your own infrastructure.

It doesn't? I thought that was its only reason to exist.

Its pitch is a little different: it gives you ‘Workspaces’ which are somewhat connected, and tools to manage big deployments.

IBM, Oracle and many large companies use it because 100,000+ participants in one workspace is quite unmanageable.

Think channel namespacing whilst unifying user provisioning and enabling DM and MPDM across the entire company. Users can have access to one or many namespaces; they sign in once and it populates all enabled workspaces into that user's client.

You can share channels between workspaces within Enterprise Grid fairly trivially (although this now works between Slack tenancies owned by different companies too!)

Still runs on the same infrastructure in AWS as other Slack customers though.

From a policy perspective you can push down settings to all Workspaces in your SEG, and define whether you “centrally control” or “delegate to Workspace owners” on a setting by setting basis.

for a decentralized alternative: https://matrix.org/blog/home/

They have too many 'decentralized', i.e. blockchain, things on their landing page for my liking. However, since blockchain 'technologies' are the wonder kool-aid for everything, and given that messaging 'apps' are trivial compared to rocket surgery, how come there isn't a messaging app that is decentralised with these wonder technologies, where it only costs you a few cryptokitties to get your messages and where you earn a few dogecoin to forward on other people's messages?

Just to be clear, the Matrix protocol and technology does not involve Blockchains in any way.

The issue I'm having is that I can't send messages. Reading them is fine. The dashboards on the walls at SlackHQ must be pretty interesting right now.

If you want to stop being able to receive them as well, just hit reload (then you won't be able to reconnect).

Yeah, a strange failure mode; I was even notified of a new message a few minutes ago, but couldn't post a reply.

This is a weird outage...

- I am in UK

- I had similar problems last night (around 2AM GMT) but status.slack.com was all green, and my colleagues in the US seemed to be using it okay

- Currently it's completely down on desktop for me (waiting to reconnect...)

- Connecting through a US VPN does not resolve the problem on Desktop, even though my US colleagues are using it on Desktop successfully right now

- Mobile works for receiving and sending messages, but there is a delay

Anyone else seeing symptoms like this?

What does Slack as a company use to communicate when Slack the product is down? :>

Almost definitely IRC, just like Google does. https://landing.google.com/sre/book/chapters/managing-incide...

It's really disappointing to get this update after ~2 hours of downtime

> We have no new information to share just yet, but we're continuing our efforts. Your patience is truly appreciated. https://status.slack.com/2018-06/142edcb9e52c7663

Does anyone know of any Slack alternatives that support bots, GIFYs, and emojis? And above all, one that is not ridiculously slow.

There's Google Chat if you use G Suite like us. https://chat.google.com/welcome

Or Atlassian's Stride is really great too: https://www.stride.com/

If your org happens to be part of the Microsoft Office 365 ecosystem, there's Microsoft Teams. All of the products support bots, gifys, and emojis. I personally think Google Chat and Stride are much faster than slack too. I haven't tried Microsoft Teams yet.

We just spun up Mattermost, seems ok for the first hour.

The funny thing is that sending normal messages just times out 90% of the time. But /me comments generate an error after timeout:

  slackbot 9:28 AM: /me throws shoe at slack failed with the error "ASSocket: timed out reading 4 bytes from adminserver-3wvr:10443"

We've switched to Google Hangouts as an ad-hoc workaround.

Are there solid Slack-style self-hosted alternatives? (The GitLab of the Slack world, to be clear.)

Matrix (https://matrix.org) is a good alternative to both Slack and Discord. The most complete client implementation is Riot (https://riot.im/).

The protocol itself is federated, so you can communicate with other Matrix users from your self-hosted instance. There are also bridges to IRC, XMPP, even Slack..

Gitlab itself ships with Mattermost <https://mattermost.com/> :)

This doesn't look half bad... while slack's down, seems like a pretty optimal chance to try https://github.com/mattermost/mattermost-server

Atlassian, the folks behind bitbucket, jira, and (now) trello have a self hosted product: https://www.atlassian.com/software/hipchat

They also have a Slack alternative, called Stride.

And HipChat is horrible. It doesn’t even sync across multiple devices, and their mobile app doesn’t support the iPhone X screen size, which is a trivial update to make considering HipChat is used by some pretty massive customers. Code highlighting is still, after years, pretty bad.

Hipchat doesn't exactly have a stellar uptime record either.

Has anyone here used Stride? Reviews?

https://zulipchat.com/ has a topic/thread concept that is just awesome!

Already tried Movim (https://movim.eu/)? It's fully relying on XMPP.

I use gitter.im for some oss community chat. It's okay.

We're giving Mattermost a good try now!

Mattermost is actually integrated with GitLab - so yes - options do exist.

There’s stuff like Rocket Chat, and I think Dropbox released one as OSS a while back resulting from an acquihire, but the name escapes me now.

My team jumped onto IRC, which is working pretty well. No custom emoji, but that's about it.

Do you run your own server?

Fixed, but their status page is rather optimistically reporting 100% uptime this month.

Slack and Chromecast having an almost simultaneous worldwide outage... do they have something in common? Same enemies?



This is why I like working on iOS apps: if all hell breaks loose I can't do anything. When something like Slack goes belly up, imagine those poor folks having to respond.

Yikes. This one looks like a doozy. Here's to hoping there's a post mortem this time around.

Just as we sign a £100K+/annum contract with them...

At least it happened before we've migrated.

There are so many open source solutions (Mattermost, Rocketchat, etc.), so why are companies willing to pay $100k/yr for Slack? What was the defining feature that the others didn't have? Even Discord feels like it has far more features than Slack.

Name recognition, employee familiarity (I've used Slack at every job I've worked since early 2015, I pretty much know what to expect from it always), and punting maintenance costs (this is probably the biggest factor).

I love IRC and XMPP. I'd love to run one of those, or some new service (Matrix?), at work. However, my time is arguably better spent doing anything _other_ than maintaining such services, and the same goes for most engineers at most companies, sadly.

Side factor: the mobile clients for IRC and XMPP almost universally suck, at least on Android. I imagine if those problems had been solved in a reliable way, more companies may consider them (assuming the allocation of engineering resources problem isn't a problem).

There may not be one defining feature. For some it may be look and feel/attention to detail. It could be the number of well supported Slack integrations, or various enterprise features wrt message retention and deletion, SSO and auditing.

Telegram was having an outage at about the same time, probably unrelated but still: http://downdetector.com/status/telegram

What backup methods of communication are distributed teams using when Slack goes down?

My group uses Roman fire signals [1].

[1] http://www.romanobritain.org/8-military/mil_signalling_syste...

this is a greatly underrated comment.

Email, if it's actually important. Then if emergency, you can always hop on IRC, Skype, Messages/SMS, etc.

All my team also uses Zoom, so we just switch to a group on Zoom when Slack goes down. It's not quite as nice as Slack but it gets the job done.

My team uses Voxer threads as backup during times like this. We also use Voxer when it takes too long to type explanations out on Slack.

Secure Scuttlebutt

Is this a bob reference??

Message in a bottle

RocketChat - Self-hosted, not backup, primary. Outage like this validates that decision.

That's a lot of eggs in their basket. A major downside of the current way a lot of these companies work is that there's a huge incentive for them to never allow customers to self-host their product.

I wonder who is writing those status messages. As a customer, this reads like "please wait, ETA to fix unknown" in 10 different ways.

Back online here.

Here too

That's the problem with not self-hosting your essential stuff.

Reasons my self-hosted servers have gone down in the past year:

- Scheduled electrical maintenance that facilities manager failed to disclose (even though they knew about it for weeks).

- Emergency power-down because two of the four air conditioners failed at the same time.

- Someone accidentally powered off the VM.

I'd much rather have an hour long outage here and there than incur the cost of defending against these circumstances (and still have it go down for some new unforeseen reason).

>- Someone accidentally powered off the VM.

how is that self-hosting when you don't control the hypervisor in this case?

It usually implies that you at least have some sort of control: either having a real server somewhere (with UPS and stuff) or at home, where you know when power is out.

While what you are doing is technically self-hosting, I would have changed the VM provider after the first incident like you described.

Funny, our self-hosted infrastructure goes down weekly.

No offense, but you're probably doing it wrong

You just gave a good argument of why s/he should use Slack. Yeah s/he might be doing it wrong, but so what? One should focus on core business, not system administration.

Yup, every time one of the popular centralized XaaS platforms go down, there's always the snarky "Heh, well my stuff is self-hosted...", and they are always the types that have no idea how to value their time.

Self-hosting doesn't imply automatically better uptime than the hosted version.

Most people are much better off letting dedicated teams of tens of people take care of the hosted version for them.

I don't see it as too much of a problem so long as you're not one of those teams that orchestrates their deployments using a Slack bot. Just make sure you have an agreed backup mechanism, i.e. Google Hangouts, Zoom, etc.

Less blowback typically when you can blame a third party though.

I imagine people felt more productive today without Slack

Relaxing morning off on humpday!

Still no word on the cause or fixes after 90 minutes. I wonder what kind of problem this could be.

give them some slack guys

Skype still up

back up for me.

And productivity is up!

What productivity? I am here on HN

Can confirm. No wait, the first thing I did was check HN and now am reading this and oh there's another interesting thing.

HN's noprocrast setting cures that.

It works all too well. This will be my last comment of the day before noprocrast kicks me off HN.

Thank you for pointing this feature out. I had no idea what that flag did.

Hacker news can take the... slack.
