Slack is down (slack.com)
989 points by gpmcadam on Jan 4, 2021 | 803 comments



All: large threads are paginated, especially today when our server is steaming. Click More at the bottom of the thread for more comments, or like this:

https://news.ycombinator.com/item?id=25632346&p=2

https://news.ycombinator.com/item?id=25632346&p=3

(Yes, these comments are an annoying workaround. Their hidden agenda is to goad me into finishing some performance improvements we're badly in need of.)


When I was at Uber, we noticed that most incidents are directly caused by human actions that modify the state of the system. Therefore, a large "backlog" of human actions that modify the system state has a much higher chance of causing an incident.

My bet is that this incident is caused by a big release after a post-holiday "code freeze".


To elaborate a bit more on this point, you have to think about it like any complex system failure - it's almost never one thing, but rather a combination of many different factors. The factors around post NYE releases:

- high-risk changes that weren't released pre-holidays get released. Depending on the company, this could mean a 1-week to 1-month delay between implementation and release. The greater that interval, the greater the divergence between the world of production and the world of the new feature.

- lots of new hires (new year = new hiring budget). New hires are missing some tribal knowledge about the system and are more likely to make a production-breaking release.

I tried to think of other reasons, but these two overwhelmingly stand out as the biggest. Would love to hear from others.


If new hires tend to break production, it's not on the first business day of the calendar year. December typically gets really quiet for recruiting, as candidates get busy with their social lives and scheduling interviews gets harder.

January is busy for recruiting, but given a week or two of interviewing and negotiating, two weeks notice, it's probably February before new employees are starting, and they're not making big, production-damaging deploys for a week or two after that.


You will also get a pause in new hires in late December for the same reason. I've certainly accepted an offer late in the year and then didn't start until the new year.

Probably not as big of a rush as the end of school year rush in summer though.

I also doubt that new people will be breaking production on day one. Even at a fast moving startup I'd expect it to take a bit to go through the onboarding paperwork, get a laptop and actually try pushing a change to production.


I think some big company (maybe Facebook) has a rule that you have to deploy something to production on your first day. They seemed pretty confident in their processes and devops teams. A company trying to imitate that policy without doing the work necessary to make it possible would probably have outages on days when lots of new people joined :-P


Could be Facebook, as I think production releases are always rolled out in phases, e.g. first to 10 users, then 100, then 1,000, and so on. That means there's much less chance of even the worst mistake having a serious effect.
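
As an illustration, here's a minimal sketch of how a percentage-based rollout gate could work (the names are made up, not Facebook's actual tooling): hash each user into a stable bucket and compare against the current rollout percentage.

    import hashlib

    def in_rollout(user_id: str, feature: str, percent: float) -> bool:
        """Deterministically bucket a user; the same user always gets the same answer."""
        digest = hashlib.sha256(f"{feature}:{user_id}".encode()).hexdigest()
        bucket = int(digest, 16) % 10_000   # bucket in 0..9999
        return bucket < percent * 100       # percent=0.1 selects 10 of 10,000 buckets

    # Ramp the (hypothetical) feature: 0.1% of users, then 1%, then 10%, then everyone.
    for stage in (0.1, 1, 10, 100):
        print(stage, in_rollout("user-42", "new_message_store", stage))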


Wow, onboarding new hires here is going well if they can access Slack, O365, LDAP, and the VPN, and clone the repo by the end of the first day. Though we do have the initiation ritual of installing the OS on your laptop.


Sudden surge of traffic as all their users return to work?


Could be, it's the perfect time overlap between US-West, US-East, and Europe.


Yes - I wondered if they took some servers down prior to the break as a cost saving measure, and forgot to reinstate them.


Doubtful. It's not impossible that a company the size of Slack relies on a specific engineer logging on in the morning to scale things up before a traffic spike, but that would be a misuse of modern distributed cloud computing.

Hate on the cloud all you want, but AWS has (several flavors of) load balancers and various ways to automatically scale up and down resources (and if you're conservative, you can disable the 'down' part). If you're operating a major SaaS company like Slack and not taking advantage of them, something's gone wrong.
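
As a rough sketch of the "scale up but never down automatically" setup (the group name and target value here are made up): EC2 Auto Scaling target-tracking policies let you disable the scale-in half outright.

    import boto3

    autoscaling = boto3.client("autoscaling")

    # Target-tracking policy: add instances when average CPU goes above 50%,
    # but never remove them automatically (DisableScaleIn=True).
    autoscaling.put_scaling_policy(
        AutoScalingGroupName="slack-like-web-asg",   # hypothetical group name
        PolicyName="scale-up-only-on-cpu",
        PolicyType="TargetTrackingScaling",
        TargetTrackingConfiguration={
            "PredefinedMetricSpecification": {
                "PredefinedMetricType": "ASGAverageCPUUtilization"
            },
            "TargetValue": 50.0,
            "DisableScaleIn": True,
        },
    )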


It's easy to fall behind on bumping up the high-water mark for your max autoscaling, or for new traffic patterns to cause emergent instability. New code paths are taking unprecedented amounts of traffic all the time.

In 2021, how does one keep track of resource starvation at the process, container, OS, service, pod, cluster, availability zone, and region levels?


I would add the potential scaling issue here - the holidays were a dry season with fewer meetings. So if they have some automation for scaling down to reduce cost, it may have bitten them in the arse now.

People came back to work, and most of them start around the same time (US wise at least).

Hence, kids, a vital lesson for all of us: don't start the call on the full hour. Give it 3-7 minutes to confuse your coworkers and give the systems some time to auto-scale ;)


I hope this isn't the case. It's not like this is the first holiday season for slack.


I think you're right on the first bullet, but not the second. If it was mid-Feb, then maybe, but the next FY hasn't even started yet for a ton of companies, let alone onboarding newbies to production.


People returning to work and downloading a huge backlog of messages from the past two weeks.


As an ex-Microsoft Teams dev, I can vouch that message retrieval after vacation puts a lot of stress on storage systems before it stabilizes :)


Yeah, makes sense. A system typically optimized for performance and real time delivery is suddenly asked to perform multiple batch retrievals in large chunks. Ouch!


I would bet it's just the post-holiday influx of traffic. With systems that haven't been updated in so long, maybe some annoying memory leaks have crept in and gone unnoticed, or some other bad state was exacerbated by the return-to-work day for most NA folks. Code freezes were good at surfacing bugs that only show up after long periods.

I doubt anyone is releasing big changes on a Monday morning.


I haven't worked at Slack, so I can't speak with high confidence. A traffic spike is a possible reason, but I'm willing to bet that it's not the reason:

> I doubt anyone is releasing big changes on a Monday morning.

This is definitely an engineering best practice, and by best practice, I mean something that Uber's, I mean Slack's SRE team strongly pushed for, and got politely overruled on. After a code freeze is lifted, it's quite common for lots of promotion-eager engineers to release big changes.


In my experience it's not promotion-eager engineers that want to push after a code freeze, it's antsy product managers. YMMV tho.


IMO it really doesn't have to be promotion-eager engineers or antsy product managers. I'm fairly satisfied with my role, comp, and type of work for where I am in my career/life stage. I just did a code release first thing this morning, not because I'm promotion-eager, but because I'm picking back up where I left off, like on any normal day. Granted, I work at a much smaller company than Slack, with orders of magnitude less traffic.


What's there to change in Slack, though? It's essentially just a messaging system, and that feature is tried and tested. That, and Giphy, to be honest.

EDIT: Guys it was a joke, chill


HN's tolerance for jokes and sarcasm is extremely low.


I'm not sure about that. I feel like I get more upvotes for sarcasm and jokes than for insight. In this instance, I think it's because people hear similarly dumb things said seriously in real life, so they don't readily recognize it as a joke online.


Yeah, Poe’s law applies here. That’s definitely something someone less informed might say in earnest.


Yeah, there was another thread about Uber where a similar sentiment was seriously debated, so I didn't recognise this as sarcasm either.


You just don't deploy something major on the first day after a two-week vacation; it doesn't make any sense.


Why? I had a rewrite of some core logic ready the last day before Christmas that I didn't deploy, as it wasn't time-critical and I didn't want to be disturbed during the holidays. Today was the perfect day to deploy it, as I can watch it the whole week if needed.


Well, I think it probably depends on where you work. At my work, people just took 2-3 weeks of time off. It takes a moment to get your head back in the game.


Yeah, I do this all the time. I don't want to be bothered on the weekend, so I push releases at the beginning of the week when possible.


Same, I would rather release on a Monday than a Friday.


Everywhere I've worked has often had a massive backlog of things that get released after a moratorium or extended holiday week. Those are usually the worst weeks to be on call, since things are under so much churn.


8am on the first day is too early... But by 10am, after catching up on email, it's totally time to start releasing stuff.


It depends on the goal you’re trying to accomplish. Are you going for a promotion or bonus? Or instead is your goal to maximize uptime?


I doubt that regularly releasing breaking changes that reduce uptime is a good strategy to get a bonus or promotion.


Assume promotion after releasing 10 changes

Releasing 1 change a year with a 100% chance of working -- no promotion for 10 years

Releasing 10 changes a year each with a 10% chance of breaking something -- 1 in 3 chance of promotion in a year, and a 2 in 3 chance of downtime
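
A quick sanity check of those numbers, under one reading of the setup (promotion only if all ten releases land cleanly, and the releases are independent):

    # Probability that all 10 independent releases succeed (90% each):
    p_promotion = 0.9 ** 10        # ~0.35, i.e. roughly a 1 in 3 chance
    p_downtime  = 1 - p_promotion  # ~0.65, i.e. roughly a 2 in 3 chance of at least one break
    print(round(p_promotion, 2), round(p_downtime, 2))  # 0.35 0.65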


> Assume promotion after releasing 10 changes

Where did that assumption come from? Also are you claiming that it takes 10x more time to release a non-breaking change?


Interesting, I've never worked anywhere where engineers decide when to release changes. That's a product decision, and there is a process of review and approval at both the code level and the functional/end-user-experience level that has to happen first.

Did you mean that literally? E.g. is it common at Uber that engineers can release changes to production on their own?


At Cisco (Webex team), the engineers decide when to release code, and most features are enabled by configs or feature flags independently of the deploys.

The engineering team is responsible for the mess caused by a bad deploy, so it's appropriate that those engineers should also choose the timing.

Our team typically deploys between 10am and 4ish, local time, since that's when we're at our desks and ready to click through the approvals and monitor the changes as they go through our pipelines.

The feature enablement happens through an EFT / beta process, and the final timing of GA enablement is a PM decision. But features are widely used by customers ahead of that time, as part of the rollout process.

Our team usually rolls out non-feature changes to services via dynamic configuration switches, so that we can get new bits in place, and then enable new behavior without a redeploy. This also enables us to roll back the dynamic config quickly if something unexpected happens.

(We generally don't do this for net new functionality; there's lower risk in adding a new REST endpoint etc. than in changing an existing query's behavior or implementation.)
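
Very roughly, the pattern looks like this sketch (the flag name and config path are invented, not Webex's actual system): ship both code paths, gate the new one behind a flag that's re-read at runtime, and rollback becomes flipping the flag off.

    import json
    import pathlib

    CONFIG_PATH = pathlib.Path("/etc/myservice/dynamic.json")  # hypothetical dynamic-config file

    def flag(name: str, default: bool = False) -> bool:
        """Read a switch from dynamic config on every call, so ops can flip it live."""
        try:
            return json.loads(CONFIG_PATH.read_text()).get(name, default)
        except FileNotFoundError:
            return default

    def legacy_query(channel_id):
        return ["old path", channel_id]

    def new_query(channel_id):
        return ["new path", channel_id]

    def fetch_messages(channel_id: str):
        if flag("use_new_message_query"):
            return new_query(channel_id)   # new behavior, shipped dark until the flag is on
        return legacy_query(channel_id)    # safe fallback; rollback is just flipping the flag off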


Does Uber/Slack not release in CI/CD? At least in backend?

I don't see any need to deploy a big change at once in the software world today. At worst feature gate the thing you want to do and run it in a beta environment, but still push the actual code down the pipeline.


> run it in a beta environment

Every Uber/ex-Uber engineer is nervously chuckling at this comment right now


As the saying goes, Everyone has a Test environment. Some people are lucky enough to have that distinct from Production.


For those that don't know what this comment is about: https://eng.uber.com/multitenancy-microservice-architecture/


I'm actually more confused after reading that. I assumed you meant that they test in production on purpose, but it sounds, at a skim, like they do have non-prod testing environments - in fact, it looks like they've gone to having multiple beta environments of every service?


My understanding is that they have a "tenancy" variable in every service call which can take a different code path. They seem to only have one environment for everything and do tests/experiments at code level based on this variable.
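
So, very roughly, something like this sketch (not Uber's actual code; the header name and handlers are made up): every request carries a tenancy tag, and test-tenant traffic takes the experimental path against isolated resources inside the same production deployment.

    PRODUCTION_TENANCY = "production"

    def stable_handler(payload, datastore):
        return {"path": "stable", "store": datastore}

    def experimental_handler(payload, datastore):
        return {"path": "experiment", "store": datastore}

    def handle_request(headers: dict, payload: dict):
        tenancy = headers.get("x-tenancy", PRODUCTION_TENANCY)  # hypothetical header name
        if tenancy.startswith("test/"):
            # Test-tenant traffic exercises the experimental code path and writes to
            # isolated test resources instead of production data stores.
            return experimental_handler(payload, datastore="test-" + tenancy.split("/", 1)[1])
        return stable_handler(payload, datastore="prod")

    print(handle_request({"x-tenancy": "test/exp-42"}, {}))
    print(handle_request({}, {}))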


Ah, thanks; that explains it nicely


Aaah the wonders of not having to be PCI or SOC2 compliant...


That might be true, but when you take the global usage of Slack and the respective time zones into account, more than half the world would have signed into Slack this morning before SV had, and I certainly didn't notice any downtime this morning in my time zone.


It was ropey before SV woke up; I thought it was just my (normally rock-solid, thanks to Ubiquiti) network having issues.

Guess it was Slack being Slack.


What would make that strange? Where I work it is frowned upon to do releases on weekends, so bad changes due to the buildup happen on Mondays.

Although, we also don't close the pipeline for just any holiday break. In fact, low holiday traffic is a good time to keep pipelines open, since changes will impact fewer people.


I have definitely worked in places where the times right before and right after a change freeze were the most unstable, so that could be it. However, as others have mentioned, it's pretty early on the west coast of the US. Unless some engineer was up extra early (perhaps at the behest of an anxious project manager) it seems unlikely to be a release.

What it could be is some engineer somewhere coming in after the holiday, noticing a slightly flaky thing, and thinking, "I'll reboot/redeploy/refresh this thing so the flakiness doesn't get worse". Only it turns out the flaky thing was a signal of something else falling over. Or maybe the redeploy was the wrong version because of bad CI/CD, or maybe the person just fat-fingered it.


Most releases are automated with time lockouts.


In what companies?


Competent ones like those you'd hear about being down on HN.

At least that's how it worked at one FAANG.


It varies a lot by team... I think it's common to have a single click "start" button to press. It's a good sanity check that a release isn't going to happen during a fire drill, outage, or strike...


Or unless that engineer was not in the US


Very possible. I don't know what Slack's workforce distribution is. In places I've worked there have definitely been some incidents in US off-hours triggered by someone on the other side of the world.


Another common cause is resource exhaustion as a result of poorly monitored resources (or buggy monitoring). For example, Google's authentication went down because their system wrongly reported an available quota of 0. The last two incidents at my company were also related to resource exhaustion.


This is one of the original arguments for going capital-A Agile: make smaller releases more often, so at least if something breaks, it's (hopefully) something small, and at least it's easier to trace.

(I'm not making a statement if that's good or bad or if it works or whatever. Please don't read an opinion into it.)


This. If you roll many changes into a single deployment, you don’t know which change broke what. But if you have two or three weeks of commits waiting, it’s hard to do otherwise.


That's why good regression tests and CI are so important; in an ideal world (which we were close to in one of my projects), every change is pending in a pull request; the CI rebases the change on top of its upstream (e.g. master/main), simulating the state the codebase will be in once merged, and runs the full suite of tests. The build is invalidated and has to be re-run if either the branch or upstream is changed.
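
Concretely, the CI job for each pull request boils down to something like this sketch (the branch name and test runner are assumptions):

    import subprocess

    def run(*cmd):
        subprocess.run(cmd, check=True)

    # Simulate the post-merge state: rebase the PR branch onto the latest upstream,
    # then run the full test suite against that simulated result. If upstream moves
    # or the branch changes, the whole job is re-run from scratch.
    run("git", "fetch", "origin", "main")
    run("git", "rebase", "origin/main")
    run("python", "-m", "pytest")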

Now, caveats etc, this was a collection of single applications in a big microservices architecture, and as the project grows it becomes more and more difficult to manage something like this, especially if you get more pull requests in the time it takes to do a build. But it is the way to go, I think.

Anyway, since tests and CI are not definitive, you also need a gradual rollout - 1%, 5%, etc - AND you need a similar process for any infrastructure change, which gets more and more tricky as you go down to the hardware level.


This is very likely a broken release. The timing lines up with pacific time too well.


They declared the issue at 7:14AM PST. How long is their deploy process?

That sounds pretty early to think somebody on the west coast did something, other than maybe acknowledge the pages and declare the incident.


Slack does progressive roll-outs. The broken release hypothesis seems very unlikely.


If you think about it, modifications to state of the system caused by human actions are the sole purpose of computers.


Seems to be more than that. Even slack.com in an incognito browser fails.


What does an incognito browser have to do with anything?


An incognito browser would ignore all client-side cookies, so the Slack web client would not try to - say - resume a previous user's session or re-use any previously saved data.

Likewise, incognito mode will also ignore most cached web content, meaning all assets on the Slack web app will get loaded again from scratch. This "clean state" start could, theoretically, get around issues with old - potentially incorrect/outdated - assets being loaded, even though that really shouldn't happen under most circumstances.


Sure, but why does that indicate the issue probably wasn’t related to a code push, like the person I responded to said?


It means that one is not sending a session cookie of any kind, and thus should be sent a 100% cached version. No "Are you XYZ and want to log into ABC's Slack again?" box.


That means it's not a user auth error


A non-logged-in user may not go through all the same code paths as a user with cookies present.


This is why Change Management is a core tenet of frameworks like ITIL.


I'd like to take this moment to mention self-hosted, open source, and federated alternatives like XMPP and Matrix.

I'd like to, but unfortunately I don't feel like I can in good faith. Matrix is woefully immature and suffers from a lot of issues, but I think it's closer to being a functional Slack/Discord alternative. XMPP is much more mature and works very well for chat, but doesn't have a nice package that does all the Slack stuff--at least not that I'm aware of. I'd love to be proven wrong there. I know it can be done, but if it can't be deployed quickly by an already overstressed team member, what chance does it have?


The problem is that XMPP and Matrix are protocols, not products.

Element (the primary Matrix software) definitely has Slack and Discord in its sights.

I don't think there are any serious "self-hosted Slack-like" contenders that are XMPP-based right now. You can piece components together (yay, standards!) and I did exactly this for the IETF's XMPP deployment recently. But it's far from being a cohesive easy-to-deploy product. Simply because nobody is building that right now. It takes time and resources and there's no money in it.[1]

People who do set out to build Slack clones (projects like Mattermost and Rocket Chat) and earn money don't have features such as federation on their priority list and don't build on top of Matrix/XMPP. They roll their own custom protocols and as far as I can see they are fairly content with that decision.

[1] There's even less money in it, but nevertheless I am currently working on such a self-hostable "package" for XMPP. However, rather than focusing on the team chat use case (Slack/etc.) I'm focusing on personal messaging (WhatsApp/etc.): https://snikket.org/ if you're interested. It's possible I will broaden the scope one day.

EDIT: typo


It's largely overlooked that the success of Slack & MS Teams is partly due to the cybercrime portal that email has become. IOW, you don't get phished in your org's Slack chats. To prevent phishing, any chat service will suffice; an open protocol isn't necessary, as you don't intend to engage with ppl outside your org.

The essential problem IMO is how to replace SMTP. No one has proposed and implemented an alternative, to my knowledge. So I decided to[1]. The current draft omits federation (although I wouldn't rule it out in all cases yet).

[1] https://github.com/networkimprov/mnm/blob/master/Protocol.md


No, email has fundamentally bad UX for a lot of the use cases Slack and similar tools are used for.

> problem IMO is how to replace SMTP.

Sadly, SMTP is probably one of the parts of email that has aged best. Enforce the usage of some (currently optional by design) features around authentication and the like, at the cost of backwards compatibility, and you have all you need from the delivery protocol.

BUT:

- IMAP and similar are much worse.

- Mail bodies are a big mess; it's always fascinating to me that mail interoperability works at all in practice (again, you could clean it up a lot in theory, but backwards compatibility would be gone).

- DMARC, DKIM, and SPF, which handle mail authenticity, have a lot of rough corners and, again for backwards compatibility, are optional. It's not too hard to improve on, but it would break backwards compatibility.

The main reason mail still matters is its backwards compatibility, not just with older software but also with new software still using old patterns, because of the (relative to the gain) insane amount of work you need to put into all kinds of mail-related components. But then exactly that backwards compatibility is what stands in the way of fixing any of this.

(Yes, I have read the "Why TMTP?" link, and I have written software for many parts of the mail stack, including SMTP and mail encoding. The idea that SMTP is at the root of the problem seems very strange to me, especially given that, like I mentioned, literally every other part of mail is worse than SMTP by multiple degrees...)

EDIT: Just to prevent misunderstandings one core feature of mail is the separation of mail delivery and mail authenticity, in the sense that you don't need the mailman to prove the authenticity of a mail. At most the legal/correct/authentic delivery.


> No, EMail has fundamentally bad UX for a lot of use case slack and similar are used for.

The opposite is also true.


By "replace SMTP" I mean the whole email protocol stack, not only SMTP. I'm not proposing to replace it for all situations overnight; of course SMTP etc will be used for decades.

TMTP also covers most IMAP/POP use cases. And it allows short, plain-text messages (see Ping) to make first contact with others -- necessary when that server has less restrictive membership requirements.

Authenticity is a double-edged sword. For certain confidential content, you want the recipient to know that it originated with the sender, but you don't want anyone else to know that in the event the content is leaked or stolen.

I believe the extinction of email for person-to-person & app-to-person correspondence is a foregone conclusion, due principally to phishing. The question is what should we do now, and the answer is clearly not chatrooms (which are of course useful in certain circumstances).


Email is not a chat system, and chat systems are unsuitable for asynchronous long-form threadful discussions. There is some overlap, but combined they form a spectrum of communication modes so wide that it can't be covered by a single UI.


I would argue that email is not suitable for asynchronous long-form threadful discussions. The limitation that email has is that if you're not part of that conversation from the beginning, you'll have to piece it together from previous quoted material.

One email-like protocol that properly handles this is NNTP.


True regarding the late-comer aspect, although it is less of an issue when using mailing lists with an archive. In the past, when lacking an archive I also just asked another participant to send me the earlier discussion in mbox format, which was easily accomplished with the unix MUAs of the time.

Regarding the actual modes of discussion I was thinking of though, usenet and email are mostly the same.


> I was thinking of though, usenet and email are mostly the same.

For the most part, they are and many readers support both protocols (or at least they did in the past). The nice thing about NNTP is that it doesn't require maintaining a separate archive or having someone send you an mbox file to import. Just subscribing to the appropriate groups was sufficient (depending on the article retention policy).


I agree with both of you, and TMTP supports adding people to a thread after it starts (see PostNotify).


I'm not finding much about TMTP or postnotify with a search through Google. Could you link to some resources?


I've only just begun publicizing it, after getting the client & server implementations to a point where folks can evaluate them.

Protocol: https://github.com/networkimprov/mnm/blob/master/Protocol.md

Why TMTP? https://mnmnotmail.org/rationale.html

Follow: https://twitter.com/mnmnotmail


Why would you replace it? Would disabling all public unauthenticated submissions on your mail server not suffice? You can also prevent delivery to the outside world (and error out on submission so that users are notified) if you really like. The result will be your own private mail server.

And you can keep using all the normal MUA's on desktop and mobile.


Changing your SMTP server configuration that way would break things, so the question is whether to set up a new, company-internal SMTP server, and give your employees new addresses there. But that won't quickly stop the phishing, because your ppl still need to get email via the public network from clients and suppliers.

Setting up a new server isn't easy unless you hire an outside service provider, and if you're willing to do that, Slack et al offer a nicer UX than the well known email/webmail clients.

Orgs with sufficient IT resources commonly do run internal SMTP servers.


I meant that as a suggestion compared to designing a new protocol.


Yes I'm old enough to remember when organizations had email but it was internal-only. Probably less for security reasons at the time than that they simply didn't have an internet provider. There were also mainframe-based email systems that were internal to that network.


You're making some fundamental assumptions about federation that I think are completely wrong. Are you telling me that you never need to communicate with anyone outside of your organization? How do you intend to receive invoices? How will you communicate with outside vendors? Sorry, but you need some text-based way of communicating with people, and email is the best way; that's why it has survived so long despite being problematic. If you have internal, asynchronous chat, why would you need internal email?

Sorry my dude, but business runs on email. Saying let's get rid of it is as naïve as saying let's get rid of Excel. It's just not going to happen.


There are two ways to communicate with ppl outside your org without federation (this is covered on the website):

1) Set up a second TMTP service where customers and/or suppliers can create accounts, along with employees who need to interact with them.

2) Have some employees join a third-party service which is open to all involved in your field. There is a risk of phishing in this case, but:

. a) anyone you haven't previously contacted is limited to short, plain-text communications to you (see Ping in protocol), and

. b) such third-party services would typically charge a fee to members and impose a small cost per-ping, and

. c) you know that you're dealing with unknown entities with possibly malicious intent.

TMTP clients support active logins to accounts on any number of TMTP servers, just as browsers support multiple active connections to websites.


> To prevent phishing, any chat service will suffice; an open protocol isn't necessary, as you don't intend to engage with ppl outside your org.

The same could be accomplished with email if you only allow connections to the SMTP and IMAP server from within the corporate network. That is, nothing external can connect to those servers, which is fine if it's only used for internal communication.


I used to work in anti spam and we would call these FUSSPs.

https://www.rhyolite.com/anti-spam/you-might-be.html


TL;DR: Thousands of brilliant minds have tried to fix email for decades, and realized it can't be done!

Ahem, I'm trying to bury email, not save it -- not unlike Slack :-)


XMPP is supported by a large number of clients, but running a server and getting everyone on clients with comparable featuresets is a nightmare. It’s a cluster of disparate standards, and it’s overwhelming. I’m sure it’s doable if you have the time to invest, but it’s not straightforward if you’ve never done it before.

Matrix is pretty straightforward on the server side of things, but the client UX is invariably mediocre. Vector—the official client—exemplifies everything that is wrong with Electron apps. Slow, clunky, poor UI, poor platform integration. With the default home server, it can take seconds for a message to go through. At least it’s far more customizable than Slack; it has an option for everything, which, as a power user, I quite like.

I haven’t tried Mattermost, but it looks like some of the important features aren’t FOSS, at which point it’s just another Slack as far as I’m concerned. I’ll gladly pay for support, but for SSO? Meh, might as well stick with Slack; at least everyone and their dog knows how to use it. (This is, of course, an opinion that stems partially from ignorance; I haven’t actually tried Mattermost, and if I do, I might fall in love with it. But my time is limited, and I can only evaluate so many products in a day.)

Not that Slack is much better here: their threading system has so many UI/UX issues. Ever had a thread with hundreds of messages? For your own sanity, I hope you haven’t. Ever tried to send an image to a thread from iOS? It’s possible, but only by pasting the image into the text field; the normal attachment button isn’t available, and Share buttons in other apps can’t send to threads. And, of course, the recent uptime issues.


Element (formerly Riot/Vector), has improved loads over the years, and the default matrix.org average send time is around 100ms these days rather than multiple seconds: https://matrix.org/blog/2020/11/03/how-we-fixed-synapses-sca... has details. I suspect you (and the parent) may be running off stale data.

That said, Element could certainly use less RAM, irrespective of Electron - and http://hydrogen.element.io is our project to experiment with minimum-footprint Matrix clients (it uses ~100x less RAM than Element).


> it uses ~100x less RAM than Element

Wow – congrats!!

What have been the most important architectural decisions to achieve this?


Rather than storing state from the server in the JS heap, new state gets stored immediately in indexeddb transactionally and is pulled out strictly on demand. So, my account (which is admittedly large, with around 3000 rooms and 350K users visible) uses 1.4GB of JS heap on Element/Web, and 14MB on Hydrogen. It's also lightning fast, as you might expect given it's not having to wade around shuffling gigabytes of javascript heap around the place.
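
Not the actual Hydrogen code, but the same idea in a tiny Python/SQLite analogue: persist incoming state transactionally as it arrives and query only what the UI needs, on demand, instead of keeping everything in memory.

    import sqlite3

    db = sqlite3.connect("client-state.db")
    db.execute("CREATE TABLE IF NOT EXISTS events (room TEXT, body TEXT)")

    def on_sync(room: str, events: list[str]):
        # Write incoming state straight to disk in one transaction; keep nothing in RAM.
        with db:
            db.executemany("INSERT INTO events VALUES (?, ?)", [(room, e) for e in events])

    def render_room(room: str, limit: int = 20):
        # Pull only what the UI needs, strictly on demand.
        return db.execute(
            "SELECT body FROM events WHERE room = ? ORDER BY rowid DESC LIMIT ?", (room, limit)
        ).fetchall()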


I've wanted to try something like this (on a smaller scale), but haven't had time. It's good to hear of an implementation that reflects my expectations. How long did it take you to migrate over?


It's an entirely new codebase; probably the best way to visualise progress is to look at the contributor graphs at https://github.com/vector-im/hydrogen-web


Cool, thanks for sharing!


It has, and I’ve been using it since its early days. I still use it. It’s still terrible, just slightly less terrible. And, no, messages don’t consistently send in 100ms on the default home server; there are regularly disruptions that cause significant delays, sometimes as much as 10-20sec. That’s a big problem for a federated chat platform.

Edit 1: I want to love it; the design is everything I could ever hope for in a chat platform. I even tried to contribute to Vector, but it was such a mess that I eventually gave up.

Edit 2:

> That said, Element could certainly use less RAM, irrespective of Electron - and http://hydrogen.element.io is our project to experiment with minimum-footprint Matrix clients (it uses ~100x less RAM than Element).

I'm not sure why this is a priority. Techies complain about RAM usage a lot, but if we have to choose between performance+power and a small memory footprint, we're going to choose the former almost every time. Take Telegram, for example: they have a bunch of native clients that perform amazingly well, although they do gobble RAM. Most of my technical friends use it as their primary social platform. It's not without issues, but it's really hard to go from something like Telegram Desktop or the Swift-based macOS Telegram client to Vector. And those clients aren't made by large teams--most (all?) first-party Telegram clients are each maintained by a single developer, if I'm not mistaken.


It's weird that you're calling it Vector when it's now called Element and it was called Riot for years before that.


The constant rebranding and confusion over Matrix/Vector/Riot/Element is another point of pain for me. It’s incredibly difficult to communicate unambiguously about Matrix with people who haven’t been following it for years.

Does Element refer to the ecosystem as a whole, including EMS? The primary client? The core federation? It’s not obvious from a casual visit to element.io. I suppose if I said “Element web app,” that would be fairly clear, but I’m still in the habit of saying “Vector” from the days of Riot.


Everything related to the company formerly known as New Vector is now called Element. The company is Element, the official clients are called Element (with suffixes Web, Desktop, Android, iOS) and yes, EMS is Element Matrix Services. This rebranding was done specifically due to the confusion brought on by the many previous names. More info here: https://element.io/previously-riot

The protocol is still called Matrix.


Yes, I actually like the change—I think they finally got it right this time. (The Riot rebranding was a mess.) However, it’s still frustrating when trying to communicate with people who aren’t following Matrix-related news. In my circle of friends, “Vector” remains more widely understood than “Element Web,” so that’s what I’ve been using.

Anyway, my point stands: Element Web/Desktop feels fairly unresponsive compared to something like Telegram Desktop. It looks so much nicer now, the UI layout is great, and it’s far more powerful than Telegram—yet, I can’t help but feel like I’m swimming through molasses even when dealing with moderately-sized groups. Try clicking around on different groups rapidly; you’ll likely find that you have to wait several seconds for the UI to update.


>XMPP is supported by a large number of clients, but running a server and getting everyone on clients with comparable featuresets is a nightmare.

XMPP is, well, extensible. If things don't match in the clients then that particular feature just doesn't work. These days all the clients pretty much try to match the feature set of Conversations. That applies to the servers as well. There is a server tester for that:

* https://compliance.conversations.im/

So the tooling is pretty good these days.

The great thing about XMPP is that basic messaging always works. That stuff is just too simple not to.


I recently finally found someone to try Element with and the experience has been great so far. (Except for the need to go through Google's captcha at some point.)

It even sent a 20 MB MP4 like a champ, while Conversations sometimes chokes on not-that-high-resolution photographs...


I can only offer my own personal experience: Matrix has been working well for me for a couple years now. However, I probably have a more narrow use case than you're thinking of.

I run a small homeserver and use it to communicate with a group of about 20 friends. Most of them aren't "technical" people. We use it mostly for chatting and image/video sharing. We never use live calling (audio or video).

There have been a few bugs in the mobile apps, but for the most part, everything has been working fine.

The biggest issue is the UX. It's not as polished as the big players.


This is actually the use case I've been trying to get to for some time. Unfortunately, I need it to "just work" to get my non-techy friends interested, otherwise they'll go right back to Discord.

Like I said, it's close, I just don't think it's there yet.


I'd say it's almost in "just works" territory for everyone except the person who has to actually administer the homeserver (me). I absorb a lot of the complexity for my friends.

The only thing that's a little cumbersome is requiring them to enter a custom server URL when they register/log in for the first time.


For competing with Discord, it seems like it would benefit from a more robust free offering. Being able to create a free Discord server is great, and it is incredibly capable for most communities that don't need the fancy perks of Discord Nitro etc.

> The free Discord plan provides virtually all the core functionality of the platform with very few limitations. Free users get unlimited message history, screen sharing, unlimited server storage, up to eight users in a video call, and as many as 5,000 concurrent (i.e., online at the same time) users.

For a lot of small communities that aren't focused around commerce of any kind, Discord's free offering blows Element Matrix Services out of the water. It's a non-starter. If I could create a server with feature parity to Discord's free offering, I would jump on EMS in a heartbeat for any new community I create, and I'd start trying to recreate communities currently within Discord on EMS.

A very normal progression for Discord servers is that some niche sub-community wants to gather, so they create a free server; people join, all kinds of rich content gets posted and curated, there are great discussions, and then as it gets bigger, people running the community or people who want to support it will boost the server with Discord Nitro for additional features like more slots for custom emojis (I can't overstate how important this feature has been to Discord's success, even though it seems like minor window dressing).

That kind of model is what would justify a server starting to shell out money every month for EMS. I would note that Discord's pricing for this kind of level of community is tiered and not a per-user thing. You unlock more features based on how many users are paying for Nitro, going up a tier based on breakpoints of 2/15/30 Nitro Boosts per month. It doesn't cost more to have a tier 3 server if you gain more users. This is a big deal for fostering growth and unseating incumbent social networks (which is what Discord and Slack are).

Just some thoughts. I really want stuff like Element/Matrix to succeed!


Actually, that's the same with Slack; each Slack workspace has a unique URL too.


    and use it to communicate with a group of about 20
    friends. Most of them aren't "technical" people. 
I'm insanely curious about the human side of things here. How did you get them to buy into this idea in the first place? That sounds like quite an achievement.

The non-technical folks in my life generally struggle with paths of least resistance (iMessage, etc) and it's hard to imagine getting them onto some alternative platform/protocol.


It did take some persuading. I think the main reason I was able to pull it off was ironically because they're not that technical. I bet most of my friends don't even know what Slack or Discord is. That's not to say they're dumb or anything - they just don't spend as much time online as one would think.

Previously, we were mostly using group texts or Snapchat/Instagram to communicate, so the biggest selling point was the fact that we can share full quality pictures and videos between iOS and Android people.


This is awesome. I have always wanted to self-host a Matrix instance as well, but I imagine it's going to be very hard to convince them to move over, from Telegram. Is there a blog post that I can read about homeserver setup? I am keen on seeing how easy it was, and keen on seeing what level of technical and financial resources you had to invest to get going.


That's interesting. Thank you for sharing that!


For my part, I don't have buy in yet (Because I'm not convinced Matrix is ready) but I think I could get it. I have 7 or 8 friends who do not use Discord except to talk with me and a few other friends that I know can be convinced to at least start using Element next to Discord. Once I feel like my homeserver is in a state that I can invite these non-technical people in, I'll be in the same place.


You bring up a good point, however, which is that we _could_ use open source, non-centralized alternatives for many of the online products we consume, but we choose not to, and so we increasingly become slaves to corporations that actively seek to narrow our choices. Another example of this is the push from big sites like Reddit to use their apps rather than just use a browser - it’s not about functionality, it’s about destroying the free and open web.


> You bring up a good point, however, which is that we _could_ use open source, non-centralized alternatives for many of the online products we consume, but we choose not to, and so we increasingly become slaves to corporations that actively seek to narrow our choices.

That doesn't happen for no reason. The vast majority of open source products I've used have terrible usability. I simply don't want to use them. I don't want to be beholden to corporations and walled gardens, but for me, the existing alternatives are worse in too many ways.


Or, or... and bear with me here... or, packaged click-button solutions with paid (contractually obligated) dedicated product support is a better use of our short time, more often than not.


That only works if you need to use just Slack or whatever. The moment you have to use several of these annoying services at once and manage N different stupid client apps for Y different platforms (desktop/mobile), the lack of an open/shared protocol becomes a major issue. Let alone if you want to use them on emerging mobile OSes that are not a hellhole of data thievery.


>support is a better use of our short time, more often than not

Not when it's down.


Which open source solutions never go down?


Everything goes down. But huge, complicated distributed services shared by huge numbers of people, continuously updated and developed, and constantly trying to attract more users/load seem to go down more often than a simple service on a simple server.

No hard data though. My mail server only ever went down when I upgraded the server and didn't check that everything was still working right away, or similar maintenance induced incidents. It never went down by itself.

Such systems only ever go down unpredictably on HW issues, or when overloaded/out of resources. Neither is very likely, because you're not trying to grow your service in any sense similar to VC backed enterprises. Most of the time it has constant very low load and resource use. And you can simply stop introducing changes to the system if you need more stability for some time. (stop updating, for example)


The one solution with PLANNED downtime.


Facebook and other vendors killed XMPP; we live in a non-federated world in both enterprise and consumer messaging. Companies have no interest in changing this.


XMPP killed XMPP. It's just not very good. It doesn't work well between different clients and servers. The protocol is a horribly overcomplicated mess of overlapping, partially supported extensions for basic functionality. And it doesn't work at all with low-power mobile delivery. (It was invented before the iPhone.)

There might have been political reasons why google dropped XMPP, but it would also make sense as a purely technical decision.


> And it doesn't work at all with low-power mobile delivery. (It was invented before the iPhone.)

This is plain untrue. Yes it was invented a long time ago, but thanks to the extensibility it has evolved over time just as the way people use it has changed. This evolution is a healthy and necessary part of an open ecosystem.

I know it frustrates people that modern features don't work in stagnated clients such as Pidgin and Adium, but modern clients support all the things you would expect.

Servers and mobile clients have supported mobile-friendly traffic and connection optimisations for many many years now.

> There might have been political reasons why google dropped XMPP, but it would also make sense as a purely technical decision.

Google contributed extensions to XMPP, the same way they contribute to other internet standards. I think they were quite comfortable with this. The XMPP-based Google Talk was their longest-running messaging solution after all...


> And it doesn't work at all with low-power mobile delivery.

What makes you think so? If Conversations was draining my battery, I would have noticed by now; I'm pretty sure that Facebook Messenger is worse in this respect...


Maybe things have changed - certainly when I looked at it a few years ago (around the time that Google stopped supporting it) my understanding was that XMPP had no push notification support. The app on the phone had to either poll or explicitly hold open a TCP connection. (Which is problematic when the app is backgrounded.)

Has this been fixed in XMPP?


Yes, XMPP has had push notification support for years. It's the only way it can work on mobile these days.


> It was invented before the iphone

That's true for email also, so it's not evidence of anything lacking in XMPP. The plain fact is that Google killed it due to their greed.


I was recently forced to use Facebook Messenger (thank God it's almost over), and I'm hating it: it's slow on mobile, and even worse on PC, where it regularly makes my whole OS hang, requiring a reboot.

Scrolling back is atrociously slow, and it doesn't even seem to have a search feature!

I'd take XMPP alternatives like Conversations, Jitsi, or Pidgin any day! (And Element, of course.)


XMPP is hardly killed. There are tens of thousands of XMPP servers out there with over a hundred public servers. There are lots of client implementations. Even the really bad implementations manage basic messaging.


Spam killed xmpp


I remember using clients like Pidgin with all my accounts; it was a great experience. Now I need like 100 apps.


BS

The only thing "killed" XMPP was that proprietary made money, XMPP didn't.

Apart from that, it's alive and well. See Conversations for android, Prosody for server.


What are you talking about?? You need to confirm each other before communicating on XMPP; spam can't get through that!


I can hardly see those as alternatives to Slack. Maybe https://mattermost.com/ is what you were thinking about?


Matrix with Element (Riot) as the front-end is pretty close. It does what Slack does; it's just not very good. XMPP is arguable. It can be a Slack alternative if you stitch enough other servers on top of it. Personally, I don't think XMPP will ever be more than chat, but some of its adherents believe differently.

Mattermost is certainly not what I meant. That's just trading one Slack for another.


This thread is about a Slack outage, which you have no control over. Mattermost and similar software is self-hosted, which of course doesn't mean you're getting 100% uptime, but you have (more) control over it.


In practice self hosted usually translates to more downtime and slower performance when it works. Unless your org has more expertise running a chat service than Microsoft or slack, your self hosted alternative is always going to suck more.


Did you try self-hosting, and did it lead to more downtime and slower performance?

From my experience, when I self-host stuff it's a lot faster (more server resources) and I've never had any downtime (the server doesn't simply go down for no reason).


I haven't tried it personally. But my employer hosts an on-prem GitHub instance and it is just terrible. So much downtime, long waits before anybody gets around to repairing it, general performance issues, maintenance windows for upgrades, etc. Just a huge pain. I've seen this sort of problem with the old Exchange on-prem services too.


IRC may be out of fashion these days, but at least deploying a small IRC server for your own team is really not that much effort anymore, and it doesn't incur that much ongoing maintenance work either.


RocketChat works pretty well for simple team comms. I have no idea if it can do XMPP and/or Matrix.


I suggested RocketChat when the outage was announced and the HN community downvoted it quite heavily. I'm not sure why. [0]

We ended up making the switch and committed to Discord. We're now looking at Rocket.Chat as a backup in case Discord goes down. But Slack is now completely out of the picture for our team.

[0] https://news.ycombinator.com/item?id=25633047


Just curious - why not use Mattermost as a backup? (Disclosure: I work at Mattermost, but really just want to know what you think.)

I've advocated for an idea where Mattermost is used as a "bunker": hosted on a Raspberry Pi (or somewhere else), it acts as a digital fallback if your critical infrastructure (Slack, Teams, Exchange?) is compromised somehow.


Not OP. Good idea. I thought it was integrated into GitLab (on-premise Omnibus), but I still haven't really fiddled with it; I enabled something in the config file, but nothing happened.

I know it's a tough spot, but if it were usable from GitLab with zero config that would be great for fallback.


@pas: Mattermost is indeed integrated with GitLab Omnibus.

To enable Mattermost, you can add the Mattermost external URL in the config file, and run `sudo gitlab-ctl reconfigure`. I'm wondering if that's something you've tried? https://docs.gitlab.com/omnibus/gitlab-mattermost/#getting-s...


Thanks for your reply! I gave it another try, and it works now beautifully :o (Possibly last time I tried it, there was no LetsEncrypt integration and the external URL setup was more involved?)


Interesting! I never looked into it, but now I will!


My experience with RocketChat is that it works quite well on the surface, but after using it for some time, some very annoying bugs emerge:

* You get notifications for channels for which you have suppressed those notifications

* Some channels are marked as having new notifications when they don't

* Notifications for new messages in threads you are involved in are quite hard to find (horrible UX)

* Some UX choices are very confusing (you get a column of options related to notifications, and for some, the left option is the one leading to more notifications, for some the right option)

* There are some overlapping features that lead to inconsistent usage (channels vs. discussions vs. threads)

* Threads are hard to read, because follow-ups in threads are shown in a smaller font size. You cannot increase the font size at all in the desktop application

.... and so on.

Also, I tried to submit some bugs, but for that I'd need some information that only the admins who run this instance have, and in the end it was too much effort to get it all together, so I didn't even bother.


I agree, it's still got a long way to go. I'm saying it's still a perfectly viable alternative to Slack. Fast, simple, works. (At least this is my perception/impression. I wanted to evaluate Slack alternatives for some time, but haven't got the time for it yet. So I was surprised when I got an invite to one of our client's rocket.chat instance and things worked pretty well.)

I'm in a Slack workspace that is constantly notifying me about a thread, but I can't mark it as read. Maybe it has something to do with the free message limit. So the message is there, but cannot be accessed. Annoying as hell. I thought about submitting the bug to Slack, but then just let it go, and probably we'll just move to Signal or something.


I've heard lots of good things about Zulip - haven't tried it myself yet though.


Have you tried http://quill.chat/ ? Younger startup (invite-only) but very slick.


Looks like a great start. The TODO list covers most of the stuff I'd want before even suggesting to our company that we replace Slack.

But I signed up and will keep it on my radar as it matures.


Thanks for sharing! They definitely nailed the marketing page; I'll keep it in my list of products to follow up on :)


If the benefit we are looking for is better uptime, that will not happen. The main benefit is going to be knowing why the system is down, and the ETA for it being back up.


That takes care of the software and protocol side of things, true, but does it give more reliable and predictable uptime? That's the main thing here; while there are plenty of software alternatives to Slack, their product is not just the software but also the hardware, servers, and scaling. You can grow a Slack instance from 10 to >10K members without ever having to worry about your hardware, or how many hours your staff needs to spend on maintaining said hardware. And when there is inevitably downtime, you and your staff don't have to scramble to get it back up - with this outage, it's a shrug, it's down, it'll be back soon probably, I'm going to do some work or do something else. Extended toilet / lunch break.


> self-hosted

How often is Slack/Discord down? I mean, it's not perfect, but I really honestly don't think I could match their uptime by self-hosting, let alone staff more on-call rotations for something that's not a core product.

I very much prefer that for something that isn't core product, if it goes down I need to do exactly nothing for it to come back up, and that the engineers at Slack will be starting to work on it likely before I even realize it's down.


This is a tale SaaS vendors (which have a strong presence in online tech communities like HN, because they are software companies) sold very well, and it's probably true for many small startups, but for medium-sized companies managing their own platform for something like Slack is completely doable, and you will not have the big downtimes Slack has. Sure, you have to dedicate time and resources to it, and obviously it's not "core business", although a chat platform is a pretty important component in an online company.


I would be surprised if you couldn't match or exceed Slack's uptime running whatever alternative you want (IRC, Mattermost, RocketChat, etc.) on a random dedicated server.

Hardware is quite reliable these days. And updates can be scheduled to be at a convenient time for the team.


Yes, but what if you're taking a few days off to backpack in the wilderness with no signal while it goes down? Who deals with the downtime?


If you are the only technical person on your team then it's of course not ideal, and it would require some further thought into making things redundant. But even that is easy enough to do with IRC (set up two servers, link the IRC servers together, add a single DNS record that points to both servers - job done).

If there are other people on the team that have _some_ technical skills then they can fix it..

IRC lacks quite a few features compared to other solutions, but the reduced complexity does bring very low operational complexity.


IRC will be incredibly hard to use for non-technical people on your team. Mobile clients for IRC look like crap, and have horrible-looking ad bars. No integration with Google Drive, Github, or other things.

It's just not a business-friendly tool.

I'm an engineer and personally I'm fine with IRC, I'm just trying to be realistic here.


Is it really though? If I take a look at a random modern IRC desktop client - how is it more difficult to setup than say your email program? The amount of information needed on setup is about the same: server, username, password (in fact email can get a bit more confusing in big corporate email setups with differing imap and smtp servers, etc.)

Also there are plenty of modern web clients for IRC, such as https://thelounge.chat/ or https://kiwiirc.com/ (which is supposed to work on mobile too).


> your email program

Reality check: Most people don't use email programs anymore.

Also, how do you get IRC to sync all conversation data and history between your several desktops and phones? How do you send files, make calls, and thread conversations?


But clearly no one is saying that email is too hard to use and we should just use $something_else (or are they?).

And you are starting to move the goalposts here: first it was uptime, then it was operations, and now it's features...

And what about those web-based IRC solutions? They are even easier to use than Slack, and have combined history, file sharing, etc.


They are moving the goalposts because there are several and ultimately very many reasons why IRC won't work, they just didn't bother to think of all the reasons and list them at once.


Ultimately there is only one reason that matters: The person in charge of deciding what communication channel to use likes Slack/Teams/IRC/whatever.

Add to that the SaaS propaganda that hosting literally anything yourself is just too hard (it really isn't). Or this notion people are just too stupid to deal with anything more than the simplest possible web interface - Really? what do those people even do? Stare at Notepad all day? Of course not. They stare at various complicated software packages ranging from CAD, $spreadsheet abominations, SAP to various Adobe software packages. Sprinkle in a bit of hype for the latest new thing and presto.. </rant>


> Reality check: Most people don't use email programs anymore.

Guess you are not in enterprise.


I'd bet hard money that, to within epsilon, anyone using a desktop email client in 2020 - and thus having one to set up in the first place - is in an organization with access to Microsoft Teams.


Who deals with the downtime if any other on-premises system goes down?

If you are running networks and software on site, and they are business-critical, you have people and a plan for this. Or you don't, and suffer the consequences.


There will always be more downtime on Slack/Discord. There are more users, more updates. Slack/Discord is a giant distributed system with nodes all around the world. An IRC/XMPP server on one machine that 100 people use is not going to crash unless someone crashes it intentionally.


People really overestimate the difficulty of running self-hosted systems with great uptime.

When self-hosting you can get away with simpler systems that end up being more stable and having higher uptime for lower effort.

The reason you see cloud providers having issues is not because the thing is difficult, but because doing anything at huge scale ends up being difficult.


> it's not perfect, but I really honestly don't think I could match their uptime by self-hosting

This is such a common misconception. The services I self-host were configured by me; if anything goes down (which very rarely happens), I know the exact cause and have it fixed in minutes. When some company's cloud service goes down I'm completely at their mercy. I also spend very little time on maintaining these services, just security updates, which are mostly automated.

Bottom line, maintaining and self-hosting services that have one or a few users is much less complex than running services with millions of users. Hence, my uptime is better than Google's, Amazon's, Azure's, etc.


Yep, I've stopped recommending Matrix because

1. There is virtually zero user-facing documentation. Need to know how to backup keys, verify another user, or what E2EE means? Ask your server operator. Basically the onus is on operators to document this stuff for their users. Except the stuff we're documenting is hard even for server operators, and especially challenging to document in a way that both nontechnical and technical users can understand.

2. Because this stuff is challenging even for more technically minded users to understand, it leads to a kind of burnout for interested non-technical users: they learn all they can about some feature and how it works at a high level from out-of-date random blogs, try to use the (complex, multi-step) feature, but then something won't work, and it isn't clear whether it was because the user did something wrong or because the clients or server implementations are broken

3. Issues where core functionality is broken (e.g. two mutually verified users on my homeserver haven't been able to talk to each other in months -- see [1], [2], [3]) languish for months with zero response from maintainers.

4. While core functionality is both broken and undocumented, the maintainers announce rabbit hole features that no one asked for and seem very much like distractions, like their recently-announced microblogging view/client[4]

In short the Element maintainers have shown little interest in making the platform accessible to the people who need its differentiating features the most, and have prioritized the "mad science"/technical aspect of their platform at the expense of the human element (end-users and operators).

It'd be cool if Element used their resources to hire some UX folks and community advocates whose sole focus is addressing the horrid accessibility of their platform. I think most users would rather see that than further "mad science".

[1] https://github.com/vector-im/element-ios/issues/3762

[2] https://github.com/vector-im/element-ios/issues/3572

[3] https://github.com/vector-im/element-ios/issues/3393

[4] https://matrix.org/blog/2020/12/18/introducing-cerulean


Have you submitted the requested bug reports?

Also, it seems the FAQ answers several of your points: https://element.io/help


Yep, I have, although funnily enough it turns out that the rage shake feature was the only way to submit a bug report with diagnostics from a client (as of a couple of months ago anyway) and that feature itself was broken for one of my users (who has since churned).

That FAQ is a great start, but it's not sufficient for non-technical users. It's not easily searchable, it doesn't provide screenshots, and it doesn't go into enough detail for each item (e.g. describing what can go wrong + troubleshooting).

Thanks for pointing it out though


This is a disturbingly good summary. I remember Matrix being presented as less bloated compared to XMPP... sure.


It's trivial and essentially free (<$5) to run an IRC server supporting tens of thousands of users.

It also doesn't take gigs of memory on client devices.


There is also Mattermost which is literally like Slack, but self-hosted.


Remember when Facebook messenger was xmpp based? Lol.


Also Google Talk.


I am genuinely surprised that Slack wasn't ready for people to come back from holiday, to view increased queues of unread messages, to have to manually log in vs. having auth tokens or cookies, etc. Either that, or they had a cosmically coincidental outage on a really bad Monday to have it.

It's bad enough that team comms go over Slack so much now; at least we have an email fallback. What scares me is the teams that use Slack for system alerting.


My coworker's theory was that someone was waiting for the end of the holidays to deploy something risky.

And I'm in that boat of depending on Slack for alerting... in fact my team was also waiting over the holidays to deploy more robust non-Slack-based alerting (in our defense the product is only a few months old and only now starting to scale to any real volume).


I wouldn't be surprised if it's actually a combination of a new feature being recently rolled out, along with the sudden spike in load this morning.

The holidays are actually the perfect time for Slack to roll out a risky deployment, as it has to be their lowest usage time. So it would make sense if something was pushed out last week or the week before. And everything probably seemed fine.

And then this morning they suddenly realize this new feature does not perform under load. And to make matters worse, the new feature has been out long enough to make any sort of rollback very tricky, if not impossible. Which means they'd need engineers to desperately hack out, test and deploy a code fix.

If this is the scenario, I do not envy them at all.


Holidays are a good time for a company to do a risky deployment, but a bad time for an individual employee to do a risky deployment, assuming one doesn't want to work overtime over the holiday fixing things.


Depends on how well compensated holiday overtime is. There are some employees happy to work overtime if their hourly pay is doubled or tripled. However, there are also those who wouldn't do that for any price.


Depends on how badly it goes wrong. My org is a 24/7 one, but one Christmas back in the 90s (way before my time) some work was done on Christmas Eve, I think on the phone system, in the days before widespread mobile phones.

It broke, which was a major problem; it meant that senior management were being phoned (ho), and relatively senior middle managers were on site to deal with the fallout. Of course most suppliers were also closed, so everything was harder to fix.

There are good reasons not to make changes when places are closed, or at least running on a skeleton staff, for two weeks.


This depends on how easy or difficult the rollback strategy is.


Not a bad theory.

I used to work for a place that had a FY that ended in summer. We had a lot fewer problems with stuff being shoveled out the door at Thanksgiving and Christmas because nobody was trying to finish their year-end performance goals over the holidays.

I think what I'm implying is that management creates this issue, but we are complicit.


Yeah, I think it's this rather than load. Slack should be able to handle load fine (probably), but since this is the first weekday post-holidays I imagine some deployment broke something.


Slack has been in business for several years and has survived several December-to-January transitions, including many people stopping use of the product before Christmas and then returning in early January.

It seems a bit presumptuous to assume that's at fault here, given their age.


Does it? Don't you think their users might be leaning on it more heavily this year due to working from home?


The two cliched sources of this problem are 1) someone pushed something out over the holidays that could have waited until January, or 2) peak capacity was negatively affected since the last time a spike happened, nobody had a way to monitor it, and so this has been broken since the end of May. On further reflection, someone will admit that they noticed a notch-up in response times and did not connect the dots.


It might have been the increased usage due to the pandemic (since it didn't happen from 2019 to 2020) + the sudden inflow of people at the same time.


This would be a good explanation for an outage in March or April 2020, not so much in January 2021.


There may well have been a bigger delta in usage from Sunday Jan 3 to Monday Jan 4 2021, than for any particular pair of days in March-April 2020.

Of course 2020 saw an increase, but it was smeared over a week or a month rather than being a big jump in a single day after everyone's holidays.


We get our primary alerts through Slack. However we also have SMS and phone call backups through PagerDuty


(from Opsgenie) I would imagine it would be the other way around for most people.


We have an alert channel in Slack, but it's mostly ignored. Our primary alerts come via SMS/VictorOps.

At one of my old jobs, we had SMS via two physical/hardware devices in our data center. One had a Telstra SIM card and the other had an Optus SIM card. (They were plugged into the same machine, but we had plans to put a second one in another data center before I left).

If you really care about alerts, you should have physical hardware sending your SMS messages via two different points of presence.
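
For anyone wondering what "physical hardware doing your SMS messages" looks like in practice: a minimal sketch using pyserial and the standard GSM AT commands, where the device path and phone number are placeholders for whatever your modem and on-call roster actually use:

  import serial  # pyserial

  # Hypothetical USB GSM modem with its own SIM card.
  modem = serial.Serial("/dev/ttyUSB0", 115200, timeout=5)
  modem.write(b"AT+CMGF=1\r")               # put the modem in SMS text mode
  modem.read(64)                            # consume the "OK" response
  modem.write(b'AT+CMGS="+15555550100"\r')  # address the on-call number
  modem.read(64)                            # wait for the "> " prompt
  modem.write(b"ALERT: primary alerting path is down\x1a")  # Ctrl-Z (0x1A) submits the message

The point of a second device on a different carrier is simply to run the same code against a second serial port when the first send fails.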


Now is a good time to recommend to your engineering org that they should have multiple alerting methods, e.g. Slack plus Pagerduty, or Slack plus email.

Hopefully email won't be your backup. I've seen that done. Alerts get filtered and ignored, often by accident.
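
A rough sketch of what "multiple alerting methods" can look like in code, assuming a Slack incoming webhook plus the PagerDuty Events API v2 (the webhook URL and routing key below are placeholders); the important part is that each channel is attempted independently, so one provider being down doesn't block the other:

  import requests

  SLACK_WEBHOOK = "https://hooks.slack.com/services/T000/B000/XXXXXXXX"  # placeholder
  PD_ROUTING_KEY = "YOUR_PAGERDUTY_ROUTING_KEY"                          # placeholder

  def alert(summary: str) -> None:
      # Best-effort delivery to Slack; never let a failure here stop the page.
      try:
          requests.post(SLACK_WEBHOOK, json={"text": summary}, timeout=5)
      except requests.RequestException:
          pass
      # Independent delivery to PagerDuty.
      try:
          requests.post(
              "https://events.pagerduty.com/v2/enqueue",
              json={
                  "routing_key": PD_ROUTING_KEY,
                  "event_action": "trigger",
                  "payload": {"summary": summary, "source": "monitoring", "severity": "critical"},
              },
              timeout=5,
          )
      except requests.RequestException:
          pass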


Do you know that was the root cause or are you making an assumption and running with it?


Having been in this situation before, with a totally-down-and-not-coming-back-up outage of a payments system, I really feel for their incident response team.

I'll take this moment to remind everyone of their human tendency to read meaning into random events. There's no evidence to suggest New Year traffic has caused this, and outages like this can happen in spite of professional and competent preparation.

Hugops for their team, I hope they get it back soon.


> I'll take this moment to remind everyone of their human tendency to read meaning into random events. There's no evidence to suggest New Year traffic has caused this, and outages like this can happen in spite of professional and competent preparation.

On the one hand, sure we don't specifically know what's going on. On the other hand, it's the first Monday in the new year and they went down shortly after the start of the business day Eastern time; it could be coincidence, but it would be a remarkable coincidence.


There are a load of ways NY might have contributed to this, but it may not be a direct cause. What's more likely, Slack forgetting to scale their deployment back up after too much mulled wine, or a number of people on holiday meaning a simple failure has developed into something more serious?

It could be anything really- my post was more about how situations like this can happen to even the most prepared. The assumption it has something to do with NY tends to assume very trivial, silly mistakes. Especially with no information, that seems a bit uncharitable.


It seemingly worked ok in UTC-2 in the morning and early afternoon, then started having issues and is now a bit intermittent (or fixed, there's not much traffic on my channels, as it's evening already). Do they have that much more traffic on US east coast than in Europe?


Probably, but it was only 2-3pm UK time when it started falling over so there would be all the Europe traffic plus the East Coast traffic starting to sign in.


Another likely scenario is that they had a risky deployment that they waited to push until after the holidays.


At GitLab our fallback from Slack is Zoom https://about.gitlab.com/handbook/communication/#emergency-c...

I'm posting this because I found a lot of people don't know that Zoom includes a complete chat client that includes channels.

And #HugOps to the engineers at Slack working on this. I appreciate that they posted a periodic update even when there was no news to report: "There are no changes to report as of yet. We're still all hands on deck and continuing to dig in on our side. We'll continue to share updates every 30 minutes until the incident has been downgraded."


I'm quite surprised that you don't use Mattermost as a 'Slack fallback' at GitLab.


Indeed, if you champion FOSS, why would you recommend a proprietary piece of software as fallback?


My point was really that it's GitLab's own product. "GitLab Mattermost" [https://docs.gitlab.com/omnibus/gitlab-mattermost/]

I'm amazed they use Slack at all. Let alone as a fallback.


No I'm pretty sure that's just a sort of 'integration', Mattermost shipped with GitLab?

https://about.gitlab.com/blog/2015/08/18/gitlab-loves-matter...

> Like many companies in the last year we've switched to using Slack to improve internal communication. [...] Since Slack doesn't offer an on-premises version, we searched for other options. We found Mattermost to be the leading open source Slack-alternative and suggested a collaboration to the Mattermost team.

I'm not really sure why it's 'GitLab Mattermost' and not (at your link) 'GitLab Nginx' et al. though.


Ah I see now, having read more of the history. Calling it that seems pretty misleading/odd.

(We use GitLab and Mattermost (integration) where I work. I've been 'remote / WFH' for the past 7 years.)


I agree it can be improved and created https://gitlab.com/gitlab-org/omnibus-gitlab/-/merge_request... to do so.


They posted a giant list of the services they use recently.

They use a ton of services.

Likely you don't want your backup to be one of your own systems, and another part of the company probably uses Zoom already, so it is easy to fail over to that.


Here is the list of services that we use https://about.gitlab.com/handbook/business-ops/tech-stack/

This includes many proprietary ones, we generally choose the product that will work best for us, considering the benefits of open source, but not excluding proprietary software.

Mattermost is not part of the single application that GitLab is. There is a good integration with GitLab, and our Omnibus installer allows you to easily install it. But it is a separate application from a separate company.


Or use mattermost with Slack as the fallback


Maybe my DevOps folks should not be privy to all internal communications?

That is one reason we did not go with Mattermost.


If you do not trust your own devops, why are you trusting someone else's devops?


It's just due diligence. Think of what you have access to if you have "god mode" on corporate chat: HR, the CFO's DMs, private messages between other coworkers, and so on. Most won't fall for this temptation, but even those with strong anti-spying morals can be weakened by circumstances. Best to remove the temptation by design.


Because someone else's devops can't use it against you institutionally. Nor are they going to insist on having an opinion on things that they're unaffected by.

This isn't a slam at devops, it's about the need for institutional information hiding; not everyone needs to know about and weigh in on every decision being made.


We structure our company similarly. With effort DevOps is god on everything except HR, Sales, Finance, Chat, and C-level management which are operated with 3rd party services controlled by the individual departments and "owned/managed" by the C-suite.


Low maturity risk management functions.


Slack won’t protect you from this as it’s possible for admins to export even private DMs.

https://www.nbcnews.com/better/business/slack-updates-privac...


Only Slack Workspace Owners can export, not Slack Admins.


You also have to be on the ‘Plus’ plan, otherwise it is a roach motel.


Devops should have nothing to do with your chat server. It should be your IT department, just as with the email server.


DevOps at a lot of small companies also manage the internal IT stack and sometimes even take on most of the IT duties. Once you get larger you start having "IT" as something separate from DevOps but with the actual infrastructure managed by operations. Once you're really big the teams are truly separate and IT owns their own infra.


I'm guessing because they don't want the support burden on a rarely used but necessary fallback solution vs. something plug and play.

This is the reason these "closed" ecosystem apps like Slack/Zoom are multi-billion-dollar companies and have massive uptake. Simple and easy to use.


As someone who has to use Zoom Chat to interact with a client on a daily basis, please, do not recommend Zoom Chat to anyone except as an example of how not to do chat software.

--

Though, I do agree wholeheartedly with your sentiment that the Slack team needs all the positive vibes they can get right now.


As someone who has to use Zoom Chat every day, this a thousand times. (We still run an XMPP server on the side just to avoid the horror that is Zoom chat.)


Yeah, this is baffling. I'd rather run turn of the century ICQ than Zoom chat...


I continue to be impressed by GitLab's operations and documentation! While others may have similar backup plans, as an outsider it feels like GitLab's handbook is cooler, if only because they publish and make public their practices and processes. I'll caveat that I'm not really a fan of Zoom/Slack/Hangouts (I'm an unashamed fanboy of Matrix and its numerous clients), but GitLab's approach is still really neat! Kudos to GitLab!


Aren't you worried about so many security vulnerabilities found in Zoom?


Unpopular opinion, but WebEx beats the pants off Zoom. Of course, it's neither free nor open. But it does support strong end to end encryption and authentication and has regulatory compliance to a bunch of things, if that's important to you. I get that there is WebEx hate because "enterprise" etc, but we use it around here and it works quite well.


+1, both are a PITA but Webex at least has a really good web client.


Nice tip! Zoom chat is cool (although chats without gifs are way too productive).


Oh wow -- after years of using Zoom I definitely did not know about this. Thank you for pointing it out!


Is this accessible from the web? Can't seem to find it with this Chromebook.


At Papa we use Discord as backup.


Why not Mattermost or Flock?


Just a reminder that it's probably not a wise idea for anyone to get further in bed with Zoom than they already are.

https://www.washingtonpost.com/technology/2020/12/18/zoom-he...


"Business takes the easy and ethically questionable route to continue making money" news at 11.

I'm not condoning Zoom's actions but this is hardly a problem unique to Zoom. Few if any businesses will stand up for consumers and citizens unless it's directly aligned with their profit motive. In this case, the business choice is to operate or not in mainland China. If they choose to stand up against the Chinese government they're going to have difficulty continuing to operate in China and risk losing that entire market.

Google played this PR game many years ago in China (rejecting some of the governmental policies) and ultimately caved to Chinese policies to do business there.

Businesses are not the organizations we should look to for empowering people; that's simply not their goal, no matter how much their marketing team may want to sell that idea by following trending (popular) social movements that they've already done market studies on to assess the potential fallout.


I think it's a pretty bold claim to state that Zoom's actions aren't unique.

What other business in this space has given China unfettered access to US users and data? I'm not aware of it occurring with Webex, Teams or go2meeting. The "one rogue employee" thing falls flat pretty quickly when they're the only ones that had this issue.

This feels like their encryption thing all over again, there's an "oversight" that is equivalent to a backdoor that only gets fixed when they get caught.


I didn't realize they shared any user data outside China (misread the WP portion). It appears they did share 10 users' data which is a bit questionable but I'd hardly call that unfettered access to US data.

The fact is, all of the US businesses operating in China give the Chinese government surveillance ability over their Chinese users, and, being primarily based outside of China, are operating in an ethically questionable space, at least in my opinion.

It's really not too different than the businesses sharing US citizen data to the US government, much of which Snowden and others before him exposed. I suspect there's a lot more surveillance going on everywhere than the general public know about and the businesses best positioned to do the surveillance are probably doing it.


And yet a surprising number of firms with sensitive info continue to use it. Law firms etc


Is this an attempt to refute the claim using Zoom is bad, or an indictment against those still using it?


an indictment that so many people who should know better, still use a tainted and non-benign product.


Elaine Chao’s sister is married to Xi, while Elaine, as transportation secretary under Trump, was busted inviting family with business ties to the CCP to official US government meetings.

The fear on this forum is imagined political thriller more than realistic.

Every technologist is grifting off the military industrial complex.


The lesser of two evils and the product just works. They might have a few governance issues they need to fix. But at the end of the day, they signed a BAA with us and will take the liability and fallout of a breach.


Imagine if they did that for the US government, which is easier to compel since they are in US territory.


One nation is currently operating concentration camps and arrests and seizes the property of prominent citizens who criticize the government. Are you sure that's an equivalence you want to draw?


Like Guantanamo Bay or prosecution of Assange for his journalistic work to expose wrongdoing of government? Or maybe you’re talking about for-profit prison system and mass incarceration practices? But you’re probably talking about China, right?


Once again, that is a false equivalence.

No one imprisoned in Guantanamo Bay is a US Citizen and neither is Assange.

The US prison system is super fucked up but it is not the same as ethnic cleansing.

You are comparing apples to concentration camps.


> No one imprisoned in Guantanamo Bay is a US Citizen and neither is Assange.

I think you should know that I -- and probably others -- are reading this as "b-b-but, they're not US Citizens, so they don't deserve [the same] rights"

I hope that's not what you mean, because if it is, that's really fucked up.


That's exactly how I read it. And that's probably the same position of lots of Americans, which in and of itself is quite fucked up.


We have thousands of brown people in camps along the border, in brutal conditions, without access to healthcare(unless you count forced sterilizations as healthcare). Do you consider those to be apples as well?


That forced sterilization claim was entirely debunked and was misleading to start with:

https://www.channel4.com/news/factcheck/factcheck-were-mass-...

https://www.snopes.com/ap/2020/09/18/more-migrant-women-say-...

And 70% of those people in those camps are released within 30 days, often times within one week back to their country of origin (or given asylum).


Why are they in camps along the border? Why are the Uighur? Did the "brown people" break any laws? Did the Uighurs?

Are the "brown people" in camps along the border a single, ethnic minority? Are all "brown people" in the country subject to arrest and under surveillance just for being "brown"?


Well, yeah. People _are_ subject to arrest and surveillance for being brown/black in the US.


Not really more surveillance than anyone else. And the discrimination and mistreatment for people in the US is bad but nothing compared to camps.


No they aren't.


> No one imprisoned in Guantanamo Bay is a US Citizen and neither is Assange.

That's a glib retort.

A takeaway from your position is that it's ok so long as you do it to citizens of other countries.

> it is not the same as ethnic cleansing.

See the above.

That's always been the difference between the US and China and why so many countries have hatred for us and yet little to none for China. They don't fuck with other countries on the level that we do.


yes, I remember when I got my trump kidney from a poor anti-fascist liberal. /s

America is fucked up, that doesn't mean that other countries aren't also fucked up or aren't doing worse things with the data they collect.


Yea, but you live here and so you should think about the implications of this for yourself and your countrymen and not through the lens of international competition. That is a distraction.

Essentially, the China case proved Zoom is willing to cooperate with a nation state. The US is the nation state we live in, Zoom is HQ'd here. Therefore, the risk to us is high.

As an aside, the organ harvesting idea comes from Falun Gong, who are similar to Chinese Scientologists. It is not clear to me that their claims are accurate.


Yes. The Chinese state and the US state are both proven to spy on their citizens. For reference, see the heroic Edward Snowden's 2013 leaks.


What an overwrought headline, the employee in question has already been fired.


Sorry, but an executive is not just "an employee" and any alarms are rightfully justified. Took a little bit of cajoling in my company but we've successfully moved to self-hosted tools for the most part (Jitsi and Rocket.chat) with just a couple of projects with outside contractors using Slack.


It's weird that you describe the headline as "overwrought" and call the person an "employee" when the headline is more accurate than you.

This was an executive, not just an employee. That's a huge distinction and I can't help but think you intentionally downgraded his position to cover-up his behavior. "Just an employee" "Not a big deal"

But when you read the allegations, they seem like a very big deal that an executive was spying on users, giving their information to the Chinese government explicitly for oppressive purposes, including folks who are not in China, and went out of his way to personally censor non-Chinese groups meeting to discuss the Massacre-Which-Cannot-Be-Mentioned.

I would say the headline understates the gravity (it's very much a 'by-the-books' headline that you KNOW went through ten levels of Legal), and that your hand waving here feels much more dishonest than the headline.


Regardless of intent, it's undeniable that at some point there were insufficient controls to prevent this executive, or any executive in the future, from gaining this level of surveillance access.

And it's also undeniable that the consequences for Zoom (really, just needing to fire a few people, and not even the people who designed those controls if there were any) are so minimal that they have no incentive to strengthen those controls.

For some organizations (mine included) the benefits of Zoom outweigh the risks of Zoom having proven itself to not have those controls, namely the possibility of both political and corporate espionage. As with all things, YMMV.


Not only that, but this line stuck out to me.

> and other employees have been placed on administrative leave until the investigation is complete.

Zoom at least suspects he did not act alone.


It was an executive purposefully brought in for legal compliance with that country's requirements. That he was fired is a huge signal of how seriously aggressive Zoom is about protecting data - that they would even be willing to go up against national governments. I feel like the firing is a huge part of the story.


The optics are still very, very bad for Zoom. I have zero trust in them.


There are remarkably few organisations I somewhat trust (even then on a sliding scale), but on that spectrum Zoom sits at the "wouldn't touch them with someone else's bargepole" end.


The company in question is still operating. We don't know if the employee was just a scapegoat.


Please reconsider using or supporting Zoom in any way. https://www.nytimes.com/2020/06/11/technology/zoom-china-tia...


While Slack is down, let's remind ourselves that it is not the end of the world. To their ops team, good luck in sorting out the root cause(s), to mitigating their re-occurrence, and to emerging the other side a stronger team. You've got this.


Hopefully they have a backup system for internal comms


It is not the end of the world if you are just using Slack for intra-team communications.

However, a lot of the monitoring alerts that go through Slack, and other automatic notifications, are critical for many teams.


Critical systems that depend on a third-party system are only as reliable as that third-party system; so this may be a good wake-up call for many.


I have always had this fantasy of imagining what happens when one of these major services goes down and never comes back online, i.e. in this outage Slack loses the info of all the accounts, users, messages, etc.

How would people react? What would engineers do to recover? I always found that idea fascinating.

Imagine Google saying tomorrow that they lost all the accounts and emails. What kind of impact would that have on the world?


That scenario is what Disaster Recovery plans are for. Every large company I've worked for has had recovery plans in place, including scenarios as disturbing as "All data centers and offices explode simultaneously, and all staff who know how it all works are killed in the blasts."

You not only have backups in place, you have documentation in place, including a back-up vendor who has copies of the documentation and can staff up workers to get it up and running again without any help from existing staff.

And we tested those scenarios. I'm not sure which dry runs were less fun - when you got paged at 3 AM to go to the DR site and restore the entire infrastructure from scratch... or when you got paged at 3 AM and were instructed to stay home and not communicate with anyone for 24 hours to prove it can be done with out you. (OK, so staying home was definitely more fun, but disturbing.)


This scenario isn't as far-fetched as people think. I was running a global deployment in 2012 when hurricane Sandy hit the east coast. The entire eastern seaboard went offline and was off for several days. Some data centers were down for weeks. Our plan had covered that contingency and we failed all of our US traffic to the two west coast regions of Amazon. Our downtime on the east coast was around two minutes. Yet a sister company had only one data center in downtown New York, and they were offline for weeks, scrambling to get a backup loaded and online.


I worked for a regional company in the oil and gas industry and the HQ and both datacenters were in the same earthquake zone. A twice per century earthquake had a real risk of taking down both DCs and the HQ. The plan would have been for every gas station in the vertical to switch to a contingency plan distributing critical emergency supplies and selling non-essential supplies using off-grid procedures.


Those are some really good thoughts on DR planning. I had never thought of DR being taken to such an extent.

How many companies really plan for an event where their entire infrastructure goes offline and their entire team gets killed? Do even companies like Google plan for this kind of event?


> How many companies really plan for an event where their entire infrastructure goes offline and their entire team gets killed?

Since 9/11, more than you might think. For example Empire Blue Cross Blue Shield [1] had its HQ in the WTC.

[1] https://www.computerworld.com/article/2585046/empire-blue-cross-it-group-undaunted-by-wtc-attack--anthrax-scare.html


Fixed link: https://www.computerworld.com/article/2585046/empire-blue-cr...

And what a blast from the past:

> Some of the temporary locations, such as the W Hotel, required significant upgrades to their network infrastructure, Klepper said. "We're running a Gigabit Ethernet now here in the W Hotel,'' Klepper said, with a network connected to four T1 (1.54M bit/sec) circuits. That network supports the code development for a Web-based interface to the company's systems, which Klepper called "critical" to Empire's efforts to serve its customers. Despite the lost time and the lost code in the collapse of the World Trade Center towers, Klepper said, "we're going to get this done by the end of the year."

> Shevin Conway, Empire's chief technology officer, said that while the company lost about "10 days' worth" of source code, the entire object-oriented executable code survived, as it had been electronically transferred to the Staten Island data center.


The two I've worked for that took it that far were a Federal bank, and an energy company. I have no idea how far Google or other large software companies take their plans.

But based on my experience, the initial recovery planning is the hard part. The documentation to tell a new team how to do it isn't so painful once the base plan exists, although you do need to think ahead to make sure somebody at your back-up vendor has an account with enough access to set up all the other accounts that will need to be created, including authorization to spend money to make it happen.


The last company I worked for where I was (de facto) in charge of IT (small company, so I wore lots of hats) could have recovered even if both sites burnt down and I got hit by a bus, since I made sure that all code, data and instructions to re-up everything existed off site, and that both of the most senior managers understood how to access it all - enough to hand it to a competent firm with a memory stick and a password.

In some ways losing your ERP and its backups would be harder to recover from than both sites burning down; insurance would cover that, at least.


Yes, Google plans extensively and runs regular drills.

It's hearsay, but I was once told that achieving "black start" capability was a program that took many years and about a billion dollars. But they (probably) have it now.


"black start" for GCP would be something to see. Since the global root keys for Cloud KMS are kept on physical encrypted keys locked safes, accessible to only a few core personnel, that would be interesting, akin to a missile silo launch.


It would be amazing to see. But I hope we never have to.


So 'black start' is a program to start over from scratch? The scale required for it itself would be amazing.


"Black start" is a term that refers to bringing up services when literally everything is down.

It's most often referred to in the electricity sector, where bringing power up after a major regional blackout (think 2003 NE blackout) is extremely nontrivial, since the normal steps to turn on a power plant usually requires power: for example, operating valves in a hydro plant or blowers in a coal/gas/oil plant, synchronizing your generation with grid frequency, having something to consume the power; even operating the relays and circuit breakers to connect to the grid may require grid power.

The idea here is presumably that Google services have so many mutual dependencies that if everything were to go down, restarting would be nontrivial because every service would be blocked on starting up due to some other service not being available.
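
A toy way to see the "everything is blocked on something else" problem: model the services as a dependency graph and ask for a start order. The service names and edges below are made up purely for illustration:

  from graphlib import TopologicalSorter, CycleError

  # key -> set of services that must be up before it can start
  deps = {
      "auth":    {"storage", "config"},
      "storage": {"config"},
      "config":  {"auth"},   # circular: config needs auth, auth needs config
  }

  try:
      print(list(TopologicalSorter(deps).static_order()))
  except CycleError as e:
      print("no cold-start order exists; a bootstrap path must break the cycle:", e.args[1])

Black start planning is, in effect, deciding in advance where that cycle gets broken.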


I work for a bank. We have to do a full DR test for our regulator every six months. That means failing all real production systems and running customer workloads in DR, for realsies, twice a year. We also have to do periodic financial stress tests - things like "$OTHER_BANK collapsed. What do you do?" - and be able to demonstrate what we'll do if our vendors choose to sever links with us or go out of business.

It's pretty much part of the basic day-to-day life in some industries.


The company I work for plans for that and it's definitely not FAANG. In fact, DR planning and testing is far more important than stuff like continuous integration, build pipelines, etc.


> Every large company I've worked for has had recovery plans in place, including scenarios as disturbing as "All data centers and offices explode simultaneously, and all staff who know how it all works are killed in the blasts."

I sat in on a DR test where the moment one of the Auckland based ops team tried asking the Wellington lead, the boss stepped in and said "Wellington has been levelled by an earthquake. Everyone is dead or trying to get back to their family. They will not be helping you during the exercise."


This reminds me of what happened to the financial services firm Cantor Fitzgerald after 9/11, except instead of a lost system it was hundreds of lost employees:

https://www.nytimes.com/2014/11/19/magazine/the-secret-life-...


I was at CF (at new offices, obviously) briefly a couple weeks after 9/11.

They had backups and were able to recover data and systems.

By the time I got there, they were somewhat functional.

The biggest problems were the lack of knowledgeable personnel, not lost data or systems.


Thanks for sharing, for some reason I think about this story a lot. It must have been such an emotionally difficult time for everyone involved in piecing back together their processes.


>Thanks for sharing, for some reason I think about this story a lot. It must have been such an emotionally difficult time for everyone involved in piecing back together their processes.

I was there as a consultant and didn't know anyone there when I went.

I won't provide any details out of respect for those fine people, but the grief was so thick, you could have cut it with a knife. As I said, I didn't know anyone who was there (or wasn't there) but after a day, I wanted to cry.


Happened to Ma.gnolia, which was the number 2 bookmarking site behind Del.icio.us in that era: https://en.wikipedia.org/wiki/Gnolia

HN comments at the time: https://news.ycombinator.com/item?id=487497

The site relaunched a month later and shut down for good a year after that


My tangential thought in that regard is what if this is a really bad outage that causes Slack to tank (i.e. A large number of companies switch to Microsoft, Zulip, etc). Equally interesting a thought.


In 2011 a small amount (0.02%) of Gmail users had all their emails deleted due to a bug: https://gmail.googleblog.com/2011/02/gmail-back-soon-for-eve... They ended up having to restore them from tape backup, which took several days. Affected users also had all their incoming mail bounce for 20 hours.


Google would be catastrophic because so much is stored there.

Slack is mostly real time communication, at least for me. There are a few bits and bobs that really should be documented that are in the messages though.


If this thread is to be believed, apparently a lot of engineers use slack for alerting and don't know how to check their monitoring software manually.


Yeah, Google would easily top the list of companies which can have catastrophic impact. Microsoft, Apple, Salesforce, Dropbox would be the next in the list I guess if we leave out the utility companies and internet providers etc.


Just look at the impact a 40 minute outage of Google Auth had last month, I wouldn't be surprised if the global productivity hit during that outage was in the billions of dollars, and that was for a relatively short outage without any data loss.


AWS outages have basically crippled a few businesses. The longest I know of was 8-10 hours the day before Thanksgiving. Some Bay Area food company got hit by it and couldn’t deliver thanksgiving dinners.


Being in DR, I live my life wondering about that too. I spend a lot of extra time checking accounts and making sure that I print out (yes, sneakernet) important data, as well as keeping manual copies of passwords. It's old school, but it removes the risk to my business in case of a total loss of a global service, and lowers the risk of a heart attack and related stress.

The rest of the world may not be so energetic re: their accounts and data, so it would be painful for many; it depends on how much risk they are willing to accept.

Being in DR, it is very difficult for businesses to allocate the time and resources to good planning - for many, DR is an insurance policy. Engineering and development staff are focused on putting out fires; however, a real disaster is more than most companies can handle if they have not planned accordingly or practiced by testing failover/normalization processes as well as performing component-level testing.


This should actually be part of your Disaster Recovery plan. You should have at least some plan for the loss of all of your service providers. Even if that plan is to sit in the corner and cry (j/k).


We might start to see actual legislation around implied SLAs in the US which would cause Google to rethink everyone's 20% project being rolled out for 2 years.


It would be a mass customer extinction event for said service, and would effectively result in a windfall for competing services


Services like Slack are replaceable for the most part. But how does one even replace a service like Google easily? There are like-for-like services available for Google, but the data is where it becomes tricky. Almost 1bn people losing their email addresses would cause massive issues.


That wouldn't be ideal.


These events seem to be happening almost on a monthly basis now. IRC was never this unreliable and at least with netsplits it was obvious what had happened because you'd see the clients disconnect.

IME messages just fail to send with Slack, then you can retry but they're not properly idempotent and you end up sending the messages twice.

It's really poor.


It's especially strange when you think about how unoriginal Slack's product domain is, and how comparable - and in some cases small - their userbase is.

* iMessage, which likely handles something in the range of 750M-1B monthly actives.

* WhatsApp, 2B users [1], though no clarity on "active" users.

* Telegram, 400M monthly actives [2]

* Discord, 100M monthly actives [3]

* Slack, 12M daily actives [4]

* Teams, which is certainly more popular than Slack, but I shudder to list it because its stability may actually be worse.

The old piece of wisdom that "real-time chat is hard" is something I've always taken at face-value as being true, because it is hard, but some of the most stable, highest scale services I've ever interfaced with are chat services. iMessage NEVER goes down. I have to conclude that Slack's unacceptable instability, even relative to more static services like Jira, is less the product of the difficulty of their product domain, and moreso something far deeper and more unfixable.

I would not assume that this will improve after they are fully integrated with Salesforce. If your company is on Slack, its time to investigate an alternative, and I'm fearful of the fact that there are very few strong ones in the enterprise world.

[1] https://blog.whatsapp.com/two-billion-users-connecting-the-w...

[2] https://techcrunch.com/2020/04/24/telegram-hits-400-million-...

[3] https://wersm.com/discord-reaches-100m-monthly-active-users-...

[4] https://www.cnbc.com/2019/10/10/slack-says-it-crossed-12-mil... (this was also announced on Slack's blog, but that's down).


I didn't realize that Discord has way more active users than Slack. I'm glad; Discord is a fantastic service in my experience. It's a shame they got shoehorned into a mostly gaming-oriented service. I've never had a class or worked somewhere where Discord was a considered solution instead of Slack, but I can't think of anything that Slack does better (in my experience). In general, I think Discord has the best audio and video service that I've used, especially kicking Zoom to the curb.


Discord is definitely in the same realm of scale as Slack, and probably bigger (they publish different metrics, so its hard to say for sure).

The really impressive thing about Discord's scale is the size of their subscriber pools in the pub-sub model. Discord is slightly different than Slack in the sense that every User on a Server receives every message from every Channel; you don't opt-in to Channels as in Slack, and you can't opt-out (though some channels can be restricted to only certain roles within the Server, this is the minority of Channels).

Some of the largest Discord servers have over 1 million ONLINE users actively receiving messages; this is mostly the official servers for major games, like Fortnite, Minecraft, and League of Legends.

In other words, while the MAU/DAU counts may be within the same order of magnitude, Discord's DAUs are more centralized into larger servers, and also tend to be members of more servers than an average Slack DAU. Its a far harder problem.

The chat rooms are oftentimes unusable, but most of these users only lurk. Nonetheless, think about that scale for a second; when a user sends a message, it is delivered (very quickly!) to a million people. That's insane. Then combine that with insanely good, low latency audio, and best-in-class stability; Discord is a very impressive product, possibly one of the most impressive, and does not get nearly enough credit for what they've accomplished.

For comparison; a "Team" in Microsoft Teams (roughly equivalent to a Discord Server or Slack Workspace) is still limited to 5,000 people.


I really agree Discord is amazing and wish I could use it for work instead of Slack.

I think the big things that prevent it from being adopted more for professional use are the lack of a threading model (even though I hate it when people use threads in Slack) and the whole everyone-in-every-channel model, with role-based privacy settings as the only exception. The second one especially is a big deal because you can't do things like team-only channels without a prohibitive amount of overhead.

That said (with zero knowledge of their architecture) I have to feel like both of those missing features aren't too terribly hard to build. It's very likely Discord is growing as a business fast enough in the gaming and community spaces that they don't feel the added overhead of expanding into enterprise (read: support, SLAs, SOC, etc.) makes sense, and are waiting until they need a boost to play that card.


> I think the big things that prevent it from being adopted more for professional use is the lack of a threading model

They do have a threading model now (if you are talking about replying to a message in a channel and having your reply clearly show what you are responding to). If you are talking about 1-on-1 chats with other people in your same server then yes, that is still lacking IMHO in discord. The whole "you have to be friends" to start a chat (or maybe that's just for a on-the-fly group) is annoying.


Discord gives every user an identity that is persistent beyond the server; you have a Discord account, not a server account. Slack does the opposite. Enterprises would hate Discord's model, as they prefer to control the entire identity of every user in their systems, such that when they leave the company they can destroy any notion of that identity ever existing.


Absolutely agree. I like the 1 main discord account but I wish I could have 1 "identity" per-server as well. I don't love that I am in some discords that I don't want tied to my real name and others where I've known these people for over a decade and would see in person multiple times a week (before the pandemic). I know you can set your name per-server but you can't hide your discord username (or make it per-server) which sucks.


Agreed completely. Discord has always been much smoother for me than Slack, and the voice/video chat quality is literally the best I've ever seen anywhere. If they made their branding a bit more professional and changed the permission model from the (accurate) garbage you described to something closer to Slack then I think Slack would be doomed.


We use Discord exclusively at my day job.

We have a few bots we've integrated with things (deployment, stats, etc).

We use it for all our voice/video calls.

Edit: We've got roles setup well for things like contractors, devs, marketing, etc, so it's easy to lock down different conversations in channels.

It's been fantastic.

The only thing I'm not a huge fan of is the (IMO) poor implementation of threaded discussions.

Edit: it definitely has issues with connectivity from time-to-time too, but not bad overall.

TBH, I'm not sure why companies use Slack (I use it for other organizations, so have experience with it too, but not extensive).


>I didn't realize that Discord has way more active users than Slack

Keep in mind you're comparing daily active users vs monthly active users. I'd guess most slack users are online weekday for pretty much the entire day (because it's for work and your boss expects you to be online), whereas a good chunk of discord users are only logging in a few hours a week when they're gaming.


At 12:00pm EDT on a workday:

Minecraft official server: 190k online users. | Fortnite official server: 180k online users. | Valorant official server: 170k online users. | Jet's Dream World (community): 130k online users. | CallMeCarson server (YouTuber): 100k online users. | Call of Duty official server: 90k online users. | Rust (the game) official discord: 80k online users. | League of Legends official server: 60k online users. | Among Us official server: 50k online users.

Their scale is insane. Even with their usage spiking during after-hours gaming in major countries, their baseline usage at every hour of the day, globally, makes it one of the most used web services ever created.

Slack's DAU and MAU numbers are probably pretty close to one-another. Discord's MAU/DAU ratio is probably bigger than Slack's. That just means that Discord is, again, solving a harder problem; they have much bigger (and more unpredictable) spikes in usage than Slack. Yet, its a far more stable and pleasant product.


Our secret sauce is Elixir/BEAM and Rust :)

Well for the real time side, I can't tell you how big a boon it's been to build our platform on top of Elixir/BEAM. Hands down the best runtime / VM for the job - and a big big secret to our success. Where we couldn't get BEAM fast enough - we lean on rust and embed it into the VM via NIFs.

2021 is the year of rust - with the async ecosystem continuing to mature (tokio 1.0 release) we will be investing heavily in moving a lot of our workloads from Python to Rust - and using Rust in more places, for example, as backend data services that sit in front of our DBs. We have already piloted this last year for our messages data store and have implemented such things as concurrency throttles and query coalescing to keep the upstream data layer stable. It has helped tremendously but we still have a lot of work to do!

To help scale those super large servers, in 2020 we invested heavily in making sure our distributed system can handle the load.

Did you know that all those mega servers you listed run within our distribution on the same hardware and clusters as every other discord server - with no special tenancy within our distribution. The largest servers are scheduled amongst the smallest servers and don't get any special treatment. As a server grows - it of course is able to consume a larger share of resources within our distribution - and automatically transitions to a mode built for large servers (we call this "relays" internally.) At any hour, over a hundred million BEAM processes are concurrently scheduled within our distributed system. Each with specific jobs within their respective clusters. A process may run your presence, websocket connection, session on discord, voice chat server, go live stream, your 1:1/group DM call, etc. We schedule/reschedule/terminate processes at a rate of a few hundred thousand per minute. We are able to scale by adding more nodes to each cluster - and processes are live migrated to the new nodes. This is an operation we perform regularly - and actually is how we deploy updates to our real time system.

I was responsible for building and architecting much of these systems. It's been super cool to work on - and - it's cool to see people acknowledge the scale we now run at! Thank you!! It's been a wild ride haha.

As for scale, our last public number perhaps comparable to Slack is ~650 billion messages sent in 2020, and a few trillion minutes of voice/video chat activity. However given the crazy growth that has happened last year due to COVID - the daily message send volumes are well over the 2 billion/day average.
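
For readers wondering what "query coalescing" means here: the general pattern (sketched below in Python with asyncio, not Discord's actual Rust implementation) is to let concurrent identical requests share a single in-flight upstream call instead of each hitting the database:

  import asyncio

  class Coalescer:
      """Collapse concurrent identical requests into one upstream call."""

      def __init__(self, fetch):
          self._fetch = fetch      # async callable: key -> value
          self._inflight = {}      # key -> asyncio.Task

      async def get(self, key):
          task = self._inflight.get(key)
          if task is None:
              task = asyncio.create_task(self._fetch(key))
              self._inflight[key] = task
              task.add_done_callback(lambda _t: self._inflight.pop(key, None))
          return await task

Every caller awaiting the same key while a fetch is outstanding gets the result of that one fetch, which is what keeps a hot key from hammering the data layer.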


Assuming this is real - very interesting read, thanks for sharing.


Just anecdotal, but as someone who has used Teams continuously for 1.5 years, I can say that it has never been down for me.

That being said, individual instances of the app are notoriously unstable causing random annoyances. But, I am on a very early build of Teams, which is buggy by definition.


Slack and the others have different contractual guarantees and different regulatory environments. Comparing them is not really fair because the reality is that these other services probably just lose tons of messages and slack/teams can't do that! They have to have better guarantees.


IME, Slack is far more likely to lose my message than iMessage. I believe that's part of the point being made above.


I've never had slack lose a message when it's up


That's kind of the definition of a service being up. :) I've experienced numerous "soft" outages which result in messages not sending and getting lost - and even more double sends, sometimes very distant from where the message was originally sent.

ITT: Anecdotes


It isn't just # of users, though - SlackOps is probably unique to Slack in that list (minus Teams, I guess) - so # of messages per month is a better metric. Not that I'm letting Slack off the hook, it still may be that their codebase and/or dev process is just nasty.


Telegram is closer to 500 million now.

https://t.me/durov/142


EFnet was always splitting every few hours. I don't really miss IRC compared to modern chat systems.


I'm the opposite. Back in my early teens, friends and I would attempt to hijack opposing groups' channels via takeovers during net-splits (and of course having the same done to us). What a time to be alive.


In the early battle.net days competing clans would split and steal channels. It was tons of fun, and it taught me a lot about bots, proxies, and simple scripting in the process.


Oh yeah, those were the days. Causing server splits to get your nick back that was stolen in a previous server split...


I do miss them, terribly. Lightweight, fast, brutally simple. Even with splits, it was better, and ever since IRC bouncers like ZNC have existed, it's been rock solid.


I spent a few hours setting up a chat client for a reason. Slack takes all this away from me.


My feeling is this is an AWS issue. Our services hosted in AWS are not working either.


Downdetector indicates a possible correlation between Slack issues and AWS reports, even though Slack peaks at 13,960 problem reports and AWS at 111:

- https://downdetector.com/status/slack/

- https://downdetector.com/status/aws-amazon-web-services/


Most people that are affected by AWS outages wouldn't report it as such...


I have also been having intermittent issues with Twitter this morning (can't load tweets etc.) and was wondering if it was connected.


Naaah, Twitter always fails to load for me. It's more surprising if it loads on the first attempt.


I thought Slack was not on AWS, but Oracle.


You may be thinking of Zoom, who signed a massive Oracle contract.


What issues are you seeing on AWS?

The dashboards are all green. (which doesn't mean that much ... I'm aware...)


Do you have more info like services, regions etc? I see all green checks on the AWS Status page.


> I see all green checks on the AWS Status page.

I'm sure you know this already, but that status page isn't worth the cycles on your CPU, you would be better served asking the toaster if AWS is functioning properly than checking that status page.


Of course, yeah, but at least you can sometimes see a yellow and infer it really means red :/.


yellow requires an issue that every customer is aware of and red requires a thermonuclear strike.


If one's smart toaster depends on AWS one might very well do that.


Our prod systems seem to be working, but our lower environments seem to not be working. I don't know enough about where these things come from. I wonder if the real problem is regional. Some connections work and some don't.


Down detector shows quite a lot of issues across a broad spectrum of services, including AWS and Google.


Which AZ?


Availability zones are unique for each account. So my zone A could be your zone C, for example.
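A quick way to see your account's mapping (a sketch, assuming boto3 and configured AWS credentials) is to compare the account-specific ZoneName with the physical ZoneId:

  import boto3

  ec2 = boto3.client("ec2", region_name="us-east-1")
  for az in ec2.describe_availability_zones()["AvailabilityZones"]:
      # ZoneName ("us-east-1a") is per-account; ZoneId ("use1-az2") identifies
      # the same physical zone across accounts
      print(az["ZoneName"], "->", az["ZoneId"])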


I never knew this, but I think it makes sense. Is there any documentation that explains why this is the case? I suspect it's to spread load by counteracting everyone's bias toward the first option, but I'd love to read about it.

[edit] Nevermind, I just needed the right combination of terms to find it: https://docs.aws.amazon.com/ram/latest/userguide/working-wit...


This is so everyone doesn't launch in one zone, "us-east-1a".


Woah, thanks for clarifying--I had no idea!


To be fair, IRC doesn't do a lot of things Slack does. Where are the logging and audit trails, access control, search, etc.?


I'm still dreaming of a world where everyone uses IRC through an interface identical to Slack or Discord or whatever, and features like these are implemented.


I agree in principle, but IRC is a poor way to do this. I love IRC for its simplicity, but that makes it hard to do more advanced features. It's a text-only protocol (other than DCC), so if you want to do something like allow users to click phone numbers to dial them, you have to regex it and hope for the best. Any kind of link is the same way. If you want to show images inline, you'll have to search for links, then either do another regex to see if the link is an image or prefetch the page to see if it's an image. Most servers still implement user authentication as a secondary service (i.e. it isn't part of the IRC server itself) afaik. I think the newer IRC specs include those, but support for it is missing in many servers.
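To make that concrete, link detection over raw IRC lines ends up being a best-effort regex, something like this sketch:

  import re

  # naive and best-effort: IRC hands you plain text, so "rich" features are guesswork
  URL_RE = re.compile(r"https?://\S+")
  IMAGE_RE = re.compile(r"\.(png|jpe?g|gif)$", re.IGNORECASE)

  def extract_links(line: str):
      links = URL_RE.findall(line)
      return [(url, bool(IMAGE_RE.search(url))) for url in links]

  print(extract_links(":nick!user@host PRIVMSG #chan :look https://example.com/cat.png"))
  # [('https://example.com/cat.png', True)]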

Really a huge part of IRC's difficulty and beauty is in not having a markup language, but most of that beauty is for the eyes of the developer, not the user.

I like the concept of Matrix. That's kind of what they're trying to do by creating an open protocol, but when I looked at implementing a client it was non-trivial. For IRC, you can usually send someone a telnet log of you joining an IRC server and they could implement a client. I don't get the impression that that's true for Matrix.


https://news.ycombinator.com/item?id=20948530 is my attempt to demonstrate that implementing a Matrix client is almost as trivial as telnetting to port 6667 on an IRC server, fwiw :)
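Roughly, the whole read path fits in a few lines of Python against the r0 client-server API (a sketch with made-up credentials; error handling omitted):

  import requests

  HOMESERVER = "https://matrix.org"           # any homeserver you have an account on

  # log in with a password to get an access token
  login = requests.post(f"{HOMESERVER}/_matrix/client/r0/login", json={
      "type": "m.login.password",
      "user": "my-user",                      # placeholder credentials
      "password": "my-password",
  }).json()
  token = login["access_token"]

  # a single /sync returns joined rooms, their timelines, and a batch token for polling
  sync = requests.get(f"{HOMESERVER}/_matrix/client/r0/sync",
                      params={"access_token": token, "timeout": 30000}).json()
  for room_id, room in sync.get("rooms", {}).get("join", {}).items():
      for event in room["timeline"]["events"]:
          if event["type"] == "m.room.message":
              print(room_id, event["content"].get("body"))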


You might like IRCCloud; it's a web client (similar to Slack) and bouncer, with support for image uploads, a decent app, and preserved history, and I think it supports search too.


You might appreciate matrix.org / element.io if you haven’t seen them yet


I set up my own Matrix homeserver recently with several bridges to all my current chat services:

https://battlepenguin.com/tech/matrix-one-chat-protocol-to-r...

It works fairly well.


Not really a fan of the Slack or Discord user interface myself, but there are modern looking web clients for IRC such as thelounge[0] or kiwiirc[1] that might be what you are after.

[0] https://thelounge.chat/ [1] https://kiwiirc.com/


Several IRC servers do have support for authentication and access control (and audit trails as well I suppose).

Only centralized history/logging and search would need to be bolted on if needed. In the non-centralized case your IRC client takes care of all of that.



Lack of logs and history is a feature not a bug.


For business users, there are regulatory requirements. You need to keep information around for some period of time, but not forever. History and searching is useful for spreading tribal knowledge throughout an organization.


Does that actually extend to Slack/slack-like things though?

Since I would see Slack more as a replacement for phone calls or hallway discussions, neither of which typically has any logs or recordings (and I wouldn't want to work somewhere that did keep such logs).


It does, yes. This is why, for example, message history data export is a paid feature. It's a requirement for certain types of compliance.


In what areas would you find such requirements? And shouldn't the default position be that it is illegal to keep those logs? Especially those involving direct messages between employees.


Sure I understand but I don’t think that aligns with the spirit of the comment I replied to. I read it as FOMO.


At least when irc goes down you can still access your logs


Our company uses Cliq. I wouldn't say that it's as good as Slack, but it's probably 80-90%, and even has a few unique features (integration into Zoho's suite, remote work checkin, integrated bot development environment, etc)


I find it amazing that we can be about an hour and a half into a service being completely unusable (i.e. Slack telling me it 'cannot connect'), yet it's still marked as an 'incident' instead of an 'outage' on their own status page


It's marked as an outage now.


and yet they're still proudly proclaiming: "Uptime for the current quarter: 100%"


Every time this kind of thing happens, HNers love to gripe about how the status pages aren't correct yet. It's so weird -- like the people freaking out about the outage are going to be updating their uptime trackers right now or something. Who cares? It'll be fixed later.


Well, who consults a service's status page when it isn't down? During an outage is literally the only time a service status page has any function.

A status page that doesn't get updated during an outage is about as much use as a solar-powered flashlight (without built in power storage).


I think the point is that a "Status Page" should show the accurate, current status of the system. Not a place holder for "we'll fix it later". People look at a status page to know what's happening now.


I wasn't talking about the status page, I was talking about the uptime % tracker.

edit: oh, sorry, I did say 'status page' in the first part. But I kinda meant the uptime % tracker, like the parent.


Enterprise contracts have SLAs about uptime, so it's definitely relevant.


This is entirely in line with my experience dealing with outages. 85% of the time to fix consists of fielding requests for status updates.

It's like when people push the elevator button repeatedly if it's taking a while to arrive, except that pushing the elevator button doesn't cause it to take even longer.


what's the point in a status page that only updates after the outage has been resolved?


It doesn't. The status page is currently showing information about the outage. And the 100% uptime number is probably still correct, since it's only been out for a couple of hours.


> And the 100% uptime number is probably still correct, since it's only been out for a couple of hours.

It's listed as "Uptime for the current quarter"; if they mean that as "calendar quarter", i.e. since the start of the year, then we aren't even 100 hours into the quarter so we should be well below 100% by now.
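Back of the envelope (assuming the quarter started Jan 1 and, say, two hours of downtime so far -- both rough guesses):

  hours_into_quarter = 3 * 24 + 16                          # ~Jan 4, mid-afternoon UTC
  downtime_hours = 2                                        # rough figure for this outage
  print(100 * (1 - downtime_hours / hours_into_quarter))    # ~97.7%, well below 100%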


You might be correct, but why would anyone care about quarter-to-date as opposed to a rolling quarter ending now? The latter would mean that an outage of X duration will always reduce this statistic by the same amount regardless of how close the nearest calendar quarter boundary is, which seems like a superior quality for such a statistic to have.


That would be a completely fair metric to publish, but it doesn't look like what Slack is publishing. Of course, it's possible that it is and it's just phrased somewhat poorly.


Fair point.


Interestingly their uptime for the quarter is still 100% despite a full-red dashboard. I wonder if that's something that is calculated only after an outage is resolved


Well until the issue is resolved you can't know how long you've been down for, so you can't actually update the uptime.


Why not? It could be updated second by second automatically if they wanted to.

Probably not a priority though.


Building out the infrastructure to automatically give real-time updates to your uptime figure sounds like a terrible use of company resources. Who knows how many person hours to spend on implementing and maintaining a feature that would remove maybe a few minutes of manual work from the incident post-mortem checklist, just for the sake of delighting people who need something else to look at for a workplace distraction now that Slack is down.


Well, now the outage is marked as resolved. And the uptime is still "100%".


A great opportunity to try out Element, an open source client for the federated and open Matrix network [1]: https://element.io/

(edit: to clarify: not affiliated in any way, just a fan)

[1] https://matrix.org/


I've found that element.io itself is really slow to load.

That said, we're pushing up against the limits of our free plan with Slack and will likely deploy a matrix server in due course.


The good thing about that is that if you want a fast client there are quite a few native clients to pick.

https://matrix.org/clients/

For example, Mirage is Python/QT and quite fast in my experience. There are Rust clients, C++ clients, terminal based ones, etc.


Do you happen to know of any desktop clients that support encryption/cross-signing?

I'd like to get off of Element desktop/web for a couple of reasons, but I need those features. I'd help implement them myself, but that's beyond my skill level.

Edit: For anyone else wondering, matrix-commander [0] looks like it may be workable if a cli tool is acceptable for your usecase.

[0] https://matrix.org/docs/projects/client/matrix-commander/

I'm planning on looking through the GUI ones at some point, but don't have time now.


Fluffychat [1] is built with Flutter and apparently supports e2e encryption.

Note: I wanted to try it out for a while, but haven't yet.

[1] https://gitlab.com/famedly/fluffychat


Maybe we can chat with coworkers here. Is there a Carl around?


I'm a Carl. I'm also looking for a coworker who was trying to contact me. If it's about last saturday, I promise nothing really happened between me and her, but I'm sure she already told you.


Do you guys not have e-mail?

looks through Inbox of 850 new aws, batch job and logging messages

oh yea, that's right..


Don't you guys have e-mail filters?

"Hey, our site has been down for 2 hours, why aren't you doing anything"

Looks at 850 unread messages in ops-notifications folder

ooh yeah, that's right..


"Looks at 850 unread messages in ops-notifications folder"

In my organization it's spelt "deleted items"


I'm Carl.

I lost the login for our shared AWS account. Mind sending it to me here?


root / hunter2


Yes it's


I have tried to sell my organisation on a shared Google Chat doc for 90s style realtime ICQ chat in times like these, but there has been little uptake.


G Suite actually has an entire Slack clone, chat.google.com. I've been on G Suite (now annoyingly renamed to Google Workspace) for years and only recently found out it existed, from another comment on HN.


Yeah, this is what we actually use as a fallback, and I did push for this as a full-time alternative given we'd get it free, but people dislike it for all sorts of frivolous reasons.


Well now's the time for a big push!

Oh wait, how would you share the link...


Hey it's me, your Carl, send me your code


How many more outages until all trust is eroded and competing services differentiate themselves on the basis of uptime?


I say this every time Slack is down, but they just seem so shady to me. Nobody can connect right now, and their status site says "100% uptime in the last quarter". Maybe it's close to 100%, but it ain't 100%.

I think we should push for a metric where "up" means 100% of people that want to use the service are able to use the service. If 1% of users can't send messages, then that should count as a full-blown outage and should start counting against whatever SLA they advertise.

The underlying problem here is that apparently everyone lies about uptime, so if you don't, that looks bad to potential customers. I fear that we will have to push for some legal regulation if we want accurate data, and ... people will probably be opposed to that.


Seems silly to worry about quarterly stats several hours into an outage. The most obvious explanation is quarterly stats aren't generated in real-time -- which isn't "shady" to me.


> I think we should push for a metric where "up" means 100% of people that want to use the service are able to use the service.

I mean, that’s nice to say, but how do you measure/prove it?

Certainly, having the SLAed party check themselves is silly. But what are the other options? If it was up to the customer, customers could make up faults to get free service. (Since it’d be up to the customer to prove, and customers are generally less technical than vendors, you’d have to expect/accept very non-technical — and thus non-evidentiary! — forms of “proof”, e.g. “I dunno, we weren’t able to reach it today.” Things that could have just as well been their own ISP, or even operator error on their side.)

IMHO, contractual SLAs should be based on the checks of some agreed-upon neutral-third-party auditor (e.g. any of the many status/uptime monitoring services.) If the third party says the service is up, it’s up in SLA terms; if the third party says the service is down, it’s down in SLA terms.

(And, of course, if the third party themselves go down, or experience connectivity issues that cause them to see false correlated failures among many services, that should be explicitly written into the SLA as a condition where the customer isn’t going to get a remedial award against the SLA, even if the SLAed service does go down during that time. If the Internet backbone falls over, that’s the equivalent of what insurance providers call an “act of God.”)

But in a neutral-third-party observer setup, you aren’t going to get 100% coverage for customer-seen problems. An uptime service isn’t going to see the service the way every single customer does. Only the way one particular customer would. So it’s not going to notice these spurious some-customers-see-it-some-don’t faults.

So, again: what kind of input would feed this hypothetical “100% of customers are being served successfully” metric?

ETA: maybe you could get closer to this ideal by ensuring that the monitoring service 1. is effectively running a full integration test suite, not just hitting trivial APIs; and 2. if gradual-rollout experiments ala “hash the user’s ID to land them in an experiment hash-ring position, and assign feature flags to sections of the hash ring” are in use by the SLAed service, then the monitoring service should be given N different “probe users” that together cover the complete hash-ring of possible generated-feature-flag combinations. Or given special keys that get randomly assigned a different combination of feature-flags every time they’re used.


> If 1% of users can't send messages, then that should count as a full-blown outage and should start counting against whatever SLA they advertise.

Google published a paper last year describing this approach to measuring uptime: https://blog.acolyer.org/2020/02/26/meaningful-availability/

The idea is to define availability as "the probability that the site 'appeared' to be down for a random user, averaged over a time window of size w". You can choose a particular value of w and look at trends over time, or you can plot availability as a function of w to understand patterns of downtime.
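A toy version of that windowed, per-observation view (a sketch on made-up probe data, far simpler than the paper's user-uptime definition):

  from collections import defaultdict

  # probes: (user_id, minute, succeeded) -- one observation per user per minute
  def windowed_availability(probes, window_minutes):
      by_window = defaultdict(list)
      for user, minute, ok in probes:
          by_window[minute // window_minutes].append(ok)
      # availability per window = fraction of observations that looked "up"
      return {w: sum(obs) / len(obs) for w, obs in sorted(by_window.items())}

  probes = [("u1", 0, True), ("u2", 0, True), ("u1", 1, False), ("u2", 1, True)]
  print(windowed_availability(probes, window_minutes=2))    # {0: 0.75}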


They should at least update the status site to reflect issues currently happening.

I was wondering why the link from a Jira wasn't opening in Slack; the page eventually timed out and gave me a link to status.slack.com, where it told me everything was peachy. Cue me wasting time trying it again because apparently there was no issue with Slack...


That number is almost certainly updated manually. Check back tomorrow and see what it says.

If you look at the history page you can see it's not 100% for every month: https://status.slack.com/calendar


(2 days later) The status site now shows 99.9% uptime for the quarter.


Some companies do this, though probably not publishing data. Any customer downtime is treated the same - for one, for many, for all (in theory, ha ha). But they take it pretty seriously.


You'll just end up with no SLA, or paying a hefty amount to use services, because that's an impossible standard to support for any service at this kind of scale.


Isn't this the problem? Companies like Slack set SLAs that they only meet by lying about their uptime. It's as good as having no SLA, except you're likely paying a premium based on the SLA they set.


I'm not demanding 100% uptime, I'm asking that they say "99.94% uptime" when there has been an outage.

Honestly, I could live with a 99.50% SLA, if that's what it really was. After today's probably full-day outage, they'd just have to be extra careful for the rest of the year (or pay me money). Kind of sucks when it's January 4th and you've already blown your year's SLA budget, though.


If you're asking genuinely then I can tell you my experience when I was part of a SaaS shop, though the times have changed a lot and "my metric is not necessarily your metric".

But it was roughly "one large impact a month, for six months", with large caveats that upper management for whatever company had to be working with the product during that month.

Large companies don't care if X service went out during the night and impacted someone not in their timezone.

If the CTO notices that he can't use something with the same regularity that he gets paid, it doesn't take long for it to stick in his mind. But migrating everything is _so painful_ that the majority of large companies will do anything they can to avoid moving away.


> But migrating everything is _so painful_

This is a key point in the popularity amongst VCs of investing in B2B SaaS. I take their (and your) word for it. But honestly, I don't actually understand this.

Why is migration so hard?


Medium sized team on Slack. We'd need to move ~60 full time in-house employees, ~10 remote contractors who aren't on other comms channels, ~20 infrequent freelance contributors who may not check messages often, ~5 custom bots and apps, and ~15 3rd party integrations (of which some won't support any given choice of alternative).

This is not to mention the fact that half our staff aren't hugely technical, so they have actively _learnt_ how to use Slack and its features around notification control (things that may come "naturally" to the tech-savvy crowd on HN), @-things, bots, etc., and they would need to re-learn a new tool that is going to work in a different way.

This would be a substantial effort for us, and we're a small company. Are there ways to materially minimise this cost?


Training, integration with proprietary internal systems, sheer momentum in the employee base, justifying or even creating a metric to show cost savings of a migration effort, business processes that rely on a specific feature of existing infrastructure needing to be met, the uncertainty of new vs the certain and known instability of something you have....

If you had a small shop with a dozen tech-savvy people and Slack became a problem which was used exclusively for quick business chats, you could probably push a change to another chat platform the next day. You might struggle when you have thousands of employees, some that needed training to use Slack and still aren't that proficient.


Getting workflows re-established, any integrations you had developed or otherwise come to depend on may not work, you will probably lose history, etc.

Plus, it will just take a long time to get everyone on board and using the replacement system. My department is slowly plodding towards using Teams over Slack, but there are enough hold-outs (my sub-department being one of them) that it still doesn't have widespread adoption.


Many reasons, almost none of them technical. Off the top of my head, a few:

* Getting out of the Enterprise Contract, or waiting for the year to end.

* Training people on new software.

* Loss of productivity. (1) Learning a new UI, processes, workflows -- both individually and organizationally. A feature or concept in "Tool A" may exist in a completely different form in "Tool B". Or not exist, and then people need to adapt to and work around the missing feature. (2) Missing out on needed information due to the above. Ultimately, software exists to move and transform data, and when you change the software people have to adjust. Sometimes that doesn't go great. "Oh, I didn't realize I needed to check this checkbox".

Another way to say this is "organizational inertia", which is a fancy term that means "it's hard for people to adjust to change".

And you might think developers and other technical people would have an easier time of it. They (we) do, but not to the extent you may expect. I've been on the front lines of a handful of migrations that affected only the IT staff, and it was a long and arduous process each time.


> Loss of productivity. (1) Learning a new UI

Man it bothers me so much when applications change their UIs on updates for no apparent reason other than "it looks better".

IntelliJ changed the way the build and debug buttons looked in some update, and it took me days to get used to it before I could find them in a snap again. Slack did a couple of no-reason changes as well.


There are plenty of UX reasons (learning new interfaces, etc). The burden here is generally distributed and diffuse.

The really big one, for companies of a certain size / cash flow, is compliance. Companies spend a lot of time developing compliant work flows around a service like Slack.

Migrating to another service requires rewriting the compliance narrative. The current compliance people might not have the confidence or willpower to do that effectively, and can raise legal objections to any such migration indefinitely.


>Why is migration so hard?

Why would anyone make it easy


I think there should be a new computer science law (if this one doesn't exist already):

Things that are easy to migrate from get replaced by things that are hard to migrate from, eventually.

IRC is incredibly easy to migrate from.


IRC is easy to migrate from since there is nothing to migrate other than chat history. IRC is also missing so many features that Slack provides out of the box. And a law like that would not work, since you would need to write complicated transformation scripts to move between services. Also, not all services are a 1-1 mapping. I like IRC but it has its limitations. That is why Slack succeeded where IRC did not.


> And a law like that would not work

The parent meant a law as in "a law of physics", not a piece of legislation.


We can call it the "Law Of Lotus Notes". I'm not sure if it's hard to migrate from, I can only assume that it is impossible to migrate from.


I wouldn't be shocked if businesses saw increased productivity during these.


I have less of an excuse not to be more personally productive, but I can't help anyone else (easily) if my primary method of communication is down. Not only because it's harder to contact you, but also because it's impossible for you to just ask in a channel and have me notice you.

There's also this perverse incentive to Slack all the things. Lots of CI notifications are sent through it. Some org processes are implemented as workflows. There's been talk of how wonderful it would be to hook up tasking and work tracking to slash commands. I and others often use Slack instead of the 'official' tool to video call each other.

An outage like this is still really disruptive. It's not like everyone realizes what's going on immediately or at the same time; we have backup tools, but our turn radius is pretty wide. Some of us can't even communicate effectively without memes, too, and backup tools don't have a giphy integration.

EDIT: Do your CI integrations fail if Slack can't be contacted? Do those failures fail your pipeline? Whoops!
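(The fix for the CI case is simple enough: treat the notification as best-effort so a chat outage can't fail the pipeline. A sketch with a made-up incoming-webhook URL:)

  import requests

  def notify_slack(text, webhook_url="https://hooks.slack.com/services/T000/B000/XXXX"):
      """Best-effort notification: a Slack outage should never fail the build."""
      try:
          requests.post(webhook_url, json={"text": text}, timeout=5)
      except requests.RequestException as exc:
          print(f"slack notify skipped: {exc}")   # log it and move on

  notify_slack("build 1234 deployed to staging")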


Particularly on a Monday morning after a holiday, there are tasks that I know I need to be working on but cannot because relevant details were never transposed from slack to our actual work scheduling tools like google docs, jira, etc. and I cannot access Slack history.


I'm sitting here not sure if I should deploy code since most communication with the rest of my team has been cut off.


If something went awry, and it caused more pain because Slack was down, how would you feel? If you’re missing comms/observability then waiting to deploy seems prudent.


So the answer is... people should stop working?


I can't speak for you, but I can:

  * work on code
  * update JIRA
  * complete required trainings
  * work on my peer reviews (Workday/Okta are up)
  * review tech specs
Deploying is actually a very small part of my job.


Of course there are other things to do. But the things I had planned for the morning are all being delayed.


Not all - many workflows these days rely on Slack or its ilk. Benderbot, Jira/etc. connectors, calendar connectors, remote communication/standups, alerting…


If you use slack primarily as a water cooler then yes.

However, I drive everything through slack - GitHub, linear, calendars, Notion, support emails, etc. I have notifications turned off for every service we use except for slack. This allows me to effectively ignore everything except for slack. These types of outages destroy that workflow for me.


Absolutely! Before the holiday shutdown, I Slacked myself a huge reminder list of things to jump on as soon as we started up again, so that I could hit the ground running in the new year. Oh, wait....


I feel a little good every time Slack has an issue - it has brought social media communications to my otherwise social media free (excluding hn) life.


You seem to be asking about the absolute number of outages, whereas uptime is about the number of outages per unit of time.


None because all competing services have some problems at some point.


For some reason, today, HN seems exceedingly exceedingly slow (to me) after logging in...

Without being logged in, things are as fast as they usually are -- but post log-in, SLOWWWER THAN MOLASSESS...

I tried this several times; why this is, I can only wonder...

To quote Bill and Ted... "Strange things are afoot at the Circle-K..."


Everyone switched from Slack to HN. :)


It is easier to cache stuff for users who are not logged in, as it is the same for everyone. And everyone is looking at Hacker News at the moment to see what is wrong with Slack, which is probably the cause of the slowness.


While that is true for most applications, HN does not do any customisation of the content. I don't notice I am not logged in until commenting.


The point count for most articles is consistently lower on a view of the non-logged-in homepage. I assume that means they are cached more aggressively for non-logged-in. There's also the username and karma count in the top-right.


> While that is true for most applications, HN does not do any customisation of the content

I don't think that's true. For example, if you hide a thread while logged in, it remains hidden when you return.


You would cache the rendered HTML, and the front page has your username and points and stuff. The whole page will be unique to you because of that.


Anyone else having a fantastic morning/afternoon without the constant pinging? Just saying - there's always a silver lining.


It has made coming back from a long Christmas vacation a lot easier. Once I got my emails taken care of, I was able to get to work without distractions. It's been nice.


It's a nice opportunity to go for a quick walk!


I'll concede that it's possible to not know what the problem is by now, but I won't concede that this should not be called an "outage" at this point.


Huh? It's definitely an outage.


Amidst all the double negatives, I think that's what the parent poster was saying.


It's a red "do not enter" esque thing now, but when the parent posted, I think it was still a yellow triangle.

But also, the status page still proudly proclaims that the "Uptime for the current quarter: 100%" — which is clearly false at this point.


That's what they mean - double negative. It has also been upgraded on their side to 'outage'.


I initially misread this as saying that you won't concede it shouldn't be called an outrage.


Potentially an AWS issue?

Slack, notion and AWS all at same time seems unlikely

https://downdetector.com/status/aws-amazon-web-services/


Just a note, if your company uses G Suite, chat.google.com exists and is basically an entire Slack clone. We use it as a backup when Slack goes down (obviously doesn't help for bots and ChatOps we've set up, but works well for realtime work chat).


Calling it a clone is a stretch. There are a ton of features missing.


Our org just failed over to GChat as well. Piece of cake.

Quite glad we never moved any critical ops work into Slack bots, since we don't control Slack.


This is an excellent reminder of the danger of being locked into closed systems.

I wonder how many companies (like mine) have literally ground to a halt because of this? Do other companies have a risk-documented backup plan B for times like this? Presumably the default is for everyone to resort to email?

More worrying is the number of ChatOps processes and alerting/observability systems that are in place around Slack.

Not being able to chat with co-workers for an hour or two is fine, but not being able to safely manage CI/CD/deployments is a big risk.


> This is an excellent reminder of the danger of being locked into closed systems.

Do you honestly think a self managed solution or open source solution would be more reliable for most companies?


When application engineers say stuff like this, they're also implying that there's a giant infra/ops team who will be willing and able to do all the work for them. Nobody actually wants to be responsible for this stuff.


Not at all, I think closed private systems are far better (better products, support, service) but when an entire company runs its operations on a single system like Slack, there is a big risk when it goes away and you need contingency.

I’d still rather be on Slack and suffer a day of lost productivity than force people to use only email or IRC.


I just finished a 5-hour debugging session on what turned out to be several cascading bugs in one of the older systems at my company

Can't deploy the fix, because

- developers trigger deployments through slack and I don't have access to the underlying deployment system

- infrastructure guys who have access aren't responding to my emails


Sounds like a great opportunity to work on not having Slack as the only trigger mechanism. :)

Or at least document how to call the Slack bot manually (assuming it's just an HTTP endpoint).


I agree 100%! Though I think it might be dangerous to prepare for the "last disaster". It'll be some other system breaking next time, so I think we should instead identify which systems don't have some kind of redundancy and determine the blast radius if they crash

I'm good at not panicking about things I can't change, but I worry about some of my colleagues who find it difficult to not have control in these situations

I can't do anything to help them at the moment, so for now I'm heading to my couch with my analogue book :)


Could it be the obvious? Everyone signing on / loading slack clients at the same time?


My gut feeling is everyone coming out of the holidays reading back weeks of notifications.


then wouldn't this happen every monday morning?


A lot of organizations essentially took the last two weeks off from work, which is long enough for a 10-day autoscale window to spin down servers, and then get confronted by a load spike that wasn't pre-spun for.


I would be shocked if Slack operations wasn’t aware of this return to work spike and didn’t pre-scale in anticipation.


I wouldn't, since my personal theory is that the outage is due to AWS and GCP autoscale capacity exhaustion. We'll find out soon enough!

EDIT: And down goes Notion, too: https://news.ycombinator.com/item?id=25634159


>AWS and GCP autoscale capacity

What does this mean? What do cloud providers do when customers scale down their services? Do the providers literally power down servers? Do they sell the capacity to new customers?


They sell unused capacity at a much lower price (spot instances on AWS, preemptible VMs on GCP).

I don't know if they power down some servers if usage stays low for a very long time.


They rate limit how fast you can auto scale which is dependent on a slew of factors.


That doesn’t mean they chose the right number to scale to.

See for example, Amazon Prime day:

https://www.cnbc.com/2018/07/19/amazon-internal-documents-wh...


Perhaps it is a large number of people checking into channels that are backlogged with lots of bot message notifications.


This is after many had a week vacation. I'm sure most weekends some people pop in and out, and logins are more staggered on a typical monday morning.

Just a theory though.


Nice of you to ignore the majority of the world's population who have been up and working long before America woke up.


Relax. GP is clearly referring to an increase in people signing on due to the holidays ending and everyone coming back to work.

Also, Slack has significantly more users in the US than in any other country[1], and it really isn't even close. So the offense you're taking is unwarranted anyway.

1: https://saasscout.com/statistics/slack-stats/


Slack makes ~61% of its revenue from US customers which only has 4 time zones compared to the remainder of their revenue being spread out across ~20 time zones. It's not an unreasonable hypothesis.

See page 12 of the document (which is page 14 of the PDF) https://d18rn0p25nwr6d.cloudfront.net/CIK-0001764925/70df834...


Slack most likely has more US customers but

- Revenue is not the same as users. Slack has tons of free users, and some countries also have lower-priced plans.

- Many companies like Amazon are probably counted as US revenue for Slack but have more than 30% of their employees outside the US. These should not be huge numbers, but they're significant.


Using our team's backup chatroom in a competing service. One of these days P2P Matrix will reach GA, then I plan to make a backup for my backups, Starfleet style.


That's one obscure reference, I love it.

https://www.youtube.com/watch?v=UaPkSU8DNfY

  GILORA: Starfleet code requires a second backup?
  O'BRIEN: In case the first backup fails.
  GILORA: What are the chances that both a primary system and its backup would fail at the same time?
  O'BRIEN: It's very unlikely, but in a crunch I wouldn't like to be caught without a second backup.


Makes perfect sense for O'Brien; DS9 had serious backup issues in the first couple of years

The Forsaken (season 1 episode 17)

  LOJAL: I've been reading the reports of your Chief of Operations, Doctor. They gave me the impression that he was a competent engineer.
  BASHIR: Chief O'Brien? One of the best in Starfleet.
  LOJAL: Then why aren't the backup systems functioning?
  BASHIR: Well, you know, out here on the edge of the frontier, it's one adventure after another. Why don't I escort you back to your quarters where I'm sure we can all wait this out. 

Rivals (season 2 episode 11)

  KIRA: My terminal just self-destructed.
  DAX: What?
  KIRA: I lost an evaluation report I've been working on for weeks.
  DAX: Even the backups?
  KIRA: Even the backups. 

There's a reason to have a backup to the backup by Destiny (season 3 episode 15)


I forgot about that. Starfleet really was in good shape back then.


Late 2300s were the golden years for Starfleet and the Federation. Sad to see they went downhill later on.


> Customers may have trouble connecting or using Slack

I can't stand how marketing speak pervades every sphere of the world. Their entire system is offline (inconvenient certainly, but it happens) and they can't bring themselves to say "Slack is down. We're working on it and will be back ASAP." or something similar. Instead we may have trouble.


The funniest part to me is that their status page still says "Uptime for the current quarter: 100%". These uptime messages are so BS. Heroku reports 6 9s of uptime for this month, even though their own status page shows multiple days with incidents >6 hours


Well that's what happens when Legal joins the fun and starts defining what "downtime" means


Yeah, the number of air-gapped uptime dashboards in SV right now is insane.

Even the major clouds... HN is going wild about it, yet the dashboards say all good.


Someone's performance bonus depends on it; you can bet there is going to be A LOT of heel dragging when it comes to updating those statuses!


How do you know it's down completely? Maybe it's down for you and maybe even down for a majority but still up for some subset. Happens with many products.


Yeah, I don't know, because the Slack status page is so vague.


https://status.slack.com/

Every service is marked as "Outage" as of now (also when I wrote the comment).


It's been a rollercoaster for me the last few hours, sometimes servers are up sometimes they are down. Point being, they are intermittently up :/


The outage is not for everybody, I can connect.


Thundering herd. If you can avoid it, don't connect.
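(Or at least reconnect with jittered exponential backoff so the retries don't all land at once -- a rough sketch of the client-side pattern:)

  import random
  import time

  def reconnect_with_backoff(connect, max_delay=300):
      delay = 1
      while True:
          try:
              return connect()
          except ConnectionError:
              # full jitter spreads retries out so clients don't stampede together
              time.sleep(random.uniform(0, delay))
              delay = min(delay * 2, max_delay)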


yet also "Uptime for the current quarter: 100%"


Maybe that's just for the outage tracker. It's up.


I don't see this as a big deal. Not all metrics have to be real time.


It's not entirely offline though. I was connected via my phone ~90 minutes ago when I first got online today and never had any issues and was able to tell folks at work my PC connectivity may be spotty for a while. When I signed in via my Mac laptop I wasn't able to connect for about 20 minutes, and was redirected to the status page. I've been online for about an hour now.


Why do you consider that to be "marketing speak?" It appears to be concise, direct, and accurate. The phrase "Slack is down," even if true by some interpretations (it hasn't been "completely down" from what I have seen), is imprecise and informal.


There's a wide gulf between "some customers may have trouble using Slack" and "most/all customers are completely unable to use Slack". Putting aside formality, I'd say "Slack is down" is in fact more accurate here (assuming that it is true that most users can't use it, which is true for our company at least).


Because it's not that you "may" have trouble

If their service is down, you will have trouble. The service will be absolutely inaccessible. Don't give people hope with "may"


But 1) it has apparently not been the case that the service was "absolutely inaccessible" and 2) "Slack is down" is still very imprecise and not a great alternative even if the service had been "absolutely inaccessible."


It was pretty clear the 'may' is a euphemism when your whole system is down.


To me it's mildly irksome in the same way as people who say "may or may not". Like, yes, those are the two possibilities, thank you.


I'm willing to bet this is influenced more by SLAs and Slack's lawyers than marketing speak.


As someone in marketing, it's a little bit of this, and a little bit of determining what the most default, catch-all statement could be well ahead of time to make "crisis comms" that much smoother.


"We're experiencing increased service degradation" is so 201x-ish


I find it hilarious that the status page is still saying the uptime for the current quarter is 100%. I'd think it'd have lost at least one 9 by any obvious definition of "current quarter".


Maybe it's not updated in real time? I wouldn't publish my teams uptime metrics while a crisis was happening...


"Something's not quite right"

Another classic.


"Shit's fucked yo, send whiskey"


"Oopsie woopsie!"


I'm still logged in on mobile and can communicate with people from my team, but cannot log in from desktop. With so few people able to connect, it's also unclear whether Slack is eating my messages or there's just no one to respond. So I'd certainly rank that as "trouble using slack" rather than "the system is completely down".


The status page might just lack a branch for when everything is down entirely and only differentiates between "all green" and "not all green".

I assume this doesn't happen all that often.


I agree with you in principle, but I have had no problem connecting to Slack today (I have a free one I use with friends, not a business account), so to say they are down would also be inaccurate.


Well, most outages start with issues that increasingly get worse.

That apparently was also the case here. I started having smaller connectivity issues before it went down completely.


This is probably for legal reasons, i.e. Service Level Agreements. "May" leaves the door open to other interpretations and reporting from other systems.


For the record, I am logged in and have exchanged messages with at least one other person. The rest of my team does seem to be unable to get in though. Maybe it's because I have just had the Slack tab left open in my browser since before I left for Christmas?


I did manage to receive a message a few minutes ago, so it might be just mostly dead.


If it's all dead, there is only one thing you can do.


It's not down completely as I'm chatting with my coworkers on it now.


Very strange to be so upset by an accurate and concise statement, while offering an alternative that isn't even factually true.


Not marketing. That type of language comes from legal and the “never proactively admit fault“ mantra


It's not down for me...


It never went down for me.


Probably just some technicality to try to escape litigation w.r.t. SLAs for their big corporate contracts.


Chat infrastructure at this level of scale is not easy to build and maintain, I appreciate all the hard work that the engineers at Slack are putting in to resolve this.


My business coworkers are freaking out over Slack being down. But all my technical coworkers are nonplussed. It's interesting how those of us with a technical background are not too disturbed by things breaking.


Interestingly, nonplussed is one of those words having two meanings that are at odds with each other. According to Google, those two meanings are:

1. (of a person) surprised and confused so much that they are unsure how to react. "he would be completely nonplussed and embarrassed at the idea"

2. INFORMAL•NORTH AMERICAN (of a person) not disconcerted; unperturbed.


I wasn’t aware of definition 1. I did mean the informal North American definition.


I'm 79, born in the mid-west, living in New England for the last 60 years. Worked in tech. I've never heard the word used except as in Def. 1.


I've never lived in the mid-west or the New England region of the USA. Maybe it's a regional usage (I've lived in Florida, Texas, California, Colorado, Utah, Oregon, and Washington). I'm not sure where I picked up my usage from. My dad is from Colorado and my mom from California. Maybe I picked it up from one of them ;-)


I'm from New England, I grew up with the informal definition and was super surprised when I heard the formal one.


I'm "plussed," because an app that I manage uses slackclient, and some people depend on it to get paid. Obviously it's my fault for not handling the error, and I hotfixed it, but still, wah.


My Slack (desktop + mobile) has been down for the past ~30 mins. Strangely, I can still receive messages/alerts on my phone.


I've experienced this with Slack before where the push notifications come through but opening them fails to load.

I imagine their infrastructure to send push notifications is decoupled from their infrastructure for chat services themselves.

It'd be interesting to know if they have a master switch to disable notifications in times like this, when they aren't usable anyway.


This happens to me almost every day, when there are no incidents/outages. And it's not a network issue, the other apps work fine (e.g. WhatsApp).


I cannot get to slack in my phone or in the browser.

I wonder if this is because I haven't used the phone app in a few days, so I was already logged out, but you and others were still logged in?


Me too. However it won't let me open DMs to certain people.


Same, I still see my colleagues typing but that's it.


Same, but when I go to view the message it hangs.


I too am in this same boat.


I'd be a little nervous if I'd recently bought Slack for $20B.

It's not like there aren't alternatives. You could even imagine someone has a live bridge between Mattermost and their Slack team, making the switchover seamless.


Why be nervous? Outages happen. If this were a string of major issues over a few weeks or months, that might be cause for concern, but a single incident is not.


c-suite politics are brutal. There is always a reason to be nervous, it's just a matter of degree.


Can we please trade in centralized Slack and single-sign-on and get back the netsplit of IRC :) ? At least I can chat with half of my colleagues :)


I wonder if this is one of the larger natural drops/spikes of legit users that their infrastructure have seen?

* lots of users are coming back to work after the holidays today

* lots of users take the holidays off and fully disconnect

* significant new users added in 2020, with so many teams going remote

Sounds like a possible recipe for infra scaling issues and/or cascading failures to me


Notion is sluggish as well. That combined with the reports of HN potentially being slow, is there some larger network issue at play affecting a region of servers, potentially?


My feeling is some common infrastructure is failing or flailing, like some part of AWS, or some backbone provider. Too many flaky things going on at the same time to be independent failures.


There are many reports of issues with ec2 and console on down detector, doesn't surprise me that aws status page is still green.


My company monitors EC2 performance and availability across North America, and EC2 has been fine this morning, according to our data (that said, they had some intermittent issues the last 3 days).


Maybe another internet routing issue, where a bunch of traffic is going through some guys router in Albania. Or even someone is actively interfering with a root server.


Lever has been down for about the same amount of time as well (job recruiting platform).


Same for me. Data is missing in tables too.


I wondered about getting credits for the outage but you can't view the SLA page because the app is down.

https://slack.com/intl/en-gb/terms/service-level-agreement


Does Google use slack? Wanted to start my year with some extra strength tinfoil and it'd just be great if the day a unionizing initiative started the major way workers could talk about said unionizing initiative went down.

EDIT: according to a random quora post they do, so keep the tinfoil out!


I have friends at many multi-billion dollar companies who are all just twiddling their thumbs right now.


Me too. Although that's occurring irrespective of Slack's system status.


LOL! Nice, yeah mine is directly related to Slack being down. A lot of text messages right now.


Well, there goes the credibility one team has in arguing that Slack makes a great knowledge repository.


People actually argue this? Slack is a great comms tool, and a great BACKUP if you can't find something in a real documentation/knowledge/etc. repository.


Where I work, yes. Small and scrappy devs who, when asked to move knowledge into a Confluence or wiki page, argue that it's too much work to find everything they need. They can just type the channel they want to search plus a term and get the conversation they need.

My response is if it’s important long term, it needs to be somewhere visible and exportable should the platform change. As it is now, Slack exports are horrible and large.


Lots of services quite red on https://downdetector.com


Interesting that PG&E is on the list for power outage in SF.

Wonder if that's related?

https://downdetector.com/status/pge/


Unfortunately, Down Detector doesn't actually monitor these services, so we don't know if they are truly down. Down detector relies on human behavior, and we all know humans don't act rationally.


Down Detector largely seems to track daily workplace usage patterns more than meaningful outages.


Which is great to detect common issues across many companies. For example, clicking on the cards shows that many of them are related to "network connection".


It's not, though. They're all spiking because everyone got back from the weekend. Just look at the comments.

H&R Block's page there has this as the most recent comment, from an hour ago:

> My sister was able to have a bank pull money off her card yes her old card dunno which bank ill find out in bit

Two hours ago:

> I went to atm and thought I was crazy my pin wasnt working.

These reports are entirely useless.


Slack is down. This is shaping up to be my most productive day in a long while.


Everyone back to IRC!



Funny how status.slack.com has reported Incidents and Outages for a while now, but still the "Uptime for the current quarter" is reported at 100% on the bottom right of the status table.


One of those affects money via SLA's. Slack is still up, just not usable.


> Slack is still up, just not usable.

I.e., it's down.

(And if you're saying that according to the legal blah blah blah of the SLA that this isn't technically "down", then there might as well not be an SLA.)


> And if you're saying that according to the legal blah blah blah of the SLA that this isn't technically "down", then there might as well not be an SLA.

I am, because I've had these exact conversations with cloud-hosted providers/products. Never once have we been refunded according to the SLA in our contracts. Never really down (according to legal).


Up means working. It does not mean that something is displayed on the screen.


Other than the status page, I can't get anything displayed on the screen.


It may depend on how they define the "quarter". If they take the quarter as the last 91 days and round the number to the nearest percent, you wouldn't see it change unless outages total more than 91 x 24 x 0.5% = 10.92 hours. It's quite subjective and a guess.


I would think that number will be updated once the fire is put out.


It's already being discussed here: https://news.ycombinator.com/item?id=25632346


I got dropped into not-dark-mode with a connection issue message in each of my workspaces.

I guess everyone hopping back online over the course of a few hours for the new year is too much to handle!


Seems like Notion has a service interruption going on as well: https://status.notion.so/

Potentially related?


Could this be some sort of data corruption? I find it hard to believe that Slack could be down for this long without something that is exceedingly hard to rollback. Even if some services are completely overwhelmed with traffic, they could block a certain percentage of traffic to decrease load, and then force servers up across their datacenters and then unblock traffic. It has the hallmarks to me of some sort of datastore is down, but obviously just a random guess.


It hasn't been that long, and lots of other web services are behaving a bit strangely or are down as well - https://downdetector.com/.

So it's probably a wider issue affecting everyone - network-level is my guess.


Not a great first day back for their ops team.


Why does the slack client not show connection issues instead of just hard locking up?


When it went down fully and I had the Windows client open, it went to a page that basically said "Slack is down, we don't know why, try restarting and see if that fixes it. Here's the status page."

It would be nice if they could fix it so that a fresh start also goes to that page, at the very least.


How do you have the Slack app installed? I currently have it installed via the Windows/Microsoft Store, and I suspect that is a significant part of the problem.


Direct download from their site.


The client on my Mac showed a page that said it was having issues connecting with a link to their status page.


The Windows desktop app is less fortunate.


> Customers may have trouble loading channels or connecting to Slack at this time. Our team is investigating and we will follow up with more information as soon as we have it. We apologize for any disruption caused.

- Jan 4, 10:14 AM EST

The status for messaging and connection services has been marked as [incident]

https://status.slack.com/


All services have now been marked as [outages]:

> We're continuing to investigate connection issues for customers, and have upgraded the incident on our side to reflect an outage in service. All hands are on deck on our end to further investigate. We'll be back in a half hour to keep you posted.

- Jan 4, 11:20 AM EST

> There are no changes to report as of yet. We're still all hands on deck and continuing to dig in on our side. We'll continue to share updates every 30 minutes until the incident has been downgraded

- Jan 4, 11:52 AM EST


I don't use Slack that much, but I know plenty of people that work on teams that are probably at a standstill, right now.

HN is also pretty slow...


HN becomes slow because people notice a service is down, and go to HN to check for more info. When Google was down for an hour a couple weeks ago, HN became almost unusable.


I think there are a lot of sites that would benefit from automatically scaling up whenever Slack goes down.


My company runs heavily on Slack. Part of my team got together in a video chat, but I have no idea what happened to everyone else in the company.


I noticed the HN slowdown too. Maybe it isn't only a Slack issue.


I read somewhere that Slack's yearly uptime SLA is 99.99%, a downtime budget that has already been blown through on January 4th.
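(For reference, 99.99% leaves very little room:

  minutes_per_year = 365 * 24 * 60
  print(minutes_per_year * 0.0001)        # ~52.6 minutes of allowed downtime per year
  print(minutes_per_year * 0.0001 / 4)    # ~13.1 minutes per quarter

so a multi-hour outage eats the whole year's budget.)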

Sending big hugs to their ops team.


This is actually impressive, in a bad way. I just have become so used to being able to run highly resilient cross region infrastructure for millions of users with just a handful of people that I forget what real downtime looks like.

For their app to just go completely offline is unacceptable. Bugs and degraded services I get. But this is catastrophic.


I can't even begin to guess what went wrong. What are your guesses? How many screaming executives are there at Slack saying "just roll it back"?


>I can't even begin to guess what went wrong. What are your guesses? How many screaming executives are there at Slack saying "just roll it back"?

Doubtful it's a code issue causing a total system outage. I'm assuming they have a bunch of auto scaling infrastructure that wound down over the holidays and couldn't take the spike this morning.


Well, they did hand off Slack to Salesforce


Mass server migration?


Assuming this is a bad deployment rather than hardware/network issues: it will be interesting to read their post-mortem on why a rollback still hasn't happened after 2 hours of outage. You would hope that a service of Slack's level/popularity would plan for deployment-related outages and be able to roll back a deployment.


And there we have it: relying on big companies sucks. It's great as long as it works, but once a system breaks, thousands, or even millions, of businesses suffer. (Of course they are also beneficial, and a private server can also crash at any time; I don't want to blame Slack, but we always have to keep this in mind.)


If a big company has a million customers and experiences one outage per quarter, then a million businesses suffer every quarter.

If a thousand small companies have a thousand customers each, and each of these small companies experiences one outage per quarter, then a million businesses suffer every quarter.

As the end-user business, is it better to suffer the outage at the same time as other businesses? Is it worse?

Surely there are valid arguments against relying on big companies, but I don't think this is one of them.


> If a thousand small companies have a thousand customers each, and each of these small companies experiences one outage per quarter, then a million businesses suffer every quarter.

Not all companies are created the same. Microsoft, Google and Facebook have had their outages, but IME much fewer than Slack.

If there are a thousand small companies, none of them have a network effect, and those that experience more outages per quarter will lose customers to those that have fewer. So they have much more incentive to improve.

Whereas network-effect beneficiaries like Facebook (and to a lesser extent, Google, Microsoft and Slack) have much less of an incentive to improve. Who else would the customers go to?


Just a note to say "thanks" to the Slack team for the uptime when Slack is not down, it's been incredibly useful as a tool to me when other enterprise systems (Teams, Outlook & co.) have been down over the last couple of years, and especially throughout 2020.

Somehow Slack is very resilient in general. I also appreciate its UX/UI being far superior to Teams.

Ultimately, the cloud is often a single point of failure that companies become over-dependent on. So I'd favour a free (as in freedom) and open source self-hosted alternative if there were one (even if it was from Slack and for pay). I agree with most on here that there isn't such a thing yet - but it's well worth building! So those of you out there who are considering implementing "yet another text editor", maybe this is something to work on.


So many large-scale downtimes across multiple large companies in the past month or so. Is this a bugfix deployment for the SolarWinds hack, or downtime caused by the hack itself? Or some state-sponsored orgs installing upgraded eavesdropping stuff?


Slack has been a uniquely iffy service. I wonder if there's a solid decentralized alternative.


I've had good experiences with ngircd. It's an IRC server that is very easy to self-host, and it can be installed via APT on any Debian/Ubuntu/Raspbian etc. system, and I'm sure on many others.


https://cabal.chat/ is a good program. It does not support all of Slack's features, but it is truly peer-to-peer, so there's no central point of failure or server that can go down. (Well, I suppose if they released a buggy version of the software and you updated, that's a central source, but that's true of most software.)


This is like the first time in a year I've needed Slack; call that bad luck.

I hope Matrix/Element will rise more.


I’m just enjoying it while it lasts


Other data points:

* iMessage was taking its sweet time sending a few texts this morning.

* I had momentary trouble trying to call a business from my Verizon phone, and someone I know had trouble calling from AT&T.

Could just be a coincidence, but I wonder if something larger-scale is happening.


I came here wondering the same thing as I've encountered a handful of availability issues this morning.

* Todoist MacOS app is having trouble talking to its API


Yes, I'm wondering that also.

Todoist was having issues, and launching the iOS app from Xcode started taking a long time in the middle of the day (which reminds me of the app online check fiasco not so long ago).

Even HN seems to be a bit slower.


If anyone is looking for an alternative way for fast and seamless chat with colleagues, friends, or strangers, you're welcome to check out Sqwok (https://sqwok.im)

Although it's built as a live news discussion site versus a team messaging app, the topics can be about anything, are public, and inviting others is as simple as sharing the url of the post (mobile/desktop web).

Example (reposted this hn post to sqwok): https://sqwok.im/p/Q3-1AZFLCSpjew


It’s pretty embarrassing for their 45 minute update to be “not sure what’s wrong!”


The status still says "We're continuing to investigate", but they tweeted[0] that they have found the issue.

[0] - https://twitter.com/SlackHQ/status/1346132040249470979


Down Detector showed a lot of different services suffering downtime at the same time https://downdetector.com/

I wonder if it's an AWS region issue


I have stopped using Down Detector as an accurate measure because a lot of "outages" are just people having issues with a service unrelated to the service they are reporting as down. Ex: AT&T outage in Nashville caused people to report Xbox Live as down, when it wasn't actually down, etc.


I'm having issues reaching a lot of sites, especially American ones. Downdetector, Hacker News, and others load extremely slowly or not at all. Downdetector had a bunch of failed resources for me.


Slack has been failing - hard - the past few months. Yeah, I get it, lots of remote workers - but Slack has had months now to prepare for an onslaught given the trends with COVID. Simply not acceptable.


Welcome to the party, pal!


From the status page (https://status.slack.com/2021-01/9ecc1bc75347b6d1), updated just now:

> We're continuing to investigate connection issues for customers, and have upgraded the incident on our side to reflect an outage in service. All hands are on deck on our end to further investigate. We'll be back in a half hour to keep you posted.

- Jan 4, 5:20 PM GMT+1


Looks like my vacation continues!


I was seeing issues about 10 minutes before their system status page was updated. I'm surprised they don't have automatic monitoring of some kind.


Status pages are probably manually updated. You don't want a false positive/bug in your monitoring to affect your public metrics.
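
A rough sketch of that tradeoff (everything here is hypothetical, not Slack's actual tooling): automated probes page a human after a few consecutive failures, but the public status page only changes when someone confirms.

    import time
    import urllib.request

    HEALTH_URL = "https://example.com/healthz"   # hypothetical endpoint
    FAILURES_BEFORE_ALERT = 3                    # debounce to avoid one-off blips

    def probe(url: str, timeout: float = 5.0) -> bool:
        try:
            with urllib.request.urlopen(url, timeout=timeout) as resp:
                return resp.status == 200
        except Exception:
            return False

    def monitor() -> None:
        consecutive_failures = 0
        while True:
            if probe(HEALTH_URL):
                consecutive_failures = 0
            else:
                consecutive_failures += 1
                if consecutive_failures == FAILURES_BEFORE_ALERT:
                    # Page on-call; a human decides whether the public page changes.
                    print("ALERT: health check failing repeatedly, notify on-call")
            time.sleep(30)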


Fair enough. Though I'm not sure how I'd feel about the whole world knowing about my service's outage before I do.


I'm positive that they have internal monitoring, and probably knew about the issues well before they decided to manually update their status page to reflect the issue. Manually updating the status page does not equal no monitoring, after all.


Looks like Slack just updated their status page to show a complete outage, not just an incident for "Messaging" and "Connections."


IMO it's really poor that Slack took an hour to update this to an outage, given the impact it seemingly had right from the off.

It's also extremely bad that we're an hour in and they are still "investigating", with no more details than that.


Let’s take a moment and express solidarity towards the fellow engineers that are currently working like crazy under a lot of stress to fix this.


Zoom as well, no? Beautiful to see the whole team grinding to a halt. Or maybe everybody finally is getting some time for deep focused work.


For a product that is so simple, there are no good self-hosted alternatives. Mattermost and RocketChat are written very poorly; reliability is bad and getting your data out is impossible.

Slack goes down so often we're thinking of writing a very boring clone that uses ActiveMQ and MySQL, just because chat should be boring and needs to "just work".
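
Roughly the shape I'd expect that boring clone to take - a sketch only, with sqlite3 and an in-process queue standing in for MySQL and ActiveMQ, and every name made up:

    import queue
    import sqlite3
    import time

    # "Boring" chat core: persist every message first, then fan it out.
    db = sqlite3.connect(":memory:")  # stand-in for MySQL
    db.execute("CREATE TABLE messages (channel TEXT, author TEXT, body TEXT, ts REAL)")
    fanout = queue.Queue()            # stand-in for an ActiveMQ topic

    def post_message(channel: str, author: str, body: str) -> None:
        msg = {"channel": channel, "author": author, "body": body, "ts": time.time()}
        # Durability first: store the message before anyone is notified.
        db.execute("INSERT INTO messages VALUES (:channel, :author, :body, :ts)", msg)
        db.commit()
        fanout.put(msg)               # a broker would push this to connected clients

    def history(channel: str, limit: int = 50):
        cur = db.execute(
            "SELECT author, body, ts FROM messages WHERE channel = ? ORDER BY ts DESC LIMIT ?",
            (channel, limit),
        )
        return cur.fetchall()

    post_message("#general", "alice", "hello, boring world")
    print(history("#general"))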


I was just considering setting up a Mattermost instance for our company since I used it for a year at a previous job without any issues (I was just a user though, I didn't deploy or maintain it). Just curious, why do you think it's poorly written or unreliable?


We tried running it, so we have a lot of experience with it, and it wasn't great. It barely stayed online.

For something so simple, you have to run a massive server - gigs of RAM and multiple cores - even with a very modest user load. Take a look at the codebase: it's a mess, and it's impossible to fix any bugs. Finally, if you want to get your data out or report on message activity, good luck; you'd be better off passing paper notes around. The open source version is nerfed a bit too - no LDAP authentication, for instance - so it creates a lot of problems there as well.


Zulip can be self-hosted - have you looked at that? I like the threads implementation.


Maybe your assumption about it being so simple is incorrect


Breaking news: productivity hits an all-time high today as tech workers are forced to work from home without Slack.

https://twitter.com/louiechristie/status/1346213038924427265...


It's funny that this isn't considered an "Outage" by their status page's standards.


Seems that it is now. It was originally just Messaging and Connections that had an "incident", so I wonder if something else happened or they manually changed the status to at least own that all their services went FUBAR.


Duplicate of https://news.ycombinator.com/item?id=25632048

I think HN is hiding these posts. Maybe status threads are discouraged now? But they're much more useful than status.slack.com etc.


> Maybe status threads are discouraged now

They always have been, since they clearly don't fit the guidelines for what a good submission is and usually leave little for interesting discussions. (unlike postmortems of past outages, which often are good)


Yet they are usually incredibly useful for most people here. They should be allowed at least while the event is ongoing.


Agreed. When a major service goes down, HN is the most accurate overview - often a useful sanity check when it's an AWS- or Slack-sized org, before I open an incident with whichever party.


HN is where we all go when the Internet (or large portions of it) are down. It's more reliable than all the 'downforeveryoneorjustme' or 'downtime monitor' services.


It's the first page I try when I think I'm having connection issues at least, to verify it's some service that's broken rather than my local network


Absolutely. I come here for comments to get an idea from other engineers of what's actually going on. Way more useful than an is it down site.


They've been discouraged for some number of years, but community upvoting manages to get them to the front page now and then regardless.


Now an outage:

We're continuing to investigate connection issues for customers, and have upgraded the incident on our side to reflect an outage in service. All hands are on deck on our end to further investigate. We'll be back in a half hour to keep you posted.

Jan 4, 8:20 AM PST


I am so glad that at least today I do not hear that annoying Slack sound. I really do think Slack is not helping me at all to concentrate on my job (system administrator): synchronous messages are the worst thing while working; email is much, much better.


Honestly I'm fairly sure the vast majority of "technology" we've deployed, as an industry, in the past 10-15 years has actively made life worse. I don't know about anyone else, but that's the opposite of why I got into technology.


"It will replace email."


I mean...wasn't gmail (which effectively IS email for many, many people) down recently?


> which effectively IS email for many, many people

Doesn't have to be, though. One person doesn't even have to tie their address to a single provider, and seeing past received messages doesn't even need internet connectivity.


Just wanted to come here and say, hey! How is everyone doing? How was your Holiday break?


I managed to completely forget everything I knew about my job.

Send help.


I had 1 day off, so basically working all the time.


I'm able to connect to Slack at the moment. My company doesn't use it, but a hobby group I belong to uses it for discussion forums and their instance is up and functional. So it isn't down completely as I write this.


My biggest frustration with these outages is that they're hard outages across all of Slack. There are no reasonable workarounds or fallback features.

A plaintext web interface would keep my team moving along while they resolve their issues.


Nothing reminds you how dependent you've become on Slack for communication (and archiving conversations) like an outage on the Monday after the holidays, when you're not on your A-game yourself.

"Let's see, I'll look up so and so's name with Sla.... shoot"

"Okay, I'll just find that thing I .... nevermind"


The joys of multi-tenancy.


> We’re still investigating the ongoing connectivity issues with Slack. There's no additional information to share just yet, but we’ll follow up in 30 minutes. Thanks for bearing with us.

Seems to be working intermittently, however.



I think this is due to AWS. Slack isn't the only thing down (Notion, for example). The AWS status page doesn't show anything yet, but it wouldn't be the first time. The last Kinesis crisis didn't show up for hours.


I feel obligated to mention this, which was posted a mere 8 days ago.

https://news.ycombinator.com/item?id=25550685


Slack down, productivity up!


Service Interruption as a Service.


At least their status page works ¯\_(ツ)_/¯ (looking at you, AWS).

I am really looking forward to a better competitor taking over their market share; I presume things will only get worse after the Salesforce acquisition.


Can't handle the post-holiday surge, or did people want to justify their long holiday and push something, only to watch their holiday optimism crash head-first into reality?


We have a Discord server as a backup for when Teams is down. Teams seems to have gotten worse, with entire days of downtime, so we have to resort to Discord voice, which always seems to be up.


Todoist reports that it is down as well [0]. I wonder if it's connected in any way, shape, or form.

https://status.todoist.net


A good time to host your own Slack-like chat with Mattermost instead.


A good reason to try Matrix, or even to set up a reserve (Matrix) channel.



PACE = Primary, Alternate, Contingency, Emergency

If you haven't been able to justify testing your PACE plan with your bosses lately, now's a great time to go ask again.


Early 2021 downtime, jeez. Good luck, ops team, I believe in you.


Well, they're mostly back up now. I'm curious to see what will come of the postmortem, and if that report will be made public, even if only in part.


We're using Google Chat again. Feels ancient. https://chat.google.com/


If you're using G Suite already, it's a usable failover. I already send all my alert notifications there as a fallback. Dragging people in was trivial. It's better than the group SMS that one person tried to use.


Same - and it's just... weird. The "everything must be in a thread" model feels really clunky.


That's a good point, however threads do tend to help with keeping things organized when in a channel.

It actually might be a good thing that everyone doesn't feel the need to look at slack every X minutes.


Subjective... I've found Slack and co's interspersed conversations far too chaotic, and temporal; threading is a great way of organising many different concurrent topics.

And to be clear, I don't mean Slack's implementation of threads, which hides them away in a separate panel and doesn't get used by everyone either.


I find Zulip to be a nice balance. Threads are much more prominent than they are in Slack, but aren't clunky.


Slack/Notion: stop with the features already. You're killing yourselves in slow motion. Focus on performance and the robustness of your infrastructure.


"It was working fine when we sold it to you"


> While the issue is largely still ongoing, we believe some customers may see improvement in connecting to Slack after a refresh (CTRL/CMD + R).

Nice.


I'm setting up a backup for our company on Discord. That way maybe some webhooks won't be working, but communication resumes.


I noticed something about dead slack channels on the GCP console last night, which I thought was odd. Anyone see something similar?


Just seems slightly disingenuous to me to have "100% uptime" on the same page that says there is a current major outage.


Especially when it's "uptime for the current quarter." A three-hour outage this early in the quarter is already around 3% downtime against the time elapsed since Jan 1, I would think...


This is my SignalR alternative with end-to-end encryption. Choose a password, and the file and message will be encrypted client-side using that password.

URL: https://symmetric-crypto-chat-room.herokuapp.com/

Repo: https://github.com/amir734jj/SymmetricCryptoChatRoom
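
For anyone curious about the general idea, here's a sketch of password-based symmetric encryption in Python (using the `cryptography` package); this illustrates the technique, not the project's actual JavaScript implementation:

    import base64
    import os

    from cryptography.fernet import Fernet
    from cryptography.hazmat.primitives import hashes
    from cryptography.hazmat.primitives.kdf.pbkdf2 import PBKDF2HMAC

    def key_from_password(password: str, salt: bytes) -> bytes:
        """Derive a symmetric key from a shared password (PBKDF2-SHA256)."""
        kdf = PBKDF2HMAC(algorithm=hashes.SHA256(), length=32, salt=salt, iterations=600_000)
        return base64.urlsafe_b64encode(kdf.derive(password.encode()))

    salt = os.urandom(16)                             # sent alongside the ciphertext
    key = key_from_password("correct horse battery staple", salt)

    ciphertext = Fernet(key).encrypt(b"hello, room")  # done on the sender's machine
    plaintext = Fernet(key).decrypt(ciphertext)       # done on the recipient's machine
    assert plaintext == b"hello, room"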


Surprisingly, the mobile client is working for me.


Yay Salesforce!


$28B well spent!


We just had to route around this so we're trying out chat.google.com for the first time. Seems ok.


Can we all accept that, like human beings, every single service can have a couple of off days a year?


The entire point of all of the engineering we talk about around here is to produce services with inhuman capabilities and resilience.


Doesn't that feel like an oxymoron?

If a human created it, it can never have inhuman capabilities.


As an alternative, Cisco has Webex - less well known, but it does the job: https://status.webex.com/service/status?lang=en_US


Notion is down for me (and others) as well. Is there a cloud outage somewhere?


I just hope Slack itself has a backup chat tool for incidents like this.


We run a copy of this as a super-duper backup: https://github.com/shazow/ssh-chat


The status page says apps/API are fine, yet I'm trying to work on a Slack app and can't because of these errors, which is a bit annoying.

Edit: it is now showing as a total outage on the status page.


It seems weird to say there are issues with connections but everything else is working fine. Is the API technically fine on their system metrics, even though no one can connect to use it, so it stays green? Keeping those components green doesn't help much in practice if connections are having issues and everything is unusable.

It would be similar if auth were down: you can connect to us, you just can't authenticate, so you can't actually do anything.

Edit: Looks like they updated the status to properly show an across-the-board outage.


Salesforce bought Slack, so maybe it's enterprise now.


Considering how ubiquitous Slack seems to be in a lot of major tech companies, I wonder if it's reasonable to ask whether the stock market's performance this morning is somehow correlated.


Give Telegram a chance. It's worth it! telegram.org


Do you recommend it for use by teams?


Here's hoping it's not SolarWinds-related, eh!


me too


I just got an invite to join my company’s Google chat.


Time to talk to my coworkers in person... dry heaves

/s


I'm fairly certain that dry heaves are not a COVID-19 symptom, at least.


Never deploy a new release on a Friday or a Monday!


Surprisingly, the mobile client is working for me!


It's off and on. The connectivity is spotty.



I’m going to get so much work done today


Any opinions on using Discord vs. Slack?


I personally have found that one of Discord's major shortcomings is the lack of support for threaded message chains. When you have two or more parallel conversations in a channel, the ability to communicate effectively drops dramatically.


Should be an interesting post-mortem...


And... it is back (for us at least.)


This is definitely going to catalyze a nascent move over to Discord for my team. (~80 person consulting agency, distributed)


Which would be unfortunate if based only on evidence of Slack being down today, given how many other sites are down as well. (Discord is up, though!)


It's not based only on that. Slack costs a lot of money, and moving off of it is something that has continually come up over the last year or two. We even had a RocketChat server up and running for a while.


That's it, I'll move from [closed source, centralized, paid service] to [closed source, centralized, paid service]!


I'm migrating to Rocket.chat on Digital Ocean as we speak. Has anyone else made the move or tested Rocket.chat?


My old company used a mix of Slack and RocketChat. Functionally, it's fine but I was never a big fan of the UI and how attachments were handled. Also, cross-channel search was kinda bad. Mind you, this was well over a year ago so I'm sure things have improved.


Notion is also down right now.


Which tool did they use to communicate during the outage?


Seems to be back up for me? Status page has not updated yet though.


Time to slack


Obligatory BGP hijack prediction, since there seems to be a bunch of other sites down too.


time for nap


oof.


#hugops


This site seems to be lagging as well. Or is it just me?



It's struggling. All the people freed up that think chatting in Slack is productivity?


Witnessed this when Google went down in December. Seems like a thundering herd problem with tech folk flocking to HN for updates/discussion/gloating


It isn’t just you, but due to Hacker News’ right-sized[0] infrastructure, you should sign out unless you need to comment. That way you hit the caches instead of getting the server to make you a new page.

0: https://news.ycombinator.com/item?id=12911461


That’s the wrong way to look at it. If HN struggles in certain situations then it is not right-sized. You don’t beg of users to walk an unintuitive happy path (i.e. logout when not commenting).


It is “right-sized” if it happens so rarely that it realistically causes no problems.


Rare is not never.


But never isn’t necessarily a reasonable goal. People will tolerate outages when it’s not lost money/business.


Is this still true? Asking because that comment from dang is 4 years old.


It's quite fast for me.

Edit: Not consistently, I guess. 9 out of 10 times it responds instantly, then it lags once in a while.


It is the replying that is particularly slow for me.


Several sites and services are lagging or slower than usual for me (facebook messenger, news sites, google)


Other than it being days of slow news - with top stories seemingly pinned for days now, and boring - no ;) You know it's a slow stretch when Slack being down is considered newsworthy (yawn).


Not just you. It is slow for me as well (Calgary, Canada).


In the other threads, there's been some theorizing that some central infrastructure is down or struggling. Perhaps AWS or the like.


It could just be an effect of people switching to "alternatives" from Slack, effectively DDoSing those services. Notion just went down as well.


Same here (London)


Me too (New York).


Same here (Berlin).


Struggling for me as well; Reddit is too. But I'm finding other sites are just fine.


Who knew Salesforce could work this fast :p


We were just joking with the workmates -- SF bought Tableau 2 years ago and hasn't ruined it yet, only because it takes them that long to do anything ;)


Can't ruin something that's already ruined.


They had the same issue 3 months ago: https://news.ycombinator.com/item?id=24687957


Indeed.


[deleted]


It's snarky - snarky comments are against the guidelines here.


Hopefully it stays down forever


Why would you use the cloud for that? We have Mattermost at my startup.


oh good, it's not just me and my already bad first day of the year... is it too early to start drinking?


> is it too early to start drinking?

Depends on the timezone you're in, though one could theoretically cite a disparity between physical and mental/emotional/temporal time zones...


You should probably ask Jimmy Buffett

https://www.youtube.com/watch?v=BPCjC543llU


no



