All: large threads are paginated, especially today when our server is steaming. Click More at the bottom of the thread for more comments, or like this:
(Yes, these comments are an annoying workaround. Their hidden agenda is to goad me into finishing some performance improvements we're badly in need of.)
When I was at Uber, we noticed that most incidents are directly caused by human actions that modify the state of the system. Therefore, a large "backlog" of human actions that modify the system state has a much higher chance of causing an incident.
My bet is that this incident is caused by a big release after a post-holiday "code freeze".
To elaborate a bit more on this point, you have to think about it like any complex system failure - it's almost never one thing, but rather a combination of many different factors. The factors around post-NYE releases:
- high-risk changes that weren't released pre-holidays get released. Depending on the company, this could mean a 1-week to 1-month delay between implementation and release. The greater that interval, the higher the divergence between the world of production and the world of the new feature
- lots of new hires (new year = new hiring budget). New hires are missing some tribal knowledge about the system and are more likely to make a production-breaking release.
I tried to think of other reasons, but these two overwhelmingly stand out as the biggest. Would love to hear from others.
If new hires tend to break production, it's not on the first business day of the calendar year. December gets really quiet for recruiting, typically, as candidates get busy with their social lives, and scheduling interviews gets harder.
January is busy for recruiting, but given a week or two of interviewing and negotiating, two weeks notice, it's probably February before new employees are starting, and they're not making big, production-damaging deploys for a week or two after that.
You will also get a pause in new hires in late December for the same reason. I've certainly accepted an offer late in the year and then didn't start until the new year.
Probably not as big of a rush as the end of school year rush in summer though.
I also doubt that new people will be breaking production on day one. Even at a fast moving startup I'd expect it to take a bit to go through the onboarding paperwork, get a laptop and actually try pushing a change to production.
I think some big company (maybe Facebook) has this rule that you have to deploy something to production on your first day. They seemed pretty confident in their processes and devops teams. A company trying to imitate that policy without doing the work necessary to make it possible would probably have outages on days when lots of new people joined :-P
Could be Facebook, as I think production releases are always rolled out in phases, e.g. first to 10 users, then 100, then 1,000 and so on. That means there's much less chance of even the worst mistake having a serious effect.
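For anyone curious what that kind of staged rollout looks like mechanically, here's a minimal sketch (not Facebook's or Slack's actual system, just the common pattern): hash each user ID into a stable bucket and only enable the feature for the first N buckets, then keep raising N as the phases widen.

```python
import hashlib

# Hypothetical sketch of a phased-rollout gate: users are assigned a stable
# bucket from a hash of their ID, and a feature is enabled only for the first
# N buckets. Growing N (10 -> 100 -> 1000 ...) widens the cohort without ever
# flip-flopping which users are included.
TOTAL_BUCKETS = 1_000_000

def rollout_bucket(user_id: str, feature: str) -> int:
    """Deterministically map (user, feature) to a bucket in [0, TOTAL_BUCKETS)."""
    digest = hashlib.sha256(f"{feature}:{user_id}".encode()).hexdigest()
    return int(digest, 16) % TOTAL_BUCKETS

def feature_enabled(user_id: str, feature: str, enabled_buckets: int) -> bool:
    """Enable the feature for roughly enabled_buckets / TOTAL_BUCKETS of users."""
    return rollout_bucket(user_id, feature) < enabled_buckets

# Phase 1: ~10 buckets out of a million; later phases just raise the number.
print(feature_enabled("U12345", "new-message-renderer", enabled_buckets=10))
```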
Wow, onboarding new hires here is going well if they can access Slack, O365, LDAP, VPN and clone the repo by the end of the first day. Though we do have the initiation ritual of installing the OS on your laptop.
Doubtful. It's not impossible that a company the size of Slack relies on a specific engineer logging on in the morning before a traffic spike so the service can handle the extra load, but that would be a misuse of modern distributed cloud computing.
Hate on the cloud all you want, but AWS has (several flavors of) load balancers and various ways to automatically scale up and down resources (and if you're conservative, you can disable the 'down' part). If you're operating a major SaaS company like Slack and not taking advantage of them, something's gone wrong.
It's easy to fall behind on bumping up the high watermark for your max autoscaling or for new traffic patterns to cause emergent instability. New code paths are taking unprecedented amounts of traffic all the time.
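As a concrete (hypothetical) illustration of the high-watermark problem: even a periodic check that an autoscaling group's desired capacity isn't pinned against its MaxSize can catch this before a traffic spike does. A sketch using boto3; the group name and threshold are made up, and in a real setup this would page someone rather than print.

```python
import boto3

# Hedged sketch: periodically check whether an Auto Scaling group is sitting
# near its MaxSize "high watermark" and flag it, so the ceiling gets revisited
# before a post-holiday traffic spike. The group name below is hypothetical.
asg = boto3.client("autoscaling")

def check_high_watermark(group_name: str, headroom: float = 0.8) -> None:
    resp = asg.describe_auto_scaling_groups(AutoScalingGroupNames=[group_name])
    group = resp["AutoScalingGroups"][0]
    desired, max_size = group["DesiredCapacity"], group["MaxSize"]
    if desired >= headroom * max_size:
        # In practice: page the on-call or open a ticket, not print.
        print(f"{group_name}: desired={desired} is within 20% of MaxSize={max_size}")

check_high_watermark("web-frontend-asg")  # hypothetical group name
```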
In 2021, how does one keep track of resource starvation at the process, container, OS, service, pod, cluster, availability zone and region levels?
I would add the potential scaling issue - the holidays were a dry season, with fewer meetings. So if they have some automation for scaling down to reduce cost, it may have bitten them in the arse now.
People came back to work, and most of them started around the same time (US-wise at least).
Hence, kids, a vital lesson for all of us: don't start the call on the full hour; give it 3-7 minutes to confuse your coworkers and give the systems some time to auto-scale ;)
I think you're right on the first bullet, but not the second. If it was mid-Feb, then maybe, but the next FY hasn't even started yet for a ton of companies, let alone onboarding newbies to production.
Yeah, makes sense. A system typically optimized for performance and real time delivery is suddenly asked to perform multiple batch retrievals in large chunks. Ouch!
I would bet it's just the influx of traffic post-holiday. With systems that haven't been updated in so long, maybe some annoying memory leaks have crept up and gone unnoticed, or some other bad state was exacerbated by return-to-work day for most NA folks. Code freezes were good at identifying bugs that only show up after long periods.
Doubt anyone is releasing big changes Monday morning.
I haven't worked at Slack, so I can't speak with high confidence. A traffic spike is a possible reason, but I'm willing to bet that it's not the reason:
> Doubt anyone is releasing big changes Monday morning.
This is definitely an engineering best practice, and by best practice, I mean something that Uber's, I mean Slack's SRE team strongly pushed for, and got politely overruled on. After a code freeze is lifted, it's quite common for lots of promotion-eager engineers to release big changes.
IMO it really doesn't have to be promotion-eager engineers or antsy product managers. I'm fairly satisfied with my role, comp, and type of work for where I am in my career/life stage. I just did a code release first thing this morning, not because I'm promotion-eager, but just because I'm picking back up where I left off, like any normal day. Granted, I work at a much smaller company than Slack with orders of magnitude less traffic.
I'm not sure about that. I feel like I get more upvotes from sarcasm and jokes than from insight. In this instance, I think it's because when people hear something dumb said seriously in real life, they're not going to readily recognize online that it's a joke.
Why? I had a rewrite of some core logic ready the last day before Christmas that I didn't deploy, as it wasn't time critical to get out and I didn't want to be disturbed during the holidays. Today was perfect for deploying it, as I can watch it the whole week if needed.
Well, I think it probably depends on where you work. At my work, people just took 2-3 weeks of time off. It takes a moment to get your head back in the game.
Everywhere I've worked there has often been a massive backlog of things that get released after a moratorium or extended holiday week. Those are usually the worst weeks to be on call, since things are under so much churn.
Interesting, I've never worked anywhere where engineers decide when to release changes. That's a product decision, and there is a process of review and approval at both the code level and the functional/end-user-experience level that has to happen first.
Did you mean that literally? E.g. is it common at Uber that engineers can release changes to production on their own?
At Cisco (Webex team), the engineers decide when to release code, and most features are enabled by configs or feature flags independently of the deploys.
The engineering team is responsible for the mess caused by a bad deploy, so it's appropriate that those engineers should also choose the timing.
Our team typically deploys between 10am and 4ish, local time, since that's when we're at our desks and ready to click through the approvals and monitor the changes as they go through our pipelines.
The feature enablement happens through an EFT / beta process, and the final timing of GA enablement is a PM decision. But features are widely used by customers ahead of that time, as part of the rollout process.
Our team usually rolls out non-feature changes to services via dynamic configuration switches, so that we can get new bits in place, and then enable new behavior without a redeploy. This also enables us to roll back the dynamic config quickly if something unexpected happens.
(We generally don't do this for net new functionality; there's lower risk in adding a new REST endpoint etc. than in changing an existing query's behavior or implementation.)
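If it helps, here's a rough sketch of that deploy-then-flip pattern, assuming a simple file-backed config store; the path, key names, and query functions are all illustrative, not anyone's real setup.

```python
import json
import threading
import time

# Illustrative sketch of "deploy the bits, flip behavior via dynamic config".
# CONFIG_PATH and the flag name are made up; in practice this would usually be
# a config service rather than a local file.
CONFIG_PATH = "/etc/myservice/dynamic-config.json"
_config = {"use_new_query_path": False}

def _reload_config_forever(interval_s: int = 30) -> None:
    global _config
    while True:
        try:
            with open(CONFIG_PATH) as f:
                _config = json.load(f)
        except (OSError, ValueError):
            pass  # keep the last good config if the reload fails
        time.sleep(interval_s)

threading.Thread(target=_reload_config_forever, daemon=True).start()

def old_query_implementation(channel_id: str) -> list:
    return [f"messages from the old path for {channel_id}"]

def new_query_implementation(channel_id: str) -> list:
    return [f"messages from the new path for {channel_id}"]

def fetch_messages(channel_id: str) -> list:
    if _config.get("use_new_query_path", False):
        return new_query_implementation(channel_id)  # new code, already deployed
    return old_query_implementation(channel_id)      # rollback = flip the flag back

print(fetch_messages("C024BE91L"))  # hypothetical channel ID
```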
Does Uber/Slack not release in CI/CD? At least in backend?
I don't see any need to deploy a big change at once in the software world today. At worst feature gate the thing you want to do and run it in a beta environment, but still push the actual code down the pipeline.
I'm actually more confused after reading that. I assumed that you meant they tested in production on purpose, but it sounds, at a skim, like they do have non-prod testing environments - in fact, it looks like they've gone to having multiple beta environments for every service?
My understanding is that they have a "tenancy" variable in every service call which can take a different code path. They seem to only have one environment for everything and do tests/experiments at code level based on this variable.
That might be true, but when you take into account the global usage of Slack and the respective time zones, more than half the world would have signed into Slack this morning before SV had, and I certainly didn't notice any downtime this morning in my time zone.
What would make that strange? Where I work it is frowned upon to do releases on weekends, and so bad changes due to build-ups happen on Mondays.
Although we also don't close the pipeline for just any holiday break. In fact, low holiday traffic is a good time to keep pipelines open, since changes will impact fewer people.
I have definitely worked in places where the times right before and right after a change freeze were the most unstable, so that could be it. However, as others have mentioned, it's pretty early on the west coast of the US. Unless some engineer was up extra early (perhaps at the behest of an anxious project manager) it seems unlikely to be a release.
What it could be is some engineer somewhere coming in after the holiday, noticing a slightly flaky thing, and thinking, "I'll reboot/redeploy/refresh this thing so the flakiness doesn't get worse". Only it turns out the flaky thing was a signal of something else falling over. Or maybe the redeploy was the wrong version because of bad CI/CD, or maybe the person just fat-fingered it.
It varies a lot by team... I think it's common to have a single click "start" button to press. It's a good sanity check that a release isn't going to happen during a fire drill, outage, or strike...
Very possible. I don't know what Slack's workforce distribution is. In places I've worked there have definitely been some incidents in US off-hours triggered by someone on the other side of the world.
Another common cause is resource exhaustion as a result of poorly monitored resources (or bugged monitoring). For example, Google's authentication was down because their system wrongly reported an available quota of 0. The last two incidents at my company were also related to resource exhaustion.
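A tiny illustrative guard against that "reported quota is 0" failure mode; the functions here are stand-ins for whatever your quota system actually exposes, and the point is only to alert on implausible readings instead of trusting them blindly.

```python
# Sketch of a sanity check for "monitoring lied about quota". The two getters
# are hypothetical stand-ins for your quota/usage APIs.
def get_reported_quota(resource: str) -> int:
    return 0          # stand-in: pretend the quota system reports zero

def get_current_usage(resource: str) -> int:
    return 12_000     # stand-in: actual observed usage

def quota_reading_is_plausible(resource: str) -> bool:
    quota, usage = get_reported_quota(resource), get_current_usage(resource)
    # A quota of 0 while there is real usage is almost certainly a reporting
    # bug, not a real limit - page a human rather than start rejecting traffic.
    return not (quota == 0 and usage > 0)

if not quota_reading_is_plausible("auth-tokens"):
    print("ALERT: implausible quota reading for auth-tokens; check the quota system")
```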
This is one of the original arguments for going capital-A Agile. Make smaller releases more often, so that if something breaks, it's (hopefully) something small, and at least it's easier to trace.
(I'm not making a statement about whether that's good or bad or whether it works or whatever. Please don't read an opinion into it.)
This. If you roll many changes into a single deployment, you don’t know which change broke what. But if you have two or three weeks of commits waiting, it’s hard to do otherwise.
That's why good regression tests and CI are so important; in an ideal world (which we were close to in one of my projects), every change is pending in a pull request; the CI rebases the change on top of its upstream (e.g. master/main), simulating the state the codebase will be in once merged, and runs the full suite of tests. The build is invalidated and has to be re-run if either the branch or upstream is changed.
Now, caveats etc, this was a collection of single applications in a big microservices architecture, and as the project grows it becomes more and more difficult to manage something like this, especially if you get more pull requests in the time it takes to do a build. But it is the way to go, I think.
Anyway, since tests and CI are not definitive, you also need a gradual rollout - 1%, 5%, etc - AND you need a similar process for any infrastructure change, which gets more and more tricky as you go down to the hardware level.
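A sketch of that merge-simulation step, under the assumption that the CI runner just shells out to git and the project's test runner; the branch name and the pytest command are placeholders, not any particular CI system's API.

```python
import subprocess

# Rough sketch of the merge-simulation CI step described above: rebase the
# pull-request branch onto the current upstream tip and run the full test
# suite against that simulated post-merge state.
def run(cmd):
    subprocess.run(cmd, check=True)

def validate_pull_request(pr_branch, upstream="origin/main"):
    run(["git", "fetch", "origin"])
    run(["git", "checkout", pr_branch])
    # Record what we validated against; if either side moves afterwards, the
    # result is stale and the pipeline has to re-run.
    upstream_sha = subprocess.check_output(
        ["git", "rev-parse", upstream], text=True
    ).strip()
    run(["git", "rebase", upstream])
    run(["pytest", "-q"])  # placeholder for the project's full test suite
    print(f"validated {pr_branch} against {upstream}@{upstream_sha}")

validate_pull_request("feature/faster-unreads")  # hypothetical branch name
```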
An incognito browser would ignore all client-side cookies, so the Slack web client would not try to - say - resume a previous user's session or re-use any previously saved data.
Likewise, incognito mode will also ignore most cached web content, meaning all assets on the Slack web app will get loaded again from scratch. This "clean state" start could, theoretically, get around issues with old - potentially incorrect/outdated - assets being loaded, even though that really shouldn't happen under most circumstances.
It means that one is not sending a session cookie of any kind, and thus should be sent to a 100% cached version. No "Are you XYZ and want to log into ABC's Slack again?" box.
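If you want to check this from outside a browser, a cookie-less request roughly approximates what incognito does at the HTTP level. Whether you actually see cache/CDN headers depends entirely on Slack's (undocumented) setup; the header names below are just common CDN conventions.

```python
import requests

# Send the request with no cookies at all and look at cache-related response
# headers. x-cache and age are typical CDN headers, not anything Slack
# specifically documents.
resp = requests.get("https://slack.com/", cookies={}, allow_redirects=True)
print(resp.status_code)
for header in ("cache-control", "age", "x-cache"):
    print(header, "=", resp.headers.get(header))
```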
I'd like to take this moment to mention self-hosted, open source, and federated alternatives like XMPP and Matrix.
I'd like to, but unfortunately I don't feel like I can in good faith. Matrix is woefully immature, and suffers from a lot of issues, but I think it is closer to being a functional Slack/Discord alternative. XMPP is much more mature, and works very well for chat, but doesn't have a nice package that does all the Slack stuff--at least not that I'm aware of. I'd love to be proven wrong there. I know it can be done, but if it can't be deployed quickly by an already overstressed team member, what chance does it have?
The problem is that XMPP and Matrix are protocols, not products.
Element (the primary Matrix software) definitely has Slack and Discord in its sights.
I don't think there are any serious "self-hosted Slack-like" contenders that are XMPP-based right now. You can piece components together (yay, standards!) and I did exactly this for the IETF's XMPP deployment recently. But it's far from being a cohesive easy-to-deploy product. Simply because nobody is building that right now. It takes time and resources and there's no money in it.[1]
People who do set out to build Slack clones (projects like Mattermost and Rocket Chat) and earn money don't have features such as federation on their priority list and don't build on top of Matrix/XMPP. They roll their own custom protocols and as far as I can see they are fairly content with that decision.
[1] There's even less money in it, but nevertheless I am currently working on such a self-hostable "package" for XMPP. However, rather than focusing on the team chat use case (Slack/etc.) I'm focusing on personal messaging (WhatsApp/etc.): https://snikket.org/ if you're interested. It's possible I will broaden the scope one day.
It's largely overlooked that the success of Slack & MS Teams is partly due to the cybercrime portal that email has become. IOW, you don't get phished in your org's Slack chats. To prevent phishing, any chat service will suffice; an open protocol isn't necessary, as you don't intend to engage with ppl outside your org.
The essential problem IMO is how to replace SMTP. No one has proposed and implemented an alternative, to my knowledge. So I decided to[1]. The current draft omits federation (although I wouldn't rule it out in all cases yet).
No, email has fundamentally bad UX for a lot of the use cases Slack and similar tools are used for.
> problem IMO is how to replace SMTP.
Sadly, SMTP is probably one of the parts of mail which have aged best. Enforce the usage of some (currently optional by design) features around authentication and the like, at the cost of backwards compatibility, and you have all you need from the delivery protocol.
BUT:
- IMAP and similar protocols are much worse.
- Mail bodies are a big mess; it's always fascinating to me that mail interoperability works at all in practice (again, you could clean it up a lot in theory, but backwards compatibility would be gone).
- DMARC, DKIM and SPF, which handle mail authenticity, have a lot of rough corners and, again for backwards compatibility, are optional. Again, it's not too hard to improve on, but it would break backwards compatibility.
The main reason mail still matters is its backwards compatibility, not just with older software but also with new software still using old patterns because of the (relative to the gain) insane amount of work you need to put into all kinds of mail-related components. But then exactly that backwards compatibility is what prevents any of those improvements.
(Yes, I have read the "Why TMTP?" link, and I have written software for many parts around mail, including SMTP and mail encoding. The idea that SMTP is at the root of the problem seems very strange to me, especially given that, as I mentioned, literally every other part of mail is worse than SMTP by multiple degrees...)
EDIT: Just to prevent misunderstandings one core feature of mail is the separation of mail delivery and mail authenticity, in the sense that you don't need the mailman to prove the authenticity of a mail. At most the legal/correct/authentic delivery.
By "replace SMTP" I mean the whole email protocol stack, not only SMTP. I'm not proposing to replace it for all situations overnight; of course SMTP etc will be used for decades.
TMTP also covers most IMAP/POP use cases. And it allows short, plain-text messages (see Ping) to make first contact with others -- necessary when that server has less restrictive membership requirements.
Authenticity is a double-edged sword. For certain confidential content, you want the recipient to know that it originated with the sender, but you don't want anyone else to know that in the event the content is leaked or stolen.
I believe the extinction of email for person-to-person & app-to-person correspondence is a foregone conclusion, due principally to phishing. The question is what should we do now, and the answer is clearly not chatrooms (which are of course useful in certain circumstances).
Email is not a chat system, and chat systems are unsuitable for asynchronous long-form threadful discussions. There is some overlap, but combined they form a spectrum of communication modes so wide that it can't be covered by a single UI.
I would argue that email is not suitable for asynchronous long-form threadful discussions. The limitation that email has is that if you're not part of that conversation from the beginning, you'll have to piece it together from previous quoted material.
One email-like protocol that properly handles this is NNTP.
True regarding the late-comer aspect, although it is less of an issue when using mailing lists with an archive. In the past, when lacking an archive I also just asked another participant to send me the earlier discussion in mbox format, which was easily accomplished with the unix MUAs of the time.
Regarding the actual modes of discussion I was thinking of though, usenet and email are mostly the same.
> Regarding the actual modes of discussion I was thinking of though, usenet and email are mostly the same.
For the most part, they are and many readers support both protocols (or at least they did in the past). The nice thing about NNTP is that it doesn't require maintaining a separate archive or having someone send you an mbox file to import. Just subscribing to the appropriate groups was sufficient (depending on the article retention policy).
Why would you replace it? Won't disabling all public unauthenticated submissions on your mail server suffice? You can also prevent delivery to the outside world (and error out on submission so that users are notified) if you really like. The result will be your own private mail server.
And you can keep using all the normal MUAs on desktop and mobile.
Changing your SMTP server configuration that way would break things, so the question is whether to set up a new, company-internal SMTP server, and give your employees new addresses there. But that won't quickly stop the phishing, because your ppl still need to get email via the public network from clients and suppliers.
Setting up a new server isn't easy unless you hire an outside service provider, and if you're willing to do that, Slack et al offer a nicer UX than the well known email/webmail clients.
Orgs with sufficient IT resources commonly do run internal SMTP servers.
Yes I'm old enough to remember when organizations had email but it was internal-only. Probably less for security reasons at the time than that they simply didn't have an internet provider. There were also mainframe-based email systems that were internal to that network.
You're making some fundamental assumptions about federation that I think are completely wrong. Are you telling me that you never need to communicate with anyone outside of your organization? How do you intend to receive invoices? How will you communicate with outside vendors? Sorry, but you need some text-based way of communicating with people and email is the best way, that's why it's survived so long despite being problematic. If you have internal, asynchronous chat, why would you need internal email?
Sorry my dude, but business runs on email. Saying lets get rid of it is as naïve as saying lets get rid of Excel. It's just not going to happen.
> To prevent phishing, any chat service will suffice; an open protocol isn't necessary, as you don't intend to engage with ppl outside your org.
The same could be accomplished with email if you only allow connections to the SMTP and IMAP server from within the corporate network. That is, nothing external can connect to those servers, which is fine if it's only used for internal communication.
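As a sketch of that restriction, expressed as the allowlist you'd enforce in whatever sits in front of the SMTP/IMAP ports; the CIDR ranges are just typical RFC 1918 examples, not anyone's real network.

```python
import ipaddress

# Sketch of "internal-only mail": only allow connections to the mail ports
# from addresses inside the corporate networks. Example ranges only.
INTERNAL_NETWORKS = [
    ipaddress.ip_network("10.0.0.0/8"),
    ipaddress.ip_network("192.168.0.0/16"),
]

def allow_connection(client_ip: str) -> bool:
    addr = ipaddress.ip_address(client_ip)
    return any(addr in net for net in INTERNAL_NETWORKS)

print(allow_connection("10.12.34.56"))   # True: internal client may reach 25/993
print(allow_connection("203.0.113.7"))   # False: external host is rejected
```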
XMPP is supported by a large number of clients, but running a server and getting everyone on clients with comparable featuresets is a nightmare. It’s a cluster of disparate standards, and it’s overwhelming. I’m sure it’s doable if you have the time to invest, but it’s not straightforward if you’ve never done it before.
Matrix is pretty straightforward on the server side of things, but the client UX is invariably mediocre. Vector—the official client—exemplifies everything that is wrong with Electron apps. Slow, clunky, poor UI, poor platform integration. With the default home server, it can take seconds for a message to go through. At least it’s far more customizable than Slack; it has an option for everything, which, as a power user, I quite like.
I haven’t tried Mattermost, but it looks like some of the important features aren’t FOSS, at which point it’s just another Slack as far as I’m concerned. I’ll gladly pay for support, but for SSO? Meh, might as well stick with Slack; at least everyone and their dog knows how to use it. (This is, of course, an opinion that stems partially from ignorance; I haven’t actually tried Mattermost, and if I do, I might fall in love with it. But my time is limited, and I can only evaluate so many products in a day.)
Not that Slack is much better here: their threading system has so many UI/UX issues. Ever had a thread with hundreds of messages? For your own sanity, I hope you haven’t. Ever tried to send an image to a thread from iOS? It’s possible, but only by pasting the image into the text field; the normal attachment button isn’t available, and Share buttons in other apps can’t send to threads. And, of course, the recent uptime issues.
Element (formerly Riot/Vector), has improved loads over the years, and the default matrix.org average send time is around 100ms these days rather than multiple seconds: https://matrix.org/blog/2020/11/03/how-we-fixed-synapses-sca... has details. I suspect you (and the parent) may be running off stale data.
That said, Element could certainly use less RAM, irrespective of Electron - and http://hydrogen.element.io is our project to experiment with minimum-footprint Matrix clients (it uses ~100x less RAM than Element).
Rather than storing state from the server in the JS heap, new state gets stored immediately in indexeddb transactionally and is pulled out strictly on demand. So, my account (which is admittedly large, with around 3000 rooms and 350K users visible) uses 1.4GB of JS heap on Element/Web, and 14MB on Hydrogen. It's also lightning fast, as you might expect given it's not having to wade around shuffling gigabytes of javascript heap around the place.
I've wanted to try something like this (on a smaller scale), but haven't had time. It's good to hear of an implementation that reflects my expectations. How long did it take you to migrate over?
It has, and I’ve been using it since its early days. I still use it. It’s still terrible, just slightly less terrible. And, no, messages don’t consistently send in 100ms on the default home server; there are regularly disruptions that cause significant delays, sometimes as much as 10-20sec. That’s a big problem for a federated chat platform.
Edit 1: I want to love it; the design is everything I could ever hope for in a chat platform. I even tried to contribute to Vector, but it was such a mess that I eventually gave up.
Edit 2:
> That said, Element could certainly use less RAM, irrespective of Electron - and http://hydrogen.element.io is our project to experiment with minimum-footprint Matrix clients (it uses ~100x less RAM than Element).
I'm not sure why this is a priority. Techies complain about RAM usage a lot, but if we have to choose between performance+power and a small memory footprint, we're going to choose the former almost every time. Take Telegram, for example: they have a bunch of native clients that perform amazingly well, although they do gobble RAM. Most of my technical friends use it as their primary social platform. It's not without issues, but it's really hard to go from something like Telegram Desktop or the Swift-based macOS Telegram client to Vector. And those clients aren't made by large teams--most (all?) first-party Telegram clients are each maintained by a single developer, if I'm not mistaken.
The constant rebranding and confusion over Matrix/Vector/Riot/Element is another point of pain for me. It’s incredibly difficult to communicate unambiguously about Matrix with people who haven’t been following it for years.
Does Element refer to the ecosystem as a whole, including EMS? The primary client? The core federation? It’s not obvious from a casual visit to element.io. I suppose if I said “Element web app,” that would be fairly clear, but I’m still in the habit of saying “Vector” from the days of Riot.
Everything related to the company formerly known as New Vector is now called Element. The company is Element, the official clients are called Element (with suffixes Web, Desktop, Android, iOS) and yes, EMS is Element Matrix Services. This rebranding was done specifically due to the confusion brought on by the many previous names. More info here: https://element.io/previously-riot
Yes, I actually like the change—I think they finally got it right this time. (The Riot rebranding was a mess.) However, it’s still frustrating when trying to communicate with people who aren’t following Matrix-related news. In my circle of friends, “Vector” remains more widely understood than “Element Web,” so that’s what I’ve been using.
Anyway, my point stands: Element Web/Desktop feels fairly unresponsive compared to something like Telegram Desktop. It looks so much nicer now, the UI layout is great, and it’s far more powerful than Telegram—yet, I can’t help but feel like I’m swimming through molasses even when dealing with moderately-sized groups. Try clicking around on different groups rapidly; you’ll likely find that you have to wait several seconds for the UI to update.
>XMPP is supported by a large number of clients, but running a server and getting everyone on clients with comparable featuresets is a nightmare.
XMPP is, well, extensible. If things don't match in the clients then that particular feature just doesn't work. These days all the clients pretty much try to match the feature set of Conversations. That applies to the servers as well. There is a server tester for that:
I recently finally found someone to try Element with and the experience has been great so far. (Except for the need to go through Google's captcha at some point.)
It even sent a 20 MB MP4 like a champ, while Conversations sometimes chokes on photographs that aren't even that high resolution...
I can only offer my own personal experience: Matrix has been working well for me for a couple years now. However, I probably have a more narrow use case than you're thinking of.
I run a small homeserver and use it to communicate with a group of about 20 friends. Most of them aren't "technical" people. We use it mostly for chatting and image/video sharing. We never use live calling (audio or video).
There have been a few bugs in the mobile apps, but for the most part, everything has been working fine.
The biggest issue is the UX. It's not as polished as the big players.
This is actually the use case I've been trying to get to for some time. Unfortunately, I need it to "just work" to get my non-techy friends interested, otherwise they'll go right back to Discord.
Like I said, it's close, I just don't think it's there yet.
I'd say it's almost in "just works" territory for everyone except the person who has to actually administer the homeserver (me). I absorb a lot of the complexity for my friends.
The only thing that's a little cumbersome is requiring them to enter a custom server URL when they register or log in for the first time.
For competing with Discord, it seems like it would benefit from a more robust free offering. Being able to create a free Discord server is great, and it is incredibly capable for most communities that don't need the fancy perks of Discord Nitro etc.
> The free Discord plan provides virtually all the core functionality of the platform with very few limitations. Free users get unlimited message history, screen sharing, unlimited server storage, up to eight users in a video call, and as many as 5,000 concurrent (i.e., online at the same time) users.
For a lot of small communities that aren't focused around commerce of any kind, Discord's free offering blows Element Matrix Services out of the water. It's a non-starter. If I could create a server with feature parity to Discord's free offering, I would jump on EMS in a heartbeat for any new community I create, and I'd start trying to move communities currently within Discord over to EMS.
A very normal progression for Discord servers is that some niche sub-community wants to gather, so they create a free server; people join, all kinds of rich content gets posted and curated, great discussions happen, and then, as it gets bigger, the people running the community or people who want to support it will boost the server with Discord Nitro for additional features like more slots for custom emojis (I can't communicate enough how important this feature is to Discord's success, even though it seems like minor window dressing).
That kind of model is what would justify a server starting to shell out money every month for EMS. I would note that Discord's pricing for this kind of level of community is tiered and not a per-user thing. You unlock more features based on how many users are paying for Nitro, going up a tier based on breakpoints of 2/15/30 Nitro Boosts per month. It doesn't cost more to have a tier 3 server if you gain more users. This is a big deal for fostering growth and unseating incumbent social networks (which is what Discord and Slack are).
Just some thoughts. I really want stuff like Element/Matrix to succeed!
> and use it to communicate with a group of about 20 friends. Most of them aren't "technical" people.
I'm insanely curious about the human side of things here. How did you get them to buy into this idea in the first place? That sounds like quite an achievement.
The non-technical folks in my life generally struggle with paths of least resistance (iMessage, etc) and it's hard to imagine getting them onto some alternative platform/protocol.
It did take some persuading. I think the main reason I was able to pull it off was ironically because they're not that technical. I bet most of my friends don't even know what Slack or Discord is. That's not to say they're dumb or anything - they just don't spend as much time online as one would think.
Previously, we were mostly using group texts or Snapchat/Instagram to communicate, so the biggest selling point was the fact that we can share full quality pictures and videos between iOS and Android people.
This is awesome. I have always wanted to self-host a Matrix instance as well, but I imagine it's going to be very hard to convince them to move over, from Telegram. Is there a blog post that I can read about homeserver setup? I am keen on seeing how easy it was, and keen on seeing what level of technical and financial resources you had to invest to get going.
For my part, I don't have buy in yet (Because I'm not convinced Matrix is ready) but I think I could get it. I have 7 or 8 friends who do not use Discord except to talk with me and a few other friends that I know can be convinced to at least start using Element next to Discord. Once I feel like my homeserver is in a state that I can invite these non-technical people in, I'll be in the same place.
You bring up a good point, however, which is that we _could_ use open source, non-centralized alternatives for many of the online products we consume, but we choose not to, and so we increasingly become slaves to corporations that actively seek to narrow our choices. Another example of this is the push from big sites like Reddit to use their apps rather than just use a browser - it’s not about functionality, it’s about destroying the free and open web.
> You bring up a good point, however, which is that we _could_ use open source, non-centralized alternatives for many of the online products we consume, but we choose not to, and so we increasingly become slaves to corporations that actively seek to narrow our choices.
That doesn't happen for no reason. The vast majority of open source products I've used have terrible usability. I simply don't want to use them. I don't want to be beholden to corporations and walled gardens, but for me, the existing alternatives are worse in too many ways.
Or, or... and bear with me here... or, packaged click-button solutions with paid (contractually obligated) dedicated product support is a better use of our short time, more often than not.
That only works if you only need to use Slack alone or whatever. The moment you have to use more of these annoying services at once and manage N different stupid client apps for Y different platforms (desktop/mobile), the lack of open/shared protocol becomes a major issue. Let alone if you want to use them on emerging mobile OSes that are not a hellhole of data thievery.
Everything goes down. But it looks like huge complicated distributed services shared by huge amounts of people, that are continuously updated and developed, and are constantly trying to attract more users/load, seem to go down more than a simple service on a simple server.
No hard data though. My mail server only ever went down when I upgraded the server and didn't check that everything was still working right away, or similar maintenance induced incidents. It never went down by itself.
Such systems only ever go down unpredictably on HW issues, or when overloaded/out of resources. Neither is very likely, because you're not trying to grow your service in any sense similar to VC backed enterprises. Most of the time it has constant very low load and resource use. And you can simply stop introducing changes to the system if you need more stability for some time. (stop updating, for example)
XMPP killed XMPP. It's just not very good. It doesn't work well between different clients and servers. The protocol is a horribly overcomplicated mess of overlapping, partially supported extensions for basic functionality. And it doesn't work at all with low power mobile delivery. (It was invented before the iPhone.)
There might have been political reasons why google dropped XMPP, but it would also make sense as a purely technical decision.
> And it doesn't work at all with low power mobile delivery. (It was invented before the iPhone.)
This is plain untrue. Yes it was invented a long time ago, but thanks to the extensibility it has evolved over time just as the way people use it has changed. This evolution is a healthy and necessary part of an open ecosystem.
I know it frustrates people that modern features don't work in stagnated clients such as Pidgin and Adium, but modern clients support all the things you would expect.
Servers and mobile clients have supported mobile-friendly traffic and connection optimisations for many many years now.
> There might have been political reasons why google dropped XMPP, but it would also make sense as a purely technical decision.
Google contributed extensions to XMPP, the same way they contribute to other internet standards. I think they were quite comfortable with this. The XMPP-based Google Talk was their longest-running messaging solution after all...
> And it doesn't work at all with low power mobile delivery.
What makes you think so? If Conversations was draining my battery, I would have noticed by now, I'm pretty sure that Facebook Messenger is worse in this aspect...
Maybe things have changed - certainly when I looked at it a few years ago (around the time that google stopped supporting it) my understanding was that xmpp had no push notification support. The app in the phone had to either poll or explicitly hold open a TCP connection. (Which is problematic when the app is backgrounded.)
I was recently forced to use Facebook Messenger (thank God it's soon over), and I'm hating it: it's slow on mobile, and even worse on PC, where it regularly makes my whole OS hang, requiring a reboot.
Scrolling back is atrociously slow, and it doesn't even seem to have a search feature!
I'd take XMPP alternatives like Conversations, Jitsi, Pidgin any day! (And Element of course.)
XMPP is hardly killed. There are tens of thousands of XMPP servers out there with over a hundred public servers. There are lots of client implementations. Even the really bad implementations manage basic messaging.
Matrix with Element (Riot) as the front-end is pretty close. It does what slack does, it's just not very good. XMPP is arguable. It can be a Slack alternative, if you stitch enough other servers on top of it. Personally, I don't think XMPP will ever be more than chat, but some of its adherents believe differently.
Mattermost is certainly not what I meant. That's just trading one Slack for another.
This thread is about a Slack outage, which you have no control over. Mattermost and similar software is self-hosted, which of course doesn't mean you're getting 100% uptime, but you have (more) control over it.
In practice self hosted usually translates to more downtime and slower performance when it works. Unless your org has more expertise running a chat service than Microsoft or slack, your self hosted alternative is always going to suck more.
Did you try self-hosting, and did it lead to more downtime and slower performance?
From my experience, when I self-host stuff it's a lot faster (more server resources) and I've never had any downtime (a server doesn't simply go down for no reason).
I haven't tried it personally. But my employer hosts an on-prem Github instance and it is just terrible. So many downtimes, long times before anybody gets around to repair it, general performance issues, maintenance windows for upgrades, etc. Just a huge pain. I've seen this sort of problem with the old Exchange on-prem services too.
IRC may be out of fashion these days, but at least deploying a small IRC server for your own team is really not that much effort anymore, and it doesn't incur that much ongoing maintenance work either.
I suggested RocketChat when the outage was announced and HN community downvoted it quite heavily. I'm not sure why. [0]
We ended up making the switch and committed to Discord. We're now looking at Rocket.chat as a backup in case Discord goes down. But Slack is now completely out of the picture for our team.
Just curious - why not use Mattermost as a backup? (disclosure: I work at Mattermost, but really just want to know what you think)
I've advocated for an idea where Mattermost is hosted on a Raspberry Pi (or somewhere else) and acts as a digital "bunker" if your critical infrastructure (Slack, Teams, Exchange?) is somehow compromised.
Not OP. Good idea. I thought it was integrated into GitLab (on-premise Omnibus), but I still haven't really fiddled with it; I enabled something in the config file, but nothing happened.
I know it's a tough spot, but if it were usable from GitLab with zero config that would be great for fallback.
Thanks for your reply! I gave it another try, and it works now beautifully :o (Possibly last time I tried it, there was no LetsEncrypt integration and the external URL setup was more involved?)
My experience with RocketChat is that it works quite well on the surface, but after using it for some time, some very annoying bugs emerge:
* You get notifications for channels for which you have suppressed those notifications
* Some channels are marked as having new notifications when they don't
* Notifications for new messages in threads you are involved in are quite hard to find (horrible UX)
* Some UX choices are very confusing (you get a column of options related to notifications, and for some, the left option is the one leading to more notifications, for some the right option)
* There are some overlapping features that lead to inconsistent usage (channels vs. discussions vs. threads)
* Threads are hard to read, because follow-ups in threads are shown in a smaller font size. You cannot increase the font size at all in the desktop application
.... and so on.
Also, I tried to submit some bugs, but for that I'd need some information that only the admins who run this instance have, and in the end it was too much effort to get all that information together, so I didn't even bother.
I agree, it's still got a long way to go. I'm saying it's still a perfectly viable alternative to Slack. Fast, simple, works. (At least this is my perception/impression. I wanted to evaluate Slack alternatives for some time, but haven't got the time for it yet. So I was surprised when I got an invite to one of our client's rocket.chat instance and things worked pretty well.)
I'm in a Slack workspace that is constantly notifying me of a thread, but I can't mark it as read. Maybe it has something to do with the free message limit. So the message is there, but cannot be accessed. Annoying as hell. I thought about submitting the bug to Slack, but then just let it go, and probably we'll just move to Signal or something.
If the benefit we are looking for is better uptime, that will not happen.
The main benefit is going to be knowing why the system is down, and the ETA for it being up again.
That takes care of the software and protocol side of things, true, but does it give more reliable and predictable uptime? That's the main thing here; while there are plenty of software alternatives to Slack, their product is not just the software but also the hardware, servers, and scaling. You can get a Slack instance from 10 to >10K members without ever having to worry about your hardware, or how many hours your staff needs to spend on maintaining said hardware. And when there is inevitably downtime, you and your staff don't have to scramble to get it back up - with this outage, it's a shrug, it's down, it'll be back soon probably, I'm going to do some other work or do something else. Extended toilet / lunch break.
How often is Slack/Discord down? I mean it's not perfect, but I really honestly don't think I could match their uptime by self-hosting, as well as more on-call rotations for something that's not core product.
I very much prefer that for something that isn't core product, if it goes down I need to do exactly nothing for it to come back up, and that the engineers at Slack will be starting to work on it likely before I even realize it's down.
This is a tale SaaS vendors (which have a strong presence in online tech communities like HN because they are software companies) sold very well, and it's probably true for many small startups, but for medium-sized companies, managing their own platform for something like Slack is completely doable, and you will not have big downtimes like Slack's. Sure, you have to dedicate time and resources to it, and obviously it's not "core business", although a chat platform is a pretty important component in an online company.
I would be surprised if you couldn't match or exceed Slack's uptime running whatever alternative you want (IRC, Mattermost, Rocket.Chat, etc.) on a random dedicated server.
Hardware is quite reliable these days.
And updates can be scheduled to be at a convenient time for the team.
If you are the only technical person on your team then it's of course not ideal and would require some further thought into making things redundant.
But even that is easy enough to do with IRC (set up two servers, link the IRC servers together, have a single DNS record that points to both servers - job done).
If there are other people on the team that have _some_ technical skills then they can fix it..
IRC lacks quite a few features compared to other solutions, but the reduced complexity does bring very low operational complexity.
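That two-servers-plus-one-DNS-record setup mentioned above works because clients can simply walk the resolved addresses until one answers. A rough sketch of the client side, with a made-up hostname and the standard TLS IRC port:

```python
import socket

# Sketch of the fallback you get "for free" from a DNS name with two A
# records: resolve the name, then try each returned address in turn until one
# accepts the connection. Hostname and port are examples.
def connect_irc(host="irc.example.internal", port=6697):
    last_error = None
    for family, socktype, proto, _, sockaddr in socket.getaddrinfo(
        host, port, type=socket.SOCK_STREAM
    ):
        try:
            sock = socket.socket(family, socktype, proto)
            sock.settimeout(5)
            sock.connect(sockaddr)  # first reachable server wins
            return sock
        except OSError as err:
            last_error = err        # that server is down; try the next address
    raise ConnectionError(f"no IRC server reachable for {host}") from last_error
```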
IRC will be incredibly hard to use for non-technical people on your team. Mobile clients for IRC look like crap, and have horrible-looking ad bars. No integration with Google Drive, Github, or other things.
It's just not a business-friendly tool.
I'm an engineer and personally I'm fine with IRC, I'm just trying to be realistic here.
Is it really though? If I take a look at a random modern IRC desktop client - how is it more difficult to set up than, say, your email program?
The amount of information needed on setup is about the same: server, username, password (in fact email can get a bit more confusing in big corporate email setups with differing imap and smtp servers, etc.)
Reality check: Most people don't use email programs anymore.
Also how do you get IRC to sync all conversation data, history, between your several desktops and phones, how do you send files, make calls, and thread conversations?
They are moving the goalposts because there are several and ultimately very many reasons why IRC won't work, they just didn't bother to think of all the reasons and list them at once.
Ultimately there is only one reason that matters: The person in charge of deciding what communication channel to use likes Slack/Teams/IRC/whatever.
Add to that the SaaS propaganda that hosting literally anything yourself is just too hard (it really isn't).
Or this notion people are just too stupid to deal with anything more than the simplest possible web interface - Really? what do those people even do? Stare at Notepad all day? Of course not. They stare at various complicated software packages ranging from CAD, $spreadsheet abominations, SAP to various Adobe software packages.
Sprinkle in a bit of hype for the latest new thing and presto..
</rant>
I'd bet hard money that within epsilon of anyone using a desktop email client in 2020, and thus having one to set up in the first place...is in an organization with access to Microsoft Teams.
Who deals with the downtime if any other on-premises system goes down?
If you are running networks and software on site, and they are business-critical, you have people and a plan for this. Or you don't, and suffer the consequences.
There will always be more downtime on Slack/Discord. There are more users, more updates. Slack/Discord is a giant distributed system with nodes all around the world. An IRC/XMPP server on one machine that 100 people use is not going to crash unless someone does it intentionally.
People really overestimate the difficulty of running self-hosted systems with great uptime.
When self-hosting you can get away with simpler systems that end up being more stable and having higher uptime for lower effort.
The reason you see cloud providers having issues is not because the thing is difficult, but because doing anything at huge scale ends up being difficult.
> it's not perfect, but I really honestly don't think I could match their uptime by self-hosting
This is such a common misconception. The services I self-host were configured by me; if anything goes down (which very rarely happens), I know the exact cause and have it fixed in minutes. When some company's cloud service goes down I'm completely at their mercy. I also spend very little time on maintaining these services, just security updates, which are mostly automated.
Bottom line, maintaining and self-hosting services that have one or a few users is much less complex than running services with millions of users. Hence, my uptime is better than Google's, Amazon's, Azure's, etc.
1. There is virtually zero user-facing documentation. Need to know how to backup keys, verify another user, or what E2EE means? Ask your server operator. Basically the onus is on operators to document this stuff for their users. Except the stuff we're documenting is hard even for server operators, and especially challenging to document in a way that both nontechnical and technical users can understand.
2. Because this stuff is challenging even for more technically minded users to understand, it leads to a kind of burnout for interested non-technical users: they learn all they can about some feature and how it works at a high level from out-of-date random blogs, try to use the (complex, multi-step) feature, but then something won't work, and it isn't clear whether it was because the user did something wrong or because the client or server implementations are broken
3. Issues where core functionality is broken (e.g. two mutually verified users on my homeserver haven't been able to talk to each other in months -- see [1], [2], [3]) languish for months with zero response from maintainers.
4. While core functionality is both broken and undocumented, the maintainers announce rabbit hole features that no one asked for and seem very much like distractions, like their recently-announced microblogging view/client[4]
In short the Element maintainers have shown little interest in making the platform accessible to the people who need its differentiating features the most, and have prioritized the "mad science"/technical aspect of their platform at the expense of the human element (end-users and operators).
It'd be cool if Element used their resources to hire some UX folks and community advocates whose sole focus is addressing the horrid accessibility of their platform. I think most users would rather see that than further "mad science".
Yep, I have, although funnily enough it turns out that the rage shake feature was the only way to submit a bug report with diagnostics from a client (as of a couple of months ago anyway) and that feature itself was broken for one of my users (who has since churned).
That FAQ is a great start, but it's not sufficient for non-technical users. It's not easily searchable, it doesn't provide screenshots, and it doesn't go into enough detail for each item (e.g. describing what can go wrong + troubleshooting).
I am genuinely surprised that Slack wasn't ready for people to come back from holiday, to view increased queues of unread messages, to have to manually login vs. having auth tokens or cookies, etc. Either that, or they had a cosmically coincidental outage on a really bad Monday to have it.
It's bad enough team comms go over Slack so much now, at least we have email fallback. What scares me is for the teams that use Slack for system alerting.
My coworker's theory was someone was waiting for the holiday's end to deploy something risky.
And I'm in that boat of depending on Slack for alerting... in fact my team was also waiting over the holidays to deploy more robust non-Slack-based alerting (in our defense the product is only a few months old and only now starting to scale to any real volume).
I wouldn't be surprised if it's actually a combination of a new feature being recently rolled out, along with the sudden spike in load this morning.
The holidays are actually the perfect time for Slack to roll out a risky deployment, as it has to be their lowest usage time. So it would make sense if something was pushed out last week or the week before. And everything probably seemed fine.
And then this morning they suddenly realize this new feature does not perform under load. And to make matters worse, the new feature has been out long enough to make any sort of rollback very tricky, if not impossible. Which means they'd need engineers to desperately hack out, test and deploy a code fix.
If this is the scenario, I do not envy them at all.
Holidays are a good time for a company to do a risky deployment, but a bad time for an individual employee to do a risky deployment, assuming one doesn't want to work overtime over the holiday fixing things.
Depends on how well compensated holiday overtime is. There are some employees happy to work overtime if their hourly pay is doubled or tripled. However, there are also those who wouldn't do that for any price.
Depends how badly it goes wrong. My org is a 24/7 one, but one Christmas back in the 90s (way before my time) some work was done on Christmas Eve, I think on the phone system, in the days before widespread mobile phones.
It broke, which was a major problem: it meant that senior management were being phoned (ho), and relatively senior middle managers were on site to deal with the fallout. Of course most suppliers were also closed, so everything was harder to fix.
There are good reasons not to do changes when places are closed, or at least skeleton-staffed, for 2 weeks.
I used to work for a place that had a FY that ended in summer. We had a lot fewer problems with stuff being shoveled out the door at Thanksgiving and Christmas, because nobody was trying to finish their year-end performance goals over the holidays.
I think what I'm implying is that management creates this issue, but we are complicit.
Yeah, I think it's this rather than load. Slack should be able to handle load fine (probably), but since this is the first weekday post-holidays I imagine some deployment broke something.
Slack has been in business for several years and has survived several December to January transitions, including several people stopping using their product before Christmas and then returning early January.
It seems a bit presumptuous to assume that's at fault here, given their age.
The two cliched sources of this problem are 1) someone pushed something out over the holidays that could have waited until January, or 2) peak capacity was negatively affected since the last time a spike happened, nobody had a way to monitor it, and so this has been broken since the end of May. On further reflection, someone will admit that they noticed a notch-up in response times and did not connect the dots.
We have an alert channel in Slack, but it's mostly ignored. Our primary alerts come via SMS/VictorOps.
At one of my old jobs, we had SMS via two physical/hardware devices in our data center. One had a Telstra SIM card and the other had an Optus SIM card. (They were plugged into the same machine, but we had plans to put a second one in another data center before I left).
If you really care about alerts, you should have physical hardware doing your SMS messages via two different points of presence.
Now is a good time to recommend to your engineering org that they should have multiple alerting methods, e.g. Slack plus Pagerduty, or Slack plus email.
Hopefully email won't be your backup. I've seen that done. Alerts get filtered and ignored, often by accident.
Having been in this situation before, with a totally-down-and-not-coming-back-up outage of a payments system, I really feel for their incident response team.
I'll take this moment to remind everyone of their human tendency to read meaning into random events. There's no evidence to suggest New Year traffic has caused this, and outages like this can happen in spite of professional and competent preparation.
Hugops for their team, I hope they get it back soon.
> I'll take this moment to remind everyone of their human tendency to read meaning into random events. There's no evidence to suggest New Year traffic has caused this, and outages like this can happen in spite of professional and competent preparation.
On the one hand, sure we don't specifically know what's going on. On the other hand, it's the first Monday in the new year and they went down shortly after the start of the business day Eastern time; it could be coincidence, but it would be a remarkable coincidence.
There are a load of ways NY might have contributed to this, but it may not be a direct cause. What's more likely, Slack forgetting to scale their deployment back up after too much mulled wine, or a number of people on holiday meaning a simple failure has developed into something more serious?
It could be anything really- my post was more about how situations like this can happen to even the most prepared. The assumption it has something to do with NY tends to assume very trivial, silly mistakes. Especially with no information, that seems a bit uncharitable.
It seemingly worked OK in UTC+2 in the morning and early afternoon, then started having issues and is now a bit intermittent (or fixed; there's not much traffic on my channels, as it's evening already). Do they have that much more traffic on the US east coast than in Europe?
Probably, but it was only 2-3pm UK time when it started falling over so there would be all the Europe traffic plus the East Coast traffic starting to sign in.
I'm posting this because I found a lot of people don't know that Zoom includes a complete chat client that includes channels.
And #HugOps to the engineers at Slack working on this. I appreciate that they posted a periodic update even when there was no news to report: "There are no changes to report as of yet. We're still all hands on deck and continuing to dig in on our side. We'll continue to share updates every 30 minutes until the incident has been downgraded."
> Like many companies in the last year we've switched to using Slack to improve internal communication. [...] Since Slack doesn't offer an on-premises version, we searched for other options. We found Mattermost to be the leading open source Slack-alternative and suggested a collaboration to the Mattermost team.
I'm not really sure why it's 'GitLab Mattermost' and not (at your link) 'GitLab Nginx' et al. though.
They posted a giant list of the services they use recently.
They use a ton of services.
Likely you don't want your backup to be one of your systems and another part of the company probably uses Zoom already so it is probably easy to fail over to that.
This includes many proprietary ones; we generally choose the product that will work best for us, considering the benefits of open source but not excluding proprietary software.
Mattermost is not part of the single application that GitLab is. It integrates well with GitLab, and our Omnibus installer lets you install it easily, but it is a separate application from a separate company.
It's just due diligence. Think of what you have access to if you have "god mode" on corporate chat: HR, the CFO's DMs, private messages between other coworkers, and so on. Most won't fall for this temptation, but even those with strong anti-spying morals can be weakened by circumstances. Best to remove the temptation by design.
Because someone else's devops can't use it against you institutionally. Nor is going to insist on having an opinion on things that they're unaffected by.
This isn't a slam at devops, it's about the need for institutional information hiding; not everyone needs to know about and weigh in on every decision being made.
We structure our company similarly. In effect, DevOps is god on everything except HR, Sales, Finance, Chat, and C-level management, which are operated with 3rd-party services controlled by the individual departments and "owned/managed" by the C-suite.
DevOps at a lot of small companies also manage the internal IT stack and sometimes even take on most of the IT duties. Once you get larger you start having "IT" as something separate from DevOps but with the actual infrastructure managed by operations. Once you're really big the teams are truly separate and IT owns their own infra.
As someone who has to use Zoom Chat to interact with a client on a daily basis, please, do not recommend Zoom Chat to anyone except as an example of how not to do chat software.
--
Though, I do agree wholeheartedly with your sentiment that the Slack team needs all the positive vibes they can get right now.
As someone who has to use Zoom Chat every day, this a thousand times. (We still run an XMPP server on the side just to avoid the horror that is Zoom chat.)
I continue to be impressed by GitLab's operations and documentation! Yes, others may have similar backup plans, but as an outsider, GitLab's handbook feels cooler if only because they publish their practices and processes and make them public. I'll caveat that I'm not really a fan of Zoom/Slack/Hangouts (I'm an unashamed fanboy of Matrix and its numerous clients), but GitLab's approach is still really neat! Kudos to GitLab!
Unpopular opinion, but WebEx beats the pants off Zoom. Of course, it's neither free nor open. But it does support strong end to end encryption and authentication and has regulatory compliance to a bunch of things, if that's important to you. I get that there is WebEx hate because "enterprise" etc, but we use it around here and it works quite well.
"Business takes the easy and ethically questionable route to continue making money" news at 11.
I'm not condoning Zoom's actions but this is hardly a problem unique to Zoom. Few if any businesses will stand up for consumers and citizens unless it's directly aligned with their profit motive. In this case, the business choice is to operate or not in mainland China. If they choose to stand up against the Chinese government they're going to have difficulty continuing to operate in China and risk losing that entire market.
Google played this PR game many years ago in China (rejecting some of the governmental policies) and ultimately caved to Chinese policies to do business there.
Businesses are not the organizations we should look to for empowering people; that's simply not their goal, no matter how much their marketing team may want to sell that idea by latching onto trending (popular) social movements they've already done market studies on to assess potential fallout.
I think it's a pretty bold claim to state that Zoom's actions aren't unique.
What other business in this space has given China unfettered access to US users and data? I'm not aware of it occurring with WebEx, Teams, or GoToMeeting. The "one rogue employee" thing falls flat pretty quickly when they're the only ones that had this issue.
This feels like their encryption thing all over again, there's an "oversight" that is equivalent to a backdoor that only gets fixed when they get caught.
I didn't realize they shared any user data outside China (misread the WP portion). It appears they did share 10 users' data which is a bit questionable but I'd hardly call that unfettered access to US data.
The fact is all of the US businesses operating in China give surveillance ability to the Chinese government for the Chinese users and are operating in an ethically questionable space being primarily based outside of China, at least in my opinion.
It's really not too different than the businesses sharing US citizen data to the US government, much of which Snowden and others before him exposed. I suspect there's a lot more surveillance going on everywhere than the general public know about and the businesses best positioned to do the surveillance are probably doing it.
Elaine Chao’s sister is married to Xi, while Elaine, as transportation secretary under Trump, was busted inviting family with business ties to the CCP to official US government meetings.
The fear on this forum is imagined political thriller more than realistic.
Every technologist is grifting off the military industrial complex.
The lesser of two evils and the product just works. They might have a few governance issues they need to fix. But at the end of the day, they signed a BAA with us and will take the liability and fallout of a breach.
One nation is currently operating concentration camps and arrests and seizes the property of prominent citizens who criticize the government. Are you sure that's an equivalence you want to draw?
Like Guantanamo Bay or prosecution of Assange for his journalistic work to expose wrongdoing of government? Or maybe you’re talking about for-profit prison system and mass incarceration practices? But you’re probably talking about China, right?
We have thousands of brown people in camps along the border, in brutal conditions, without access to healthcare (unless you count forced sterilizations as healthcare). Do you consider those to be apples as well?
Why are they in camps along the border? Why are the Uighur? Did the "brown people" break any laws? Did the Uighurs?
Are the "brown people" in camps along the border a single, ethnic minority? Are all "brown people" in the country subject to arrest and under surveillance just for being "brown"?
> No one imprisoned in Guantanamo Bay is a US Citizen and neither is Assange.
That's a glib retort.
A takeaway from your position is that it's ok so long as you do it to citizens of other countries.
> it is not the same as ethnic cleansing.
See the above.
That's always been the difference between the US and China and why so many countries have hatred for us and yet little to none for China. They don't fuck with other countries on the level that we do.
Yea, but you live here and so you should think about the implications of this for yourself and your countrymen and not through the lens of international competition. That is a distraction.
Essentially, the China case proved Zoom is willing to cooperate with a nation state. The US is the nation state we live in, Zoom is HQ'd here. Therefore, the risk to us is high.
As an aside, the organ harvesting idea comes from Falun Gong, who are similar to Chinese Scientologists. It is not clear to me that their claims are accurate.
Sorry, but an executive is not just "an employee" and any alarms are rightfully justified. Took a little bit of cajoling in my company but we've successfully moved to self-hosted tools for the most part (Jitsi and Rocket.chat) with just a couple of projects with outside contractors using Slack.
It's weird that you describe the headline as "overwrought" and call the person an "employee" when the headline is more accurate than you.
This was an executive, not just an employee. That's a huge distinction, and I can't help but think you intentionally downgraded his position to cover up his behavior. "Just an employee." "Not a big deal."
But when you read the allegations, they seem like a very big deal that an executive was spying on users, giving their information to the Chinese government explicitly for oppressive purposes, including folks who are not in China, and went out of his way to personally censor non-Chinese groups meeting to discuss the Massacre-Which-Cannot-Be-Mentioned.
I would say the headline understates the gravity (it's very much a 'by-the-books' headline that you KNOW went through ten levels of Legal), and that your hand waving here feels much more dishonest than the headline.
Regardless of intent, it's undeniable that at some point there were insufficient controls to prevent this executive, or any executive in the future, from gaining this level of surveillance access.
And it's also undeniable that the consequences for Zoom (really, just needing to fire a few people, and not even the people who designed those controls if there were any) are so minimal that they have no incentive to strengthen those controls.
For some organizations (mine included) the benefits of Zoom outweigh the risks of Zoom having proven itself to not have those controls, namely the possibility of both political and corporate espionage. As with all things, YMMV.
It was an executive purposefully brought in for legal compliance with that country's requirements. That he was fired is a huge signal of how seriously aggressive Zoom is about protecting data, in that they would even be willing to go up against national governments. I feel like the firing is a huge part of the story.
There are remarkably few organisations I somewhat trust (even then on a sliding scale), but on that spectrum Zoom sits at the "wouldn't touch them with someone else's bargepole" end.
While Slack is down, let's remind ourselves that it is not the end of the world. To their ops team, good luck in sorting out the root cause(s), to mitigating their re-occurrence, and to emerging the other side a stronger team. You've got this.
I have always had this fantasy of wondering what happens when one of these major services goes down and never comes back online, i.e. in this outage Slack loses all the accounts, users, messages, etc.
How would people react? What would engineers do to recover? I always found that idea fascinating.
Imagine Google saying tomorrow that they lost all the accounts and emails. What kind of impact would that have on the world?
That scenario is what Disaster Recovery plans are for. Every large company I've worked for has had recovery plans in place, including scenarios as disturbing as "All data centers and offices explode simultaneously, and all staff who know how it all works are killed in the blasts."
You not only have backups in place, you have documentation in place, including a back-up vendor who has copies of the documentation and can staff up workers to get it up and running again without any help from existing staff.
And we tested those scenarios. I'm not sure which dry runs were less fun: when you got paged at 3 AM to go to the DR site and restore the entire infrastructure from scratch... or when you got paged at 3 AM and were instructed to stay home and not communicate with anyone for 24 hours to prove it could be done without you. (OK, so staying home was definitely more fun, but disturbing.)
This scenario isn't as far-fetched as people think. I was running a global deployment in 2012 when Hurricane Sandy hit the east coast. The entire eastern seaboard went offline and stayed off for several days. Some data centers were down for weeks. Our plan had covered that contingency, and we failed all of our US traffic over to the two west coast regions of Amazon. Our downtime on the east coast was around two minutes. Yet a sister company had only one data center, in downtown New York, and they were offline for weeks, scrambling to get a backup loaded and online.
I worked for a regional company in the oil and gas industry and the HQ and both datacenters were in the same earthquake zone. A twice per century earthquake had a real risk of taking down both DCs and the HQ. The plan would have been for every gas station in the vertical to switch to a contingency plan distributing critical emergency supplies and selling non-essential supplies using off-grid procedures.
Those are some really good thoughts on DR planning. I had never thought of DR being taken to such an extent.
How many companies really plan for an event where their entire infrastructure goes offline and their entire team gets killed? Do even companies like Google plan for this kind of event?
> Some of the temporary locations, such as the W Hotel, required significant upgrades to their network infrastructure, Klepper said. "We're running a Gigabit Ethernet now here in the W Hotel,'' Klepper said, with a network connected to four T1 (1.54M bit/sec) circuits. That network supports the code development for a Web-based interface to the company's systems, which Klepper called "critical" to Empire's efforts to serve its customers. Despite the lost time and the lost code in the collapse of the World Trade Center towers, Klepper said, "we're going to get this done by the end of the year."
> Shevin Conway, Empire's chief technology officer, said that while the company lost about "10 days' worth" of source code, the entire object-oriented executable code survived, as it had been electronically transferred to the Staten Island data center.
The two I've worked for that took it that far were a Federal bank, and an energy company. I have no idea how far Google or other large software companies take their plans.
But based on my experience, the initial recovery planning is the hard part. The documentation to tell a new team how to do it isn't so painful once the base plan exists, although you do need to think ahead to make sure somebody at your back-up vendor has an account with enough access to set up all the other accounts that will need to be created, including authorization to spend money to make it happen.
The last company I worked for where I was (de facto) in charge of IT (small company, so I wore lots of hats) could have recovered if both sites burnt down and I got hit by a bus, since I made sure that all code, data, and instructions to re-up everything existed off site, and that both of the most senior managers understood how to access everything and knew enough to hand it to a competent firm with a memory stick and a password.
In some ways losing your ERP and its backups would be harder to recover from than both sites burning down; insurance would at least cover the latter.
Yes, Google plans extensively and runs regular drills.
It's hearsay, but I was once told that achieving "black start" capability was a program that took many years and about a billion dollars. But they (probably) have it now.
"black start" for GCP would be something to see. Since the global root keys for Cloud KMS are kept on physical encrypted keys locked safes, accessible to only a few core personnel, that would be interesting, akin to a missile silo launch.
"Black start" is a term that refers to bringing up services when literally everything is down.
It's most often referred to in the electricity sector, where bringing power up after a major regional blackout (think 2003 NE blackout) is extremely nontrivial, since the normal steps to turn on a power plant usually requires power: for example, operating valves in a hydro plant or blowers in a coal/gas/oil plant, synchronizing your generation with grid frequency, having something to consume the power; even operating the relays and circuit breakers to connect to the grid may require grid power.
The idea here is presumably that Google services have so many mutual dependencies that if everything were to go down, restarting would be nontrivial because every service would be blocked on starting up due to some other service not being available.
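To illustrate why mutual dependencies make that hard, here's a toy dependency-ordered boot in Python (the service names and edges are invented; a real black-start plan exists precisely to break the kind of cycle shown below by hand):

    from graphlib import TopologicalSorter, CycleError   # Python 3.9+

    # Hypothetical boot-time dependencies: service -> what it needs before it can start.
    deps = {
        "dns": set(),
        "auth": {"dns", "storage"},      # auth keeps its keys in storage...
        "storage": {"dns", "auth"},      # ...and storage needs auth for its ACLs.
        "frontend": {"auth", "storage"},
    }

    try:
        print("boot order:", list(TopologicalSorter(deps).static_order()))
    except CycleError as err:
        # This is the black-start problem in miniature: something has to be
        # bootstrapped from outside the graph (e.g. keys pulled from a safe).
        print("circular dependency, manual bootstrap required:", err.args[1])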
I work for a bank. We have to do a full DR test for our regulator every six months. That means failing all real production systems and running customer workloads in DR, for realsies, twice a year. We also have to do periodic financial stress tests - things like "$OTHER_BANK collapsed. What do you do?" - and be able to demonstrate what we'll do if our vendors choose to sever links with us or go out of business.
It's pretty much part of the basic day-to-day life in some industries.
The company I work for plans for that and it's definitely not FAANG. In fact, DR planning and testing is far more important than stuff like continuous integration, build pipelines, etc.
> Every large company I've worked for has had recovery plans in place, including scenarios as disturbing as "All data centers and offices explode simultaneously, and all staff who know how it all works are killed in the blasts."
I sat in on a DR test where the moment one of the Auckland based ops team tried asking the Wellington lead, the boss stepped in and said "Wellington has been levelled by an earthquake. Everyone is dead or trying to get back to their family. They will not be helping you during the exercise."
Thanks for sharing, for some reason I think about this story a lot. It must have been such an emotionally difficult time for everyone involved in piecing back together their processes.
>Thanks for sharing, for some reason I think about this story a lot. It must have been such an emotionally difficult time for everyone involved in piecing back together their processes.
I was there as a consultant and didn't know anyone there when I went.
I won't provide any details out of respect for those fine people, but the grief was so thick, you could have cut it with a knife. As I said, I didn't know anyone who was there (or wasn't there) but after a day, I wanted to cry.
My tangential thought in that regard is: what if this is a really bad outage that causes Slack to tank (i.e. a large number of companies switch to Microsoft, Zulip, etc.)? Equally interesting a thought.
In 2011 a small amount (0.02%) of Gmail users had all their emails deleted due to a bug: https://gmail.googleblog.com/2011/02/gmail-back-soon-for-eve... They ended up having to restore them from tape backup, which took several days. Affected users also had all their incoming mail bounce for 20 hours.
Google would be catastrophic because so much is stored there.
Slack is mostly real time communication, at least for me. There are a few bits and bobs that really should be documented that are in the messages though.
Yeah, Google would easily top the list of companies which can have catastrophic impact. Microsoft, Apple, Salesforce, Dropbox would be the next in the list I guess if we leave out the utility companies and internet providers etc.
Just look at the impact a 40 minute outage of Google Auth had last month, I wouldn't be surprised if the global productivity hit during that outage was in the billions of dollars, and that was for a relatively short outage without any data loss.
AWS outages have basically crippled a few businesses. The longest I know of was 8-10 hours the day before Thanksgiving. Some Bay Area food company got hit by it and couldn’t deliver thanksgiving dinners.
Being in DR, I live my life wondering about that too. I spend a lot of extra time checking accounts and making sure that I print out important data (yes, sneakernet) as well as keep manual copies of passwords. It's old school, but it removes the risk to my business in case of a total loss of a global service, and it lowers the risk of a heart attack and related stress.
The rest of the world may not be so energetic about their accounts and data, so it would be painful for many; it depends on how much risk they are willing to accept.
Working in DR, I find it is very difficult for businesses to allocate the time and resources for good planning; for many, DR is an insurance policy. Engineering and development staff are focused on putting out fires. However, a real disaster is more than most companies can handle if they have not planned accordingly or practiced by testing failover/normalization processes and performing component-level testing.
This should actually be part of your Disaster Recovery plan. You should have at least some plan for the loss of all of your service providers. Even if that plan is to sit in the corner and cry (j/k).
We might start to see actual legislation around implied SLAs in the US which would cause Google to rethink everyone's 20% project being rolled out for 2 years.
Services like Slack are replaceable to a large extent. How does one even replace a service like Google easily? There are like-for-like services available for Google, but the data is where it becomes tricky. Almost 1bn people losing their email addresses could cause massive issues.
These events seem to be happening almost on a monthly basis now. IRC was never this unreliable and at least with netsplits it was obvious what had happened because you'd see the clients disconnect.
IME messages just fail to send with Slack; then you can retry, but retries aren't properly idempotent and you end up sending the message twice.
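The usual client-side fix is to attach an idempotency key to the message and reuse it on every retry so the server can dedupe; a rough sketch (the endpoint and field names here are made up, not Slack's actual API):

    import uuid
    import requests

    def send_message(channel, text, retries=3):
        # One key per logical message; retries reuse it so the server can dedupe.
        key = str(uuid.uuid4())
        payload = {"channel": channel, "text": text, "client_msg_id": key}
        for _ in range(retries):
            try:
                resp = requests.post("https://chat.example.com/api/send",   # hypothetical
                                     json=payload, timeout=5)
                if resp.ok:
                    return
            except requests.RequestException:
                pass   # retry with the SAME key, never a fresh one
        raise RuntimeError("send unconfirmed; retry later with the same key")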
It's especially strange when you think about how unoriginal Slack's product domain is, and how comparable, and in some cases small, their userbase is by comparison:
* iMessage, which likely handles something in the range of 750M-1B monthly actives.
* WhatsApp, 2B users [1], though no clarity on "active" users.
* Telegram, 400M monthly actives [2]
* Discord, 100M monthly actives [3]
* Slack, 12M daily actives [4]
* Teams, which is certainly more popular than Slack, but I shudder to list it because its stability may actually be worse.
The old piece of wisdom that "real-time chat is hard" is something I've always taken at face-value as being true, because it is hard, but some of the most stable, highest scale services I've ever interfaced with are chat services. iMessage NEVER goes down. I have to conclude that Slack's unacceptable instability, even relative to more static services like Jira, is less the product of the difficulty of their product domain, and moreso something far deeper and more unfixable.
I would not assume that this will improve after they are fully integrated with Salesforce. If your company is on Slack, it's time to investigate an alternative, and I'm fearful of the fact that there are very few strong ones in the enterprise world.
I didn't realize that Discord has way more active users than Slack. I'm glad; Discord is a fantastic service in my experience. It's a shame they got shoehorned into a mostly gaming-oriented service. I've never had a class or worked somewhere where Discord was a considered solution instead of Slack, but I can't think of anything that Slack does better (in my experience). In general, I think Discord has the best audio and video service that I've used; it kicks Zoom to the curb.
Discord is definitely in the same realm of scale as Slack, and probably bigger (they publish different metrics, so it's hard to say for sure).
The really impressive thing about Discord's scale is the size of their subscriber pools in the pub-sub model. Discord is slightly different from Slack in the sense that every User on a Server receives every message from every Channel; you don't opt in to Channels as in Slack, and you can't opt out (some channels can be restricted to only certain roles within the Server, but these are the minority of Channels).
Some of the largest Discord servers have over 1 million ONLINE users actively receiving messages; this is mostly the official servers for major games, like Fortnite, Minecraft, and League of Legends.
In other words, while the MAU/DAU counts may be within the same order of magnitude, Discord's DAUs are more concentrated in larger servers, and also tend to be members of more servers than the average Slack DAU. It's a far harder problem.
The chat rooms are oftentimes unusable, but most of these users only lurk. Nonetheless, think about that scale for a second; when a user sends a message, it is delivered (very quickly!) to a million people. That's insane. Then combine that with insanely good, low latency audio, and best-in-class stability; Discord is a very impressive product, possibly one of the most impressive, and does not get nearly enough credit for what they've accomplished.
For comparison; a "Team" in Microsoft Teams (roughly equivalent to a Discord Server or Slack Workspace) is still limited to 5,000 people.
I really agree Discord is amazing and wish I could use it for work instead of Slack.
I think the big things that prevent it from being adopted more for professional use are the lack of a threading model (even though I hate it when people use threads in Slack) and the everyone-in-every-channel model, with only role-based privacy settings as the escape hatch. The second one especially is a big deal, because you can't do things like team-only channels without a prohibitive amount of overhead.
That said (with zero knowledge of their architecture), I have to feel like both of those missing features aren't too terribly hard to build. It's very likely Discord is growing fast enough as a business in the gaming and community spaces that they don't feel the added overhead of expanding into enterprise (read: support, SLAs, SOC, etc.) makes sense, and they are waiting until they need a boost to play that card.
> I think the big things that prevent it from being adopted more for professional use is the lack of a threading model
They do have a threading model now (if you are talking about replying to a message in a channel and having your reply clearly show what you are responding to). If you are talking about 1-on-1 chats with other people in your same server, then yes, that is still lacking IMHO in Discord. The whole "you have to be friends" requirement to start a chat (or maybe that's just for an on-the-fly group) is annoying.
Discord gives every user an identity that is persistent beyond the server; you have a Discord account, not a server account. Slack does the opposite. Enterprises would hate Discord's model, as they prefer to control the entire identity of every user in their systems, such that when they leave the company they can destroy any notion of that identity ever existing.
Absolutely agree. I like the 1 main discord account but I wish I could have 1 "identity" per-server as well. I don't love that I am in some discords that I don't want tied to my real name and others where I've known these people for over a decade and would see in person multiple times a week (before the pandemic). I know you can set your name per-server but you can't hide your discord username (or make it per-server) which sucks.
Agreed completely. Discord has always been much smoother for me than Slack, and the voice/video chat quality is literally the best I've ever seen anywhere.
If they made their branding a bit more professional and changed the permission model from the (accurate) garbage you described to something closer to Slack then I think Slack would be doomed.
>I didn't realize that Discord has way more active users than Slack
Keep in mind you're comparing daily active users vs monthly active users. I'd guess most slack users are online weekday for pretty much the entire day (because it's for work and your boss expects you to be online), whereas a good chunk of discord users are only logging in a few hours a week when they're gaming.
* Minecraft official server: 190k online users
* Fortnite official server: 180k online users
* Valorant official server: 170k online users
* Jet's Dream World (community): 130k online users
* CallMeCarson server (YouTuber): 100k online users
* Call of Duty official server: 90k online users
* Rust (the game) official Discord: 80k online users
* League of Legends official server: 60k online users
* Among Us official server: 50k online users
Their scale is insane. Even with their usage spiking during after-hours gaming in major countries, their baseline usage at every hour of the day, globally, makes it one of the most used web services ever created.
Slack's DAU and MAU numbers are probably pretty close to one another. Discord's MAU/DAU ratio is probably bigger than Slack's. That just means that Discord is, again, solving a harder problem; they have much bigger (and more unpredictable) spikes in usage than Slack. Yet it's a far more stable and pleasant product.
Well for the real time side, I can't tell you how big a boon it's been to build our platform on top of Elixir/BEAM. Hands down the best runtime / VM for the job - and a big big secret to our success. Where we couldn't get BEAM fast enough - we lean on rust and embed it into the VM via NIFs.
2021 is the year of rust - with the async ecosystem continuing to mature (tokio 1.0 release) we will be investing heavily in moving a lot of our workloads from Python to Rust - and using Rust in more places, for example, as backend data services that sit in front of our DBs. We have already piloted this last year for our messages data store and have implemented such things as concurrency throttles and query coalescing to keep the upstream data layer stable. It has helped tremendously but we still have a lot of work to do!
To help scale those super large servers, in 2020 we invested heavily in making sure our distributed system can handle the load.
Did you know that all those mega servers you listed run within our distribution on the same hardware and clusters as every other Discord server, with no special tenancy within our distribution? The largest servers are scheduled amongst the smallest servers and don't get any special treatment. As a server grows, it of course is able to consume a larger share of resources within our distribution, and it automatically transitions to a mode built for large servers (we call this "relays" internally).
At any hour, over a hundred million BEAM processes are concurrently scheduled within our distributed system, each with specific jobs within their respective clusters. A process may run your presence, websocket connection, session on Discord, voice chat server, Go Live stream, your 1:1/group DM call, etc. We schedule/reschedule/terminate processes at a rate of a few hundred thousand per minute.
We are able to scale by adding more nodes to each cluster, and processes are live-migrated to the new nodes. This is an operation we perform regularly, and it is actually how we deploy updates to our real-time system.
I was responsible for building and architecting much of these systems. It's been super cool to work on - and - it's cool to see people acknowledge the scale we now run at! Thank you!! It's been a wild ride haha.
As for scale, our last public number perhaps comparable to Slack is ~650 billion messages sent in 2020, and a few trillion minutes of voice/video chat activity. However, given the crazy growth that happened last year due to COVID, daily message send volumes are now well over the 2 billion/day average.
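Not Discord's actual code, obviously, but the "query coalescing" idea mentioned above is easy to sketch: concurrent requests for the same hot key share a single in-flight backend call. A toy Python/asyncio version, purely illustrative:

    import asyncio

    class Coalescer:
        """Single-flight cache: concurrent callers for one key share one fetch."""
        def __init__(self, fetch):
            self._fetch = fetch
            self._inflight = {}

        async def get(self, key):
            task = self._inflight.get(key)
            if task is None:
                task = asyncio.create_task(self._fetch(key))
                self._inflight[key] = task
                task.add_done_callback(lambda _t: self._inflight.pop(key, None))
            return await task

    call_count = {"n": 0}

    async def fetch_from_db(key):
        call_count["n"] += 1
        await asyncio.sleep(0.1)              # stand-in for an expensive query
        return f"row-for-{key}"

    async def main():
        coalescer = Coalescer(fetch_from_db)
        rows = await asyncio.gather(*(coalescer.get("hot-key") for _ in range(1000)))
        print(f"{len(rows)} callers, {call_count['n']} backend call(s)")   # 1000 callers, 1 call

    asyncio.run(main())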
Just anecdotal, but as someone who has used Teams continuously for 1.5 years, I can say that it has never been down for me.
That being said, individual instances of the app are notoriously unstable causing random annoyances. But, I am on a very early build of Teams, which is buggy by definition.
Slack and the others have different contractual guarantees and different regulatory environments. Comparing them is not really fair because the reality is that these other services probably just lose tons of messages and slack/teams can't do that! They have to have better guarantees.
That's kind of the definition of a service being up. :) I've experienced numerous "soft" outages which result in messages not sending and getting lost - and even more double sends, sometimes very distant from where the message was originally sent.
It isn't just # of users, though - SlackOps is probably unique to Slack in that list (minus Teams, I guess) - so # of messages per month is a better metric. Not that I'm letting Slack off the hook, it still may be that their codebase and/or dev process is just nasty.
I'm the opposite. Back in my early teens, friends and I would attempt to hijack opposing groups' channels via takeovers during net-splits (and of course having the same done to us). What a time to be alive.
In the early battle.net days competing clans would split and steal channels. It was tons of fun, and it taught me lots about bots, proxies, and simple scripting in the process too.
I do miss them, terribly. Lightweight, fast, brutally simple. Even with splits, it was better, and ever since IRC bouncers like ZNC have existed, it has been rock solid.
I'm sure you know this already, but that status page isn't worth the cycles on your CPU; you would be better served asking the toaster whether AWS is functioning properly than checking that status page.
Our prod systems seem to be working, but our lower environments don't seem to be. I don't know enough about where these things come from. I wonder if the real problem is regional. Some connections work and some don't.
I never knew this, but I think it makes sense. Is there any documentation that explains why this is the case? I suspect it is to distribute bias to the first option, but I'd love to read about it.
I'm still dreaming of a world where everyone uses IRC through an interface identical to Slack or Discord or whatever, and features like these are implemented.
I agree in principle, but IRC is a poor way to do this. I love IRC for its simplicity, but that makes it hard to do more advanced features. It's a text-only protocol (other than DCC), so if you want to do something like allow users to click phone numbers to dial them, you have to regex it and hope for the best. Any kind of link is the same way. If you want to show images inline, you'll have to search for links, then either do another regex to see if the link is an image or prefetch the page to see if it's an image. Most servers still implement user authentication as a secondary service (i.e. it isn't part of the IRC server itself) afaik. I think the newer IRC specs include those, but support for them is missing in many servers.
Really a huge part of IRC's difficulty and beauty is in not having a markup language, but most of that beauty is for the eyes of the developer, not the user.
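For a sense of what "regex it and hope for the best" looks like in practice, here's a toy Python version of the kind of heuristics a fancier IRC client ends up carrying around (the patterns are deliberately naive):

    import re

    IMAGE_RE = re.compile(r"https?://\S+\.(?:png|jpe?g|gif|webp)(?:\?\S*)?", re.IGNORECASE)
    PHONE_RE = re.compile(r"\+?\d[\d\s().-]{6,}\d")

    def decorate(line):
        """Best-effort guess at inline-able images and clickable numbers in a plain IRC line."""
        return {
            "images": IMAGE_RE.findall(line),
            "phones": PHONE_RE.findall(line),
        }

    print(decorate("see https://example.com/cat.png or ring +1 (555) 123-4567"))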
I like the concept of Matrix. That's kind of what they're trying to do by creating an open protocol, but when I looked at implementing a client it was non-trivial. For IRC, you can usually send someone a telnet log of you joining an IRC server and they could implement a client. I don't get the impression that that's true for Matrix.
https://news.ycombinator.com/item?id=20948530 is my attempt to demonstrate that implementing a Matrix client is almost as trivial as telnetting to port 6667 on an IRC server, fwiw :)
You might like IRCCloud; it's a web client (similar to Slack) and bouncer with support for image uploads; it has a decent app, preserves history, and I think it supports search too.
Not really a fan of the Slack or Discord user interface myself, but there are modern looking web clients for IRC such as thelounge[0] or kiwiirc[1] that might be what you are after.
Several IRC servers do have support for authentication and access control (and audit trails as well I suppose).
Only centralized history/logging and search would need to be bolted on if needed.
In the non-centralized case your IRC client takes care of all of that.
For business users, there are regulatory requirements. You need to keep information around for some period of time, but not forever. History and searching is useful for spreading tribal knowledge throughout an organization.
Does that actually extend to Slack/slack-like things though?
Since I would see Slack more of a replacement for phone calls or hallway discussions.
Neither of which typically has any logs or recordings (and I wouldn't want to work somewhere that did keep such logs).
In what areas would you find such requirements? And shouldn't the default position be that it is illegal to keep those logs? Especially those involving direct messages between employees.
Our company uses Cliq. I wouldn't say that it's as good as Slack, but it's probably 80-90%, and even has a few unique features (integration into Zoho's suite, remote work checkin, integrated bot development environment, etc)
I find it amazing that we can be about an hour and a half into a service being completely unusable (i.e. Slack telling me it 'cannot connect'), yet it's still marked as an 'incident' instead of an 'outage' on their own status page.
Every time this kind of thing happens, HNers love to gripe about how the status pages aren't correct yet. It's so weird -- like the people freaking out about the outage are going to be updating their uptime trackers right now or something. Who cares? It'll be fixed later.
I think the point is that a "Status Page" should show the accurate, current status of the system. Not a place holder for "we'll fix it later". People look at a status page to know what's happening now.
This is entirely in line with my experience dealing with outages. 85% of the time to fix consists of fielding requests for status updates.
It's like when people push the elevator button repeatedly if it's taking a while to arrive, only pushing the elevator button doesn't cause it to take even longer.
It doesn't. The status page is currently showing information about the outage. And the 100% uptime number is probably still correct, since it's only been out for a couple of hours.
> And the 100% uptime number is probably still correct, since it's only been out for a couple of hours.
It's listed as "Uptime for the current quarter"; if they mean that as "calendar quarter", i.e. since the start of the year, then we aren't even 100 hours into the quarter so we should be well below 100% by now.
You might be correct, but why would anyone care about quarter-to-date as opposed to a rolling quarter ending now? The latter would mean that an outage of X duration will always reduce this statistic by the same amount regardless of how close the nearest calendar quarter boundary is, which seems like a superior quality for such a statistic to have.
That would be a completely fair metric to publish, but it doesn't look like what Slack is publishing. Of course, it's possible that it is and it's just phrased somewhat poorly.
Interestingly their uptime for the quarter is still 100% despite a full-red dashboard. I wonder if that's something that is calculated only after an outage is resolved
Building out the infrastructure to automatically give real-time updates to your uptime figure sounds like a terrible use of company resources. Who knows how many person hours to spend on implementing and maintaining a feature that would remove maybe a few minutes of manual work from the incident post-mortem checklist, just for the sake of delighting people who need something else to look at for a workplace distraction now that Slack is down.
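For what it's worth, neither version of the number is hard to produce; a throwaway calculation (the two-hour outage figure is purely illustrative):

    from datetime import datetime, timedelta

    now = datetime(2021, 1, 4, 14, 0)        # afternoon of the outage
    quarter_start = datetime(2021, 1, 1)
    outage = timedelta(hours=2)              # illustrative, not a real figure

    qtd = 100 * (1 - outage / (now - quarter_start))       # quarter-to-date
    rolling = 100 * (1 - outage / timedelta(days=90))      # rolling quarter

    print(f"quarter-to-date: {qtd:.2f}%")      # ~97.7%: small denominator, big hit
    print(f"rolling 90 days: {rolling:.2f}%")  # ~99.91%: same outage, barely visible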
Do you happen to know of any desktop clients that support encryption/cross-signing?
I'd like to get off of Element desktop/web for a couple of reasons, but I need those features. I'd help implement them myself, but that's beyond my skill level.
Edit: For anyone else wondering, matrix-commander [0] looks like it may be workable if a cli tool is acceptable for your usecase.
I'm a Carl. I'm also looking for a coworker who was trying to contact me. If it's about last saturday, I promise nothing really happened between me and her, but I'm sure she already told you.
I have tried to sell my organisation on a shared Google Chat doc for 90s style realtime ICQ chat in times like these, but there has been little uptake.
G Suite actually has an entire Slack clone, chat.google.com. I've been on G Suite (now annoyingly renamed to Google Workspace) for years and actually just recently found it existed from another comment on HN.
Yeah, this is what we actually use as a fallback, and I did push for this as a full-time alternative given we'd get it for free, but people dislike it for all sorts of frivolous reasons.
I say this every time Slack is down, but they just seem so shady to me. Nobody can connect right now, and their status site says "100% uptime in the last quarter". Maybe it's close to 100%, but it ain't 100%.
I think we should push for a metric where "up" means 100% of people that want to use the service are able to use the service. If 1% of users can't send messages, then that should count as a full-blown outage and should start counting against whatever SLA they advertise.
The underlying problem here is that apparently everyone lies about uptime, so if you don't, that looks bad to potential customers. I fear that we will have to push for some legal regulation if we want accurate data, and ... people will probably be opposed to that.
Seems silly to worry about quarterly stats several hours into an outage. The most obvious explanation is quarterly stats aren't generated in real-time -- which isn't "shady" to me.
> I think we should push for a metric where "up" means 100% of people that want to use the service are able to use the service.
I mean, that’s nice to say, but how do you measure/prove it?
Certainly, having the SLAed party check themselves is silly. But what are the other options? If it was up to the customer, customers could make up faults to get free service. (Since it’d be up to the customer to prove, and customers are generally less technical than vendors, you’d have to expect/accept very non-technical — and thus non-evidentiary! — forms of “proof”, e.g. “I dunno, we weren’t able to reach it today.” Things that could have just as well been their own ISP, or even operator error on their side.)
IMHO, contractual SLAs should be based on the checks of some agreed-upon neutral-third-party auditor (e.g. any of the many status/uptime monitoring services.) If the third party says the service is up, it’s up in SLA terms; if the third party says the service is down, it’s down in SLA terms.
(And, of course, if the third party themselves go down, or experience connectivity issues that cause them to see false correlated failures among many services, that should be explicitly written into the SLA as a condition where the customer isn’t going to get a remedial award against the SLA, even if the SLAed service does go down during that time. If the Internet backbone falls over, that’s the equivalent of what insurance providers call an “act of God.”)
But in a neutral-third-party observer setup, you aren’t going to get 100% coverage for customer-seen problems. An uptime service isn’t going to see the service the way every single customer does. Only the way one particular customer would. So it’s not going to notice these spurious some-customers-see-it-some-don’t faults.
So, again: what kind of input would feed this hypothetical “100% of customers are being served successfully” metric?
ETA: maybe you could get closer to this ideal by ensuring that the monitoring service 1. is effectively running a full integration test suite, not just hitting trivial APIs; and 2. if gradual-rollout experiments ala “hash the user’s ID to land them in an experiment hash-ring position, and assign feature flags to sections of the hash ring” are in use by the SLAed service, then the monitoring service should be given N different “probe users” that together cover the complete hash-ring of possible generated-feature-flag combinations. Or given special keys that get randomly assigned a different combination of feature-flags every time they’re used.
The idea is to define availability as "the probability that the site 'appeared' to be down for a random user, averaged over a time window of size w". You can choose a particular value of w and look at trends over time, or you can plot availability as a function of w to understand patterns of downtime.
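One naive way to turn that definition into code (assuming you already have per-user probe or request-log results; window handling kept deliberately crude, and "appeared up" here means at least one success per window):

    from collections import defaultdict

    # (time_bucket, user, request_succeeded) rows from logs or synthetic probes.
    observations = [
        (0, "alice", True), (0, "bob", False),
        (1, "alice", True), (1, "bob", True),
    ]

    def availability(observations, window):
        """Fraction of (window, user) pairs in which the user saw at least one success."""
        saw_up = defaultdict(bool)
        for t, user, ok in observations:
            key = (t // window, user)
            saw_up[key] = saw_up[key] or ok
        return sum(saw_up.values()) / len(saw_up)

    print(availability(observations, window=1))   # 0.75: bob saw it "down" in one window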
They should at least update the status site to reflect issues currently happening.
I was wondering why a link from a Jira ticket wasn't opening in Slack; the page eventually timed out and gave me a link to status.slack.com, where it told me everything was peachy. Cue me wasting time trying it again, because apparently there was no issue with Slack...
Some companies do this, though probably not publishing data. Any customer downtime is treated the same - for one, for many, for all (in theory, ha ha). But they take it pretty seriously.
You'll just end up with no SLA, or paying a hefty amount to use services, because that's an impossible standard to support for any service of this size.
Isn't this the problem? Companies like Slack set SLAs that they only meet by lying about their uptime. It's as good as having no SLA, except you're likely paying a premium based on the SLA they set.
I'm not demanding 100% uptime, I'm asking that they say "99.94% uptime" when there has been an outage.
Honestly, I could live with a 99.50% SLA, if that's what it really was. After today's probably full-day outage, they'd just have to be extra careful for the rest of the year (or pay me money). Kind of sucks when it's January 4th that you blow your year's SLA budget, though.
If you're asking genuinely then I can tell you my experience when I was part of a SaaS shop, though the times have changed a lot and "my metric is not necessarily your metric".
But it was roughly "one large impact a month, for six months", with large caveats that upper management for whatever company had to be working with the product during that month.
Large companies don't care if X service went out during the night and impacted someone not in their timezone.
If the CTO notices that he can't use something with the same regularity that he gets paid, then it doesn't take long for it to stick in their mind. But migrating everything is _so painful_ that the majority of large companies will do anything they can to avoid moving away.
This is a key point in the popularity amongst VCs of investing in B2B SaaS. I take their (and your) word for it. But honestly, I don't actually understand this.
Medium sized team on Slack. We'd need to move ~60 full time in-house employees, ~10 remote contractors who aren't on other comms channels, ~20 infrequent freelance contributors who may not check messages often, ~5 custom bots and apps, and ~15 3rd party integrations (of which some won't support any given choice of alternative).
This is not to mention the fact that half our staff aren't hugely technical, so they have actively _learnt_ how to use Slack and its features around notification control (things that may come "naturally" to the tech-savvy crowd on HN), @-things, bots, etc., and they would need to re-learn a new tool that is going to work in a different way.
This would be a substantial effort for us, and we're a small company. Are there ways to materially minimise this cost?
Training, integration with proprietary internal systems, sheer momentum in the employee base, justifying or even creating a metric to show cost savings of a migration effort, business processes that rely on a specific feature of existing infrastructure needing to be met, the uncertainty of new vs the certain and known instability of something you have....
If you had a small shop with a dozen tech-savvy people and Slack became a problem which was used exclusively for quick business chats, you could probably push a change to another chat platform the next day. You might struggle when you have thousands of employees, some that needed training to use Slack and still aren't that proficient.
Getting workflows re-established, any integrations you had developed or otherwise come to depend on may not work, you will probably lose history, etc.
Plus, it will just take a long time to get everyone on board and using the replacement system. My department is slowly plodding towards using Teams over Slack, but there are enough hold-outs (my sub-department being one of them) that it still doesn't have wide-spread adoption.
Many reasons, almost none of them technical. Off the top of my head, a few:
* Getting out of the Enterprise Contract, or waiting for the year to end.
* Training people on new software.
* Loss of productivity. (1) Learning a new UI, processes, workflows -- both individually and organizationally. A feature or concept in "Tool A" may exist in a completely different form in "Tool B". Or not exist, and then people need to adapt to and work around the missing feature. (2) Missing out on needed information due to the above. Ultimately, software exists to move and transform data, and when you change the software people have to adjust. Sometimes that doesn't go great. "Oh, I didn't realize I needed to check this checkbox".
Another way to say this is "organizational inertia", which is a fancy term that means "it's hard for people to adjust to change".
And you might think developers and other technical people would have an easier time of it. They (we) do, but not to the extent you may expect. I've been on the front lines of a handful of migrations that affected only the IT staff, and it was a long and arduous process each time.
Man it bothers me so much when applications change their UIs on updates for no apparent reason other than "it looks better".
IntelliJ changed the way the build and debug buttons looked in some update, and it took me days to get used to it before I could find them in a snap again. Slack did a couple of no-reason changes as well.
There are plenty of UX reasons (learning new interfaces, etc). The burden here is generally distributed and diffuse.
The really big one, for companies of a certain size / cash flow, is compliance. Companies spend a lot of time developing compliant work flows around a service like Slack.
Migrating to another service requires rewriting the compliance narrative. The current compliance people might not have the confidence or willpower to do that effectively, and can raise legal objections to any such migration indefinitely.
IRC is easy to migrate from since there is nothing to migrate other than chat history. IRC is also missing many features that Slack provides out of the box. And a law like that would not work, since you would need to write complicated transformation scripts to translate between services, and not all services have a 1-1 mapping. I like IRC, but it has its limitations. That is why Slack succeeded where IRC did not.
I have less of an excuse not to be more personally productive, but I can't help anyone else (easily) if my primary method of communication is down. Not only because it's harder to contact you, but also because it's impossible for you to just ask in a channel and have me notice you.
There's also this perverse incentive to Slack all the things. Lots of CI notifications are sent through it. Some org processes are implemented as workflows. There's been talk of how wonderful it would be to hook up tasking and work tracking to slash commands. I and others often use Slack instead of the 'official' tool to video call each other.
An outage like this is still really disruptive. It's not like everyone realizes what's going on immediately or at the same time; we have backup tools, but our turn radius is pretty wide. Some of us can't even communicate effectively without memes, too, and backup tools don't have a giphy integration.
EDIT: Do your CI integrations fail if Slack can't be contacted? Do those failures fail your pipeline? Whoops!
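If your pipeline does post to Slack, the notification step at least ought to be best-effort; a minimal sketch (the env var name is just a convention I'm assuming here):

    import os
    import requests

    def notify_chat(text):
        """Best-effort build notification; never fail the pipeline because chat is down."""
        url = os.environ.get("SLACK_WEBHOOK_URL")   # assumed convention
        if not url:
            return
        try:
            requests.post(url, json={"text": text}, timeout=5)
        except requests.RequestException as exc:
            print("chat notification skipped:", exc)   # log and move on

    notify_chat("build #1234 passed")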
Particularly on a Monday morning after a holiday, there are tasks that I know I need to be working on but cannot, because relevant details were never transposed from Slack to our actual work-scheduling tools like Google Docs, Jira, etc., and I cannot access Slack history.
If something went awry, and it caused more pain because Slack was down, how would you feel?
If you’re missing comms/observability then waiting to deploy seems prudent.
Not all - many workflows these days rely on Slack or its ilk. Benderbot, Jira/etc. connectors, calendar connectors, remote communication/standups, alerting…
If you use slack primarily as a water cooler then yes.
However, I drive everything through Slack: GitHub, Linear, calendars, Notion, support emails, etc. I have notifications turned off for every service we use except for Slack. This allows me to effectively ignore everything except Slack. These types of outages destroy that workflow for me.
Absolutely! Before the holiday shutdown, I Slacked myself a huge reminder list of things to jump on as soon as we started up again, so that I could hit the ground running in the new year. Oh, wait....
It is easier to cache stuff for users who are not logged in, as it is the same for everyone. And everyone is looking at Hacker News at the moment to see what is wrong with Slack, which is probably the cause of the slowness.
The point count for most articles is consistently lower on a view of the non-logged-in homepage. I assume that means they are cached more aggressively for non-logged-in. There's also the username and karma count in the top-right.
It has made coming back from a long Christmas vacation a lot easier. Once I got my emails taken care of, I was able to get to work without distractions. It's been nice.
Just a note, if your company uses G Suite, chat.google.com exists and is basically an entire Slack clone. We use it as a backup when Slack goes down (obviously doesn't help for bots and ChatOps we've set up, but works well for realtime work chat).
This is an excellent reminder of the danger of being locked into closed systems.
I wonder how many companies (like mine) have literally ground to a halt because of this? Do other companies have a risk-documented backup plan B for times like this? Presumably the default is for everyone to resort to email?
More worryingly is the number of ChatOps processes and alerting/observability systems that are in place around Slack.
Not being able to chat with co-workers for an hour or two is fine, but not being able to safely manage CI/CD/deployments is a big risk.
When application engineers say stuff like this, they're also implying that there's a giant infra/ops team who will be willing and able to do all the work for them. Nobody actually wants to be responsible for this stuff.
Not at all, I think closed private systems are far better (better products, support, service) but when an entire company runs its operations on a single system like Slack, there is a big risk when it goes away and you need contingency.
I’d still rather be on Slack and suffer a day of lost productivity than force people to use only email or IRC.
I agree 100%! Though I think it might be dangerous to prepare only for the "last disaster". It'll be some other system breaking next time, so instead we should identify which systems don't have any kind of redundancy and work out the blast radius if they crash.
I'm good at not panicking about things I can't change, but I worry about some of my colleagues who find it difficult not to have control in these situations.
I can't do anything to help them at the moment, so for now I'm heading to my couch with my analogue book :)
A lot of organizations essentially took the last two weeks off from work, which is long enough for a 10-day autoscale window to spin down servers, and then this morning they got confronted by a load spike that nothing was pre-warmed for.
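If that theory is right, one mitigation is to pre-warm capacity ahead of a predictable spike. A minimal sketch with boto3, assuming an EC2 Auto Scaling group; the group name, sizes, and timestamp below are made up for illustration:

    # Hypothetical pre-warm: raise the scaling floor before the first post-holiday
    # Monday so autoscaling isn't chasing the 9am spike from a wound-down baseline.
    import datetime
    import boto3

    autoscaling = boto3.client("autoscaling", region_name="us-east-1")

    autoscaling.put_scheduled_update_group_action(
        AutoScalingGroupName="web-frontend",        # hypothetical ASG name
        ScheduledActionName="post-holiday-prewarm",
        StartTime=datetime.datetime(2021, 1, 4, 12, 0,
                                    tzinfo=datetime.timezone.utc),
        MinSize=40,            # temporary floor ahead of the expected load
        DesiredCapacity=40,
    )
    # A second scheduled action later in the day can drop MinSize back down
    # once the normal scaling policies have caught up with real traffic.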
What does this mean? What do cloud providers do when customers scale down their services? Do the providers literally power down servers? Do they sell the capacity to new customers?
Relax. GP is clearly referring to an increase in people signing on due to the holidays ending and everyone coming back to work.
Also, Slack has significantly more users in the US than in any other country[1], and it really isn't even close. So the offense you're taking is unwarranted anyway.
Slack makes ~61% of its revenue from US customers, who sit in only 4 time zones, while the remainder of its revenue is spread across ~20 time zones. It's not an unreasonable hypothesis.
- Revenue is not the same as users. Slack has tons of free users, and some countries also have lower-priced plans.
- Many companies like Amazon are probably counted as US revenue for Slack even though more than 30% of their employees are outside the US. Not huge numbers, but significant.
Using our team's backup chatroom in a competing service. One of these days P2P Matrix will reach GA; then I plan to make a backup for my backups, Starfleet style.
GILORA: Starfleet code requires a second backup?
O'BRIEN: In case the first backup fails.
GILORA: What are the chances that both a primary system and its backup would fail at the same time?
O'BRIEN: It's very unlikely, but in a crunch I wouldn't like to be caught without a second backup.
Makes perfect sense for O'Brien; DS9 had serious backup issues in its first couple of years:
The Forsaken (season 1 episode 17)
LOJAL: I've been reading the reports of your Chief of Operations, Doctor. They gave me the impression that he was a competent engineer.
BASHIR: Chief O'Brien? One of the best in Starfleet.
LOJAL: Then why aren't the backup systems functioning?
BASHIR: Well, you know, out here on the edge of the frontier, it's one adventure after another. Why don't I escort you back to your quarters where I'm sure we can all wait this out.
Rivals (season 2 episode 11)
KIRA: My terminal just self-destructed.
DAX: What?
KIRA: I lost an evaluation report I've been working on for weeks.
DAX: Even the backups?
KIRA: Even the backups.
So there's a reason O'Brien wants a backup to the backup by Destiny (season 3, episode 15).
> Customers may have trouble connecting or using Slack
I can't stand how marketing speak pervades every sphere of the world. Their entire system is offline (inconvenient, certainly, but it happens) and they can't bring themselves to say "Slack is down. We're working on it and will be back ASAP," or something similar. Instead, we "may have trouble."
The funniest part to me is that their status page still says "Uptime for the current quarter: 100%". These uptime numbers are such BS. Heroku reports six 9s of uptime for this month, even though their own status page shows multiple days with incidents lasting >6 hours.
How do you know it's down completely? Maybe it's down for you and maybe even down for a majority but still up for some subset. Happens with many products.
It's not entirely offline though. I was connected via my phone ~90 minutes ago when I first got online today and never had any issues and was able to tell folks at work my PC connectivity may be spotty for a while. When I signed in via my Mac laptop I wasn't able to connect for about 20 minutes, and was redirected to the status page. I've been online for about an hour now.
Why do you consider that to be "marketing speak?" It appears to be concise, direct, and accurate. The phrase "Slack is down," even if true by some interpretations (it hasn't been "completely down" from what I have seen), is imprecise and informal.
There's a wide gulf between "some customers may have trouble using Slack" and "most/all customers are completely unable to use Slack". Putting aside formality, I'd say "Slack is down" is in fact more accurate here (assuming that it is true that most users can't use it, which is true for our company at least).
But 1) it has apparently not been the case that the service was "absolutely inaccessible" and 2) "Slack is down" is still very imprecise and not a great alternative even if the service had been "absolutely inaccessible."
As someone in marketing, it's a little bit of this, and a little bit of determining what the most default, catch-all statement could be well ahead of time to make "crisis comms" that much smoother.
I find it hilarious that the status page is still saying the uptime for the current quarter is 100%. I'd think it'd have lost at least one 9 by any obvious definition of "current quarter".
I'm still logged in on mobile and can communicate with people from my team, but cannot log in from desktop. With so few people able to connect, it's also unclear whether Slack is eating my messages or there's just no one to respond. So I'd certainly rank that as "trouble using slack" rather than "the system is completely down".
I agree with you in principle, but I have had no problem connecting to Slack today (I have a free one I use with friends, not a business account), so to say they are down would also be inaccurate.
This is probably for legal reasons, i.e. Service Level Agreements. "May" leaves the door open to other interpretations and reporting from other systems.
For the record, I am logged in and have exchanged messages with at least one other person. The rest of my team does seem to be unable to get in though. Maybe it's because I have just had the Slack tab left open in my browser since before I left for Christmas?
Chat infrastructure at this level of scale is not easy to build and maintain, I appreciate all the hard work that the engineers at Slack are putting in to resolve this.
My business coworkers are freaking out over Slack being down. But all my technical coworkers are nonplussed. It's interesting how those of us with a technical background are not too disturbed by things breaking.
I've never lived in the mid-west or the New England region of the USA. Maybe it's a regional usage (I've lived in Florida, Texas, California, Colorado, Utah, Oregon, and Washington). I'm not sure where I picked up my usage from. My dad is from Colorado and my mom from California. Maybe I picked it up from one of them ;-)
I'm "plussed," because an app that I manage uses slackclient, and some people depend on it to get paid. Obviously it's my fault for not handling the error, and I hotfixed it, but still, wah.
I'd be a little nervous if I'd recently bought Slack for $20B.
It's not like there aren't alternatives. You could even imagine someone has a live bridge between Mattermost and their Slack team, making the switchover seamless.
Why be nervous? Outages happen. If this were a string of major issues over a few weeks or months, that might be cause for concern, but a single incident is not.
Notion is sluggish as well. That, combined with reports of HN being slow, makes me wonder: is there some larger network issue at play affecting a whole region of servers?
My feeling is some common infrastructure is failing or flailing, like some part of AWS, or some backbone provider. Too many flaky things going on at the same time to be independent failures.
My company monitors EC2 performance and availability across North America, and EC2 has been fine this morning, according to our data (that said, they had some intermittent issues the last 3 days).
Maybe another internet routing issue, where a bunch of traffic is going through some guy's router in Albania. Or even someone actively interfering with a root server.
Does Google use Slack? Wanted to start my year with some extra-strength tinfoil, and it would be just great if, on the day a unionizing initiative started, the main way workers could talk about said initiative went down.
EDIT: according to a random quora post they do, so keep the tinfoil out!
People actually argue this? Slack is a great comms tool, and a great BACKUP if you can't find something in a real documentation/knowledge/etc. repository.
Where I work, yes. Small, scrappy devs who, when asked to move knowledge into Confluence or a wiki page, argue that it's too much work to find everything they need. They can just search the channel they want with a term and pull up the conversation they need.
My response is if it’s important long term, it needs to be somewhere visible and exportable should the platform change. As it is now, Slack exports are horrible and large.
Unfortunately, Down Detector doesn't actually monitor these services, so we don't know if they are truly down. Down detector relies on human behavior, and we all know humans don't act rationally.
Which is great for detecting common issues across many companies. For example, clicking on the cards shows that many of them are related to "network connection".
Funny how status.slack.com has reported Incidents and Outages for a while now, but still the "Uptime for the current quarter" is reported at 100% on the bottom right of the status table.
(And if you're saying that according to the legal blah blah blah of the SLA that this isn't technically "down", then there might as well not be an SLA.)
> And if you're saying that according to the legal blah blah blah of the SLA that this isn't technically "down", then there might as well not be an SLA.
I am, because I've had these exact conversations with cloud-hosted providers/products. Never once have we been refunded according to the SLA in our contracts. It's never really "down" (according to legal).
It may depend on how they define the "quarter". If they take the quarter as the last 91 days and round the number to the nearest percent, you won't see it change unless outages exceed 91 x 24 x 0.5% = 10.92 hours. It's just a guess, though.
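A quick back-of-the-envelope check of that rounding guess, assuming a 91-day quarter:

    hours_in_quarter = 91 * 24              # 2184 hours
    print(hours_in_quarter * 0.005)         # 10.92 -- downtime hidden by rounding

    # A 3-hour outage in the same window:
    uptime = 1 - 3 / hours_in_quarter
    print(round(uptime * 100))              # 100 -- still "100%" after rounding
    print(f"{uptime * 100:.3f}%")           # 99.863%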
Could this be some sort of data corruption? I find it hard to believe that Slack could be down for this long without something that is exceedingly hard to roll back. Even if some services are completely overwhelmed with traffic, they could block a percentage of traffic to decrease load, bring servers up across their datacenters, and then unblock traffic. To me it has the hallmarks of some datastore being down, but obviously that's just a random guess.
When it went down fully and I had the Windows client open, it went to a page that basically said "Slack is down, we don't know why, try restarting and see if that fixes it. Here's the status page."
It would be nice if they could fix it so that a fresh start also goes to that page, at the very least.
How do you have the Slack app installed? I currently have it installed via the Windows/Microsoft Store, and I suspect that is a significant part of the problem.
> Customers may have trouble loading channels or connecting to Slack at this time. Our team is investigating and we will follow up with more information as soon as we have it. We apologize for any disruption caused.
- Jan 4, 10:14 AM EST
The status for messaging and connection services has been marked as [incident]
> We're continuing to investigate connection issues for customers, and have upgraded the incident on our side to reflect an outage in service. All hands are on deck on our end to further investigate. We'll be back in a half hour to keep you posted.
- Jan 4, 11:20 AM EST
> There are no changes to report as of yet. We're still all hands on deck and continuing to dig in on our side. We'll continue to share updates every 30 minutes until the incident has been downgraded
HN becomes slow because people notice a service is down, and go to HN to check for more info. When Google was down for an hour a couple weeks ago, HN became almost unusable.
This is actually impressive, in a bad way. I've just become so used to being able to run highly resilient, cross-region infrastructure for millions of users with just a handful of people that I forget what real downtime looks like.
For their app to just go completely offline is unacceptable. Bugs and degraded services I get. But this is catastrophic.
>I can't even begin to guess what went wrong. What are your guesses? How many screaming executives are there at Slack saying "just roll it back"?
Doubtful it's a code issue causing a total system outage. I'm assuming they have a bunch of auto scaling infrastructure that wound down over the holidays and couldn't take the spike this morning.
Assuming this is a bad deployment, not hardware/network issues: it will be interesting to read their post-mortem on why a rollback still hasn't happened after 2 hours of outage. You would hope that a service of Slack's scale and popularity would plan for deployment-related outages and be able to roll back a deployment.
And there we have it: relying on big companies sucks. It's great as long as it works, but once a system breaks, thousands or even millions of businesses suffer. (Of course they are also beneficial, and a private server can crash at any time too; I don't want to blame Slack, but we always have to keep this in mind.)
If a big company has a million customers and the big company experiences one outage per quarter, then a million businesses suffer every quarter.
If a thousand small companies have a thousand customers each, and each of these small companies experiences one outage per quarter, then a million businesses still suffer every quarter.
As the end-user-business, is it better to suffer the outage at the same time as other businesses? Is it worse?
Surely there are valid arguments against relying on big companies, but I don't think this is one of them.
> If a thousand small companies have a thousand customers each, and each of these small companies experiences one outage per quarter, then a million businesses still suffer every quarter.
Not all companies are created the same. Microsoft, Google and Facebook have had their outages, but IME much fewer than Slack.
If there are a thousand small companies, none of them have a network effect, and those that experience more outages per quarter will lose customers to those that have fewer. So they have much more incentive to improve.
Whereas network-effect beneficiaries like Facebook (and to a lesser extent, Google, Microsoft and Slack) have much less of an incentive to improve. Who else would the customers go to?
Just a note to say "thanks" to the Slack team for the uptime when Slack is not down, it's been incredibly useful as a tool to me when other enterprise systems (Teams, Outlook & co.) have been down over the last couple of years, and especially throughout 2020.
Somehow Slack is very resilient in general. I also appreciate its UX/UI being far superior to Teams.
Ultimately, the cloud is often a single point of failure that companies become over-dependent on. So I'd favour a free (as in freedom) and open-source, self-hosted/deployed alternative if there were one (even if it came from Slack and was paid).
I agree with most on here that there isn't such a thing yet - but it's well worth building! So those of you out there who are considering implementing "yet another text editor", maybe this is something to work on.
So many large scale downtimes across multiple large companies in the past month or so. Is this for a bugfix deployment for the SolarWinds hack, or downtime caused by the hack itself ? Or some state-sponsored orgs installing upgraded eavesdropping stuff ?
I've had good experiences with ngircd. It's an IRC server that is very easy to self-host, and it can be installed via APT on any debian/ubuntu/raspbian etc system, and I'm sure on many others.
https://cabal.chat/ is a good program. It doesn't support all of Slack's features, but it is truly peer-to-peer, so there's no central point of failure or server that can go down. (Well, I suppose if they released a buggy version of the software and you updated, that's a central source, but that's true of most software.)
Todoist was having issues and iOS app launching from Xcode started taking a lot of time in the middle of the day (which reminds me of the app online check fiasco not so long ago).
If anyone is looking for an alternative way for fast and seamless chat with colleagues, friends, or strangers, you're welcome to check out Sqwok (https://sqwok.im)
Although it's built as a live news discussion site versus a team messaging app, the topics can be about anything, are public, and inviting others is as simple as sharing the url of the post (mobile/desktop web).
I have stopped using Down Detector as an accurate measure because a lot of "outages" are just people having issues with a service unrelated to the service they are reporting as down. Ex: AT&T outage in Nashville caused people to report Xbox Live as down, when it wasn't actually down, etc.
I'm having issues reaching a lot of sites, especially American ones.
Downdetector, Hacker News, and others load extremely slowly or not at all. Downdetector had a bunch of failed resources for me.
Slack has been failing - hard - the past few months. Yeah, I get it, lots of remote workers - but Slack has had months now to prepare for an onslaught given the trends with COVID. Simply not acceptable.
> We're continuing to investigate connection issues for customers, and have upgraded the incident on our side to reflect an outage in service. All hands are on deck on our end to further investigate. We'll be back in a half hour to keep you posted.
> Jan 4, 5:20 PM GMT+1
I'm positive that they have internal monitoring, and probably knew about the issues well before they decided to manually update their status page to reflect the issue. Manually updating the status page does not equal no monitoring, after all.
For a product that is so simple, there are no good self-hosted alternatives. Mattermost and RocketChat are written very poorly; reliability is bad, and getting your data out is practically impossible.
Slack goes down so often we're thinking of writing a very boring clone that uses ActiveMQ and MySQL, just because chat should be boring and needs to "just work".
I was just considering setting up a Mattermost instance for our company since I used it for a year at a previous job without any issues (I was just a user though, I didn't deploy or maintain it). Just curious, why do you think it's poorly written or unreliable?
We tried running it, so we have a lot of experience with it, and it wasn't great. It barely stayed online.
For something so simple, you have to run a massive server, with gigs of RAM and multiple cores, even under a very modest user load. Take a look at the codebase; it's a mess, and it's nearly impossible to fix any bugs. Finally, if you want to get your data out or report on message activity, good luck; you'd be better off passing paper notes around. The open-source version is nerfed a bit too, with no LDAP authentication for instance, which creates a lot of problems as well.
Seems that it is now. It was originally just Messaging and Connections that had an "incident", so I wonder if something else happened or they manually changed the status to at least own that all their services went FUBAR.
They always have been, since they clearly don't fit the guidelines for what makes a good submission and usually leave little room for interesting discussion (unlike postmortems of past outages, which are often good).
Agreed. When a major service goes down, HN is the most accurate overview, and often a useful sanity check when it's an AWS- or Slack-sized org, before I open an incident with whichever party.
HN is where we all go when the Internet (or large portions of it) are down. It's more reliable than all the 'downforeveryoneorjustme' or 'downtime monitor' services.
I am so glad that at least today I do not hear that annoying Slack sound.
I really do think Slack is not helping me, at all, to concentrate on my job (system administrator): synchronous messages are the worst thing to deal with while working; email is much, much better.
Honestly I'm fairly sure the vast majority of "technology" we've deployed, as an industry, in the past 10-15 years has actively made life worse. I don't know about anyone else, but that's the opposite of why I got into technology.
> which effectively IS email for many, many people
Doesn't have to be, though. With email, you don't even have to tie your address to a single provider, and reading previously received messages doesn't even require internet connectivity.
I'm able to connect to Slack at the moment. My company doesn't use it, but a hobby group I belong to uses it for discussion forums and their instance is up and functional. So it isn't down completely as I write this.
Nothing like a reminder of how dependent you've become on Slack for communication (and archival of conversations) like an outage on the Monday after the holidays when you're not on your A-Game yourself.
"Let's see, I'll look up so and so's name with Sla.... shoot"
"Okay, I'll just find that thing I .... nevermind"
> We’re still investigating the ongoing connectivity issues with Slack. There's no additional information to share just yet, but we’ll follow up in 30 minutes. Thanks for bearing with us.
I think this is due to AWS. Slack isn't the only thing down (Notion, for example). The AWS status page doesn't show anything yet, but it wouldn't be the first time; during the last Kinesis crisis, nothing showed up for hours.
Can't handle the post-holiday surge, or did people want to justify their long holiday and push something, only to watch their holiday optimism head-crash onto the surface of reality?
We have a Discord server as a backup for when Teams is down. Teams seems to have gotten worse lately, with entire days of downtime, and we have to resort to Discord voice, which always seems to be up.
If you're using G Suite already, it's a usable failover. I already send all my alert notifications there as a fallback. Dragging people in was trivial, and it's better than the group SMS that one person tried to use.
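A minimal sketch of that fallback wiring, assuming incoming-webhook URLs for both services (both Slack and Google Chat webhooks accept a simple {"text": ...} JSON payload); the URLs below are placeholders, not real endpoints:

    # Hypothetical dual-channel alerting: Slack first, Google Chat as fallback.
    import json
    import urllib.request

    SLACK_WEBHOOK = "https://hooks.slack.com/services/..."                    # placeholder
    GCHAT_WEBHOOK = "https://chat.googleapis.com/v1/spaces/.../messages?..."  # placeholder

    def _post(url: str, text: str) -> None:
        body = json.dumps({"text": text}).encode("utf-8")
        req = urllib.request.Request(
            url, data=body, headers={"Content-Type": "application/json"})
        urllib.request.urlopen(req, timeout=5)

    def alert(text: str) -> None:
        try:
            _post(SLACK_WEBHOOK, text)
        except Exception:
            _post(GCHAT_WEBHOOK, text)  # fall back when Slack is unreachable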
Subjective... I've found Slack and co's interspersed conversations far too chaotic, and temporal; threading is a great way of organising many different concurrent topics.
And to be clear, I don't mean Slack's implementation of threads, which hides them away in a separate panel and which doesn't get used by everyone either.
This is my SignalR alternative with end-to-end encryption. Choose a password, and the file and message will be encrypted client-side using that password.
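Conceptually that's password-based client-side encryption: derive a key from the password and encrypt with an AEAD cipher, so the server only ever sees ciphertext. A generic sketch of that scheme (not the linked project's actual code), using the Python cryptography package:

    # Generic password-based encryption sketch; parameters are illustrative.
    import os
    from cryptography.hazmat.primitives.hashes import SHA256
    from cryptography.hazmat.primitives.kdf.pbkdf2 import PBKDF2HMAC
    from cryptography.hazmat.primitives.ciphers.aead import AESGCM

    def encrypt(password: str, plaintext: bytes) -> bytes:
        salt, nonce = os.urandom(16), os.urandom(12)
        key = PBKDF2HMAC(algorithm=SHA256(), length=32, salt=salt,
                         iterations=600_000).derive(password.encode())
        # Prepend salt and nonce so the recipient can re-derive the key.
        return salt + nonce + AESGCM(key).encrypt(nonce, plaintext, None)

    def decrypt(password: str, blob: bytes) -> bytes:
        salt, nonce, ciphertext = blob[:16], blob[16:28], blob[28:]
        key = PBKDF2HMAC(algorithm=SHA256(), length=32, salt=salt,
                         iterations=600_000).derive(password.encode())
        return AESGCM(key).decrypt(nonce, ciphertext, None)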
It seems weird to say there are issues with connections but everything else is working fine. Is the API technically fine according to their system metrics, while no one can actually connect to use it, so it stays green? Keeping everything green doesn't help much in practice if connections are broken and the whole product is unusable.
Would be similar if auth were down: you can connect to us, you just can't authenticate, so you can't actually do anything.
Edit: Looks like they updated the status to properly show an across the board outage
Considering how ubiquitous slack use seems to be in a lot of major tech companies, I wonder if it's reasonable to ask whether or not the stock market's performance this morning is somehow correlated?
I personally have found that one of Discord's major shortcomings is the lack of support for threaded message chains. When you have 2+ parallel conversations going in a channel, it dramatically reduces your ability to communicate effectively.
It's not based only on that. Slack costs a lot of money, and moving off of it is something that has continually come up over the last year or two. We even had a RocketChat server up and running for a while.
My old company used a mix of Slack and RocketChat. Functionally, it's fine but I was never a big fan of the UI and how attachments were handled. Also, cross-channel search was kinda bad. Mind you, this was well over a year ago so I'm sure things have improved.
It isn’t just you, but due to Hacker News’ right-sized[0] infrastructure, you should sign out unless you need to comment. That way you hit the caches instead of getting the server to make you a new page.
That’s the wrong way to look at it. If HN struggles in certain situations, then it is not right-sized. You don’t beg users to walk an unintuitive happy path (i.e. log out when not commenting).
Other than it being days of slow news, with top stories seemingly pinned for days now and getting boring, no ;) You know, the kind of stretch where Slack being down counts as newsworthy (yawn).
We were just joking with the workmates -- Salesforce bought Tableau two years ago and hasn't ruined it yet, only because it takes them that long to do anything ;)