https://news.ycombinator.com/item?id=16108912 - 5 months ago (longer discussion)
https://news.ycombinator.com/item?id=15597461 - 7 months ago
https://news.ycombinator.com/item?id=15597431 - 8 months ago
https://news.ycombinator.com/item?id=13811815 - 1 year ago
https://news.ycombinator.com/item?id=10616743 - 3 years ago
Their expertise and discipline in distributed applications is unrivaled. I'm guessing it's because they have datacenters everywhere with huge fat pipes in between, and their SREs are probably top-notch and don't take shortcuts.
Google gets a whole bunch of things wrong at times, but some things, I gotta say, they've nailed.
Take Google Search for example. When there is downtime, results might be slightly less accurate, or the parcel tracking box might not appear, or the page won't say the "last visited" time beside search results.
The SREs are running around fixing whatever subsystem is down or broken, but you, the user, probably don't notice.
Driving features with microservices makes it easier to isolate their failure and just fall back to not having that feature. The trade-off is that monoliths are generally easier to work with when the product and team are small, and failure scenarios with distributed systems are often much more complex.
An analogy to your Google failure examples for slack might be something like the "somebody is typing" feature failing for some reason. In an SoA you would expect it to just stop working without breaking anything else, but one could easily imagine a monolith where it causes a cascading failure and takes the whole app down. Most services have countless dependencies like this.
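To make that concrete, here's a minimal sketch of that kind of graceful degradation, assuming a hypothetical internal typing-indicator service (the URL, endpoint, and response shape are all made up for illustration):

```python
import requests

# Hypothetical internal endpoint for the "somebody is typing" feature.
TYPING_SERVICE_URL = "http://typing-indicator.internal/api/active"

def get_typing_users(channel_id):
    """Ask the typing-indicator service who is typing; degrade to nobody on any failure."""
    try:
        resp = requests.get(TYPING_SERVICE_URL, params={"channel": channel_id}, timeout=0.2)
        resp.raise_for_status()
        return resp.json().get("users", [])
    except requests.RequestException:
        # The service is down or slow: this one feature silently disappears,
        # but message sending and everything else keeps working.
        return []
```

In a monolith, the equivalent code path is often one piece of shared state or one uncaught exception away from taking every page render down with it.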
I admin a number of GSuite accounts, and we experience fairly frequent (~monthly) periods of strange behavior with Hangouts/Meet, and Google Drive.
Fortunately Google is very good about providing updates via email to administrators as they're working through an issue.
Every service/team has to go through a period of growing pains as they learn, improve, and fix the code to be more stable. You can't just take the learnings from one service and apply them to another; stability has to be architected and written into the code, and most teams start each new project/service with fresh code.
I'd love to hear the story of how that made it through testing.
A fresh launch of Messenger worked until you switched out and put it in the background. When you tried to resume it (either from home icon or task switcher) it would immediately die and could be launched fresh on the second try.
Basically every time you wanted to use it you either had to kill it in the app switcher and then launch it, or launch it twice.
My favorite part is that since Facebook doesn't do useful release notes (best guess because they're testing different features on different users and changes never actually land for everyone in a specific version), all the App Store said for the busted version was "We update the app regularly to make it better for you!" Oooops.
Though that's an interesting thought, I wonder if a feature had rolled out to a subset of users and it was crashing because it tried to pull some piece of account info that doesn't exist on accounts without it? Testing still should have caught that, but if the test accounts were all testing the new feature I could see it sneaking through. From my end it looked like a 100% reproducible crash on resume which is pretty sad to release.
They use IRC.
> Google has found IRC to be a huge boon in incident response. IRC is very reliable and can be used as a log of communications about this event, and such a record is invaluable in keeping detailed state changes in mind. We’ve also written bots that log incident-related traffic (which is helpful for postmortem analysis), and other bots that log events such as alerts to the channel. IRC is also a convenient medium over which geographically distributed teams can coordinate.
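For flavor, a bare-bones sketch of that kind of channel-logging bot, assuming a raw socket connection to an internal ircd (server, channel, and nick are placeholders):

```python
import socket
import time

SERVER, PORT = "irc.example.internal", 6667   # placeholder internal ircd
CHANNEL, NICK = "#incident-1234", "logbot"

def run_log_bot(logfile="incident.log"):
    """Join an incident channel and append every message to a timestamped log for the postmortem."""
    sock = socket.create_connection((SERVER, PORT))
    sock.sendall(f"NICK {NICK}\r\nUSER {NICK} 0 * :incident logger\r\nJOIN {CHANNEL}\r\n".encode())
    buf = b""
    with open(logfile, "a") as log:
        while True:
            buf += sock.recv(4096)
            while b"\r\n" in buf:
                line, buf = buf.split(b"\r\n", 1)
                text = line.decode(errors="replace")
                if text.startswith("PING"):
                    sock.sendall(("PONG" + text[4:] + "\r\n").encode())  # keep the connection alive
                elif " PRIVMSG " in text:
                    log.write(f"{time.strftime('%Y-%m-%dT%H:%M:%S')} {text}\n")
                    log.flush()
```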
How many SREs are at Slack, working on keeping their systems up?
Finally, how many SREs could your company dedicate to keeping an internal IRC server up, and supporting it as an internal product?
I can throw ircd on a server, no problem, but there's a little bit more to six nines of uptime than `apt-get install`. The decision whether to use IRC or not should keep in mind Google's resources (in number of people, number of data centers, and amount of money to throw at redundant hardware) to make sure it never goes down, especially when the data center is on fire around you.
Note that, including this outage, they report having missed their monthly uptime guarantee three months in a row.
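For a sense of scale, the downtime budgets those availability targets allow work out like this:

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600

for label, availability in [("three nines", 0.999), ("four nines", 0.9999),
                            ("five nines", 0.99999), ("six nines", 0.999999)]:
    budget = SECONDS_PER_YEAR * (1 - availability)
    print(f"{label}: about {budget:,.0f} seconds of downtime per year")

# six nines is roughly 32 seconds of downtime per year -- not something a lone
# apt-get-installed box will ever deliver.
```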
This says a lot about how it's a single point of failure in modern company comms.
It's even worrying to think about how some users probably have production-dependent (dare I postulate it) workflows in Slack that get crippled by its outage...
ITT: Chat about decentralisation that will ultimately lead to no action.*
*Because we've had this discussion so many times before...
Although with outages like these, I doubt it!
Well, you can probably infer the former from the dependency on the latter. You use these tools because they can reduce the scrambling when shit does hit the fan, not because they are necessary.
- You don't get GitHub notifications on pull requests and comments, so things don't get reviewed and merged if developers aren't in the habit of checking the PR tab on GitHub itself.
- You don't get CI notifications so you won't know how your latest test run or deploy is going without going straight into the CI service itself. Even worse when there's a failure and you're too used to having Slack warn you about that.
- Your team might depend on Slack so much that they don't know how else to efficiently communicate, and the most efficient channel to communicate a fallback is not available or rarely checked (e.g. email, face to face). So you get a lot of chaos as people come up with dozens of alternatives.
This is just poor discipline more than anything, putting too many structural eggs into one basket, but it doesn't change the fact that Slack has created that dependency.
If anyone on my team came to me and cited Slack being down as a reason for their inability to do their job, then they wouldn't be on my team.
Is it less than ideal? Yes. Is it a little bit less efficient to pull info instead of having it pushed to you? Yes.
Is the sky falling? No.
* your HDD failed? Use RAID
* your power went out? Use a UPS
* your DNS went down? Use a fallback (slack2)
* your whole datacenter flooded? Good thing you have multiple replicated cloud instances that seamlessly take over
See, these are the issues that "the cloud" was supposed to solve. Not give us the same problems as before, just with a recurring bill for "chat as a service".
And inb4 "chill Mike it's just a chat server not life support firmware" yeah but slack is the most trivial software you can think of: send text from one computer to another. I see no reason this service can't be nearly as reliable as life support firmware in 2018. We've had over 30 years to get this right. Raise the fricking bar.
This is like saying that food service at 30k feet in a passenger airline is trivial because all the server has to do is walk up and down a narrow aisle handing out food from a cart.
Since "you see no reason this service can't be nearly as reliable as life support firmware", one of two things must be true:
1) You know something nobody else knows. In which case great, you've stumbled on a huge opportunity to go put your knowledge to work and get stupendously rich by outcompeting this "trivial" software company. Get to it, genius!
2) The reason you "see no reason..." is that you're unaware of one or more relevant facts.
Which of these do you think is more probable?
Eventually somebody is going to die because their pacemaker decided to throw cycles at mining monero.
2) Geodistributed datacenters are a VERY expensive thing, so they're not implemented fully.
3) Bad system design, full of single points of failure.
You buy servers on aws-us-west and aws-us-east, and sync them. How is that very expensive?
There was no accusation of "so easy", only of "not expensive" and supposedly (and previously, demonstrably) solved in the last 30 years.
They may well be "hard" or even "expensive" for some definition of those two words, but if they weren't, it would defeat much of the (stated/advertised) purpose of outsourcing/cloud.
A geodistributed datacenter requires multiple direct low-latency multi-gigabit/sec links, special software to manage, test, and check it, and skilled devops.
Specifically, you risk people piling on that rsync isn't good enough in the modern world and referencing the comment criticizing Dropbox as being little more than an rsync replacement.
Of course, the specific tool one uses is irrelevant. The data synchronization problem may not be well solved, but it has been very well studied, with a remarkable number of good-enough options.
So, no, there isn't just one "sync" button, as the parent comment snarkily suggested, but there may be two, one where you might lose the last N seconds of chat (perhaps temporarily) and another where you lose the ability to chat entirely for those N seconds.
Although it had other criticisms, such as monetization, which are, naturally, ignored.
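A toy illustration of those two buttons, with all names hypothetical: asynchronous replication acknowledges the write immediately but can lose whatever hasn't shipped yet, while synchronous replication refuses the write when the other region is unreachable:

```python
import queue
import threading

replica_log = []                   # stand-in for the other region's copy
replication_queue = queue.Queue()  # messages waiting to cross the WAN link

def remote_write(message):
    """Pretend to ship a message to the remote datacenter (would fail if the link were down)."""
    replica_log.append(message)

def async_replicator():
    """Background thread draining the queue; anything still queued during a crash is lost."""
    while True:
        remote_write(replication_queue.get())

def write_async(message, local_log):
    local_log.append(message)        # acknowledged immediately; replica may lag by N seconds
    replication_queue.put(message)

def write_sync(message, local_log):
    remote_write(message)            # blocks/fails if the replica is unreachable: no chat for N seconds
    local_log.append(message)

threading.Thread(target=async_replicator, daemon=True).start()
```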
This all goes back to one basic fact: The Cloud is Someone Else's Computer(tm).
If your self-hosted Confluence or Jira is down, you can go walk over to your IT team and they'll be like, "Yea, we know. We broke something. We're working on it." If you're using a hosted (a.k.a. "Cloud") solution, you're just kinda fucked. You can't even extract your data and try to run it locally if it's down (if that's even an option).
My experience with self hosted solutions is that they go down way more often and take longer to fix than cloud solutions.
Building the program to withstand failure after failure, of things in and out of your control. Seems like Slack needs some chaos engineers...
It's an interesting morning, to say the least.
However, if you become dependent on chat-ops to do your job (say, fallbacks for common things have eroded due to lack of use), then suddenly your company is crippled. And why? For a chat service? The value add from Slack is grotesquely small in isolation.
Spot on, not that I know anything about Discord's architecture...
Somewhere along the shift to flat design, grays and non-bright colors have been ignored in the visual design of applications.
I love how having a dark theme is second only to "it works" in terms of how we pick services these days XD
If it were my company, I would say as little as humanly possible.
If I pay for your service, and the only communication is, "We know there is a problem, and we'll let you know when it's fixed", I may assume you are not equipped to thoroughly explain the problem, and therefore not well equipped to solve it.
The blame is already assigned. The users already know there is a problem. A post-mortem likely has a positive effect on the reader's attitude toward the handling of the issue.
In other words: post-mortems are good PR, but incredibly bad advertising.
We learned a lot from the Gitlab outage. It was a simple mistake and not one they will have again, yet people still beat them up for it. I'm not sure the value is there for the company to be super open about their outages and issues.
In my experience, customers are often seeking an explanation/post-mortem because their customers are seeking an explanation. If an upstream service goes down for an extended period of time and all you can do is go back to your customers and say, "Your system was down because our provider's system went down for 4 hours. But they won't tell us why.", that does not go over well.
Anecdotal, sure, but people like me exist. I don't know if we're in the majority. You'd have to measure somehow and do a cost-benefit analysis I guess.
I am not suggesting Slack or anyone else should not communicate at all when they have an outage. A public postmortem, which many people are asking for, is one method. Is it the most effective method? I doubt it. Many people are suggesting that as paying customers they would like to know what happened. Does a public postmortem tell the paying customer what happened in an effective way? Maybe, but maybe not.
When I am running a company I care very much what my paying customers think and are feeling about my service. I will communicate issues directly to them. Do I need to explain to the rest of the world in some great technical detail what happened during an incident? Absolutely not. Do I need to have the first post in Google about my company be an outage postmortem? Of course not. I need my PAYING customers to be pleased with the service I offer and to understand how I will mitigate the damage I have done to them. To me, that's a basic principle of business. I don't have to explain to everyone. I owe everything to my paying customers. Gitlab did a postmortem almost immediately after a major outage and some people tried to slaughter them with the information they shared. It was sad and unfortunate. Their openness was met with some horrible results from the community.
Also, I use Slack. My company uses it for everything including ChatOps for my production environment deployment. We have a hundred or so active users. The outage this morning harmed us. But you know what? I don't pay for Slack. I owe a lot to Slack but they don't owe me anything. I can't blame them for my problems this morning. They are a free service to me. I appreciate that their absolutely free service serves my company so well almost all of the time.
My point is only that it does not have to be a large public explanation. You, or the decision maker at your company who pays a substantial sum of money to Slack for their service, should get an explanation, and keep pressing for one until you are satisfied.
Russia blocks a lot of AWS IPs; when I did a full VPN out to a server in Germany, Slack came good.
I recall AWS/GCP public IPs getting banned in Russia when they were trying to block telegram.
Which also makes their uptime totals completely bogus.
> Our team is still looking into the cause of the connectivity issues, and we'll continue to update you on our progress.
mIRC is pretty straight forward compared to that.
The reason is that when you sit there clicking through a bunch of menus to find something in QuickBooks (or a typical atrocious enterprise app), nobody sees you; and if you screw something up there, you spend some more time fixing it and nobody sees the screwup. Frustrating maybe, as you waste time, but almost everyone has some frustrating wastes of time at work.
If you're on IRC and people are talking at you and you sit there fumbling to figure out how to respond, it's like you're in a conversation and tongue-tied and everyone's looking at you. And if you screw something up, like send a message to the wrong channel... now you've done it in front of all your coworkers, in real time. Humans hate looking stupid in front of the group.
And if you screw something up on IRC in front of your coworkers, and you're someone with even a little anxiety about not being tech-savvy... that's going to flare right up.
Also, because now you're embarrassed, you're going to want something to blame. So you blame the tool.
QuickBooks doesn't have to suffer in comparison to better UX for similar tasks in people's personal lives, while IRC can be compared (unfavorably) to texting apps, Facebook Messenger, Twitter, AIM once upon a time, etc.
If you're setting it up in a corporate environment, just change the ini files so it autoconnects to your server. It'll pop up a list of channels they can join. The server can SAJoin them to particular channels on connection too. The UI is very clean and lightweight: a channel scrolls messages and they appear, there's an input bar at the bottom, and there's a list of users on the side. It's written in MFC and Win32 APIs, so it's blazingly fast compared to most applications, and you can find a version that will run on every computer made in the past 25 years.
The United States military used mIRC extensively for battlefield coordination. I think it's up to the task of handling Bob from accounting.
Even if mIRC would suffice for Windows, you've not handled Macs, phones, etc. Who gives a shit if it runs on a 25 year old computer?
The US military has produced some specific examples of good design but isn't known for highly valuing usability, let alone whether someone would enjoy using a tool. IRC is very functional and mIRC appears to add a little polish beyond a pure command-line interface, but those are bare minimums and not good enough.
> It has a sidebar to list channels but the current channel window is still an undifferentiated mess of handles, commands, and actual conversation.
Each channel and private message gets its own MDI window you are free to minimize, maximize, or lay out however you want.
Notifications are turned on by default, but they can be disabled. You'll get a tray notification if mIRC is minimized, and the title bar of the window will flash. Notifications happen when your nick is mentioned.
There's a horizontal line that goes across the dialog window that indicates the location of the conversation the last time it was focused.
>Even if mIRC would suffice for Windows, you've not handled Macs, phones, etc. Who gives a shit if it runs on a 25 year old computer?
Other clients work on other platforms. mIRC is just what I brought up since it's a desktop Windows client, and that's the most common case for an office environment.
> The US military has produced some specific examples of good design but isn't known for highly valuing usability, let alone whether someone would enjoy using a tool. IRC is very functional and mIRC appears to add a little polish beyond a pure command-line interface, but those are bare minimums and not good enough.
It's a simple, light-weight way for people to send short text messages in near real time with tens of thousands of people. I think that's good enough, and it works at a scale that far surpasses the SaaS chat options.
Matrix has Riot (https://riot.im/app) but personally I find it incredibly confusing.
I don't think it's a problem if something needs to be initially deployed and configured by an IT department (or otherwise tech savvy individual or group), as long as its onboarding and primary usage flows are straightforward. An arbitrary non-tech-savvy but internet-familiar employee needs to be able to create an account, browse and join rooms, and search through history without any hand-holding. Slack and its direct competitors pass this test. IRC doesn't. Does Spark?
How’s that work exactly?
Edit: It’s now dropped to #14 even with the comment count also rapidly increasing.
edit: it's a negative factor...
> I really don't understand these types of questions. The possible answers range from "because the ranking works that way" to "someone with privileges wanted it that way". On either end of the spectrum, the real question remains: so what? What difference does it make why a particular post is in a particular position? If the title seems interesting, you click on it. If not, you move on.
> I don't mean to question you in particular. It just seems like such a trivial concern to me that I truly can't understand why someone might possibly care.
@dang, care to comment?
That being said, although I like some of Slack's fancy features, I do wish a distributed alternative could catch on.
I prefer decentralized and open things, but a cohesive vision can sometimes provide a better user experience across a more restricted set of functionality than an army of hackers, each solving their own problems.
Large networks can have the servers go up and down, and it's still not a big deal because of redundancy. DNS round-robin entries mean you don't even have to know the other servers on the network.
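Roughly, a client only needs the one round-robin name; a minimal sketch with a hypothetical hostname:

```python
import socket

def connect_to_network(hostname="irc.example.net", port=6667):
    """Resolve a round-robin DNS name and try each returned address until one answers."""
    # getaddrinfo returns every A/AAAA record behind the name, so the client
    # never needs a hard-coded list of the network's servers.
    for family, socktype, proto, _, addr in socket.getaddrinfo(hostname, port, type=socket.SOCK_STREAM):
        try:
            sock = socket.socket(family, socktype, proto)
            sock.settimeout(5)
            sock.connect(addr)
            return sock
        except OSError:
            continue  # that server is down or unreachable; fall through to the next record
    raise ConnectionError(f"no reachable servers behind {hostname}")
```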
In 2018 netsplits caused by down links are fairly rare. If you wait six months you might see one.
There’s nothing that gets around technical failure. Either you have a single server that’s going to die at some point due to sheer entropy, or you have a somewhat complex distributed system with the tradeoffs you desire that might fail anyway.
There’s a lot more that can go wrong than that a database falls over. In my experience, IRC servers fall over all the time - it’s just that nobody really cares because their clients just connect to the next server in the list and people resume their conversations a minute later after figuring out what messages actually reached their destination.
Paying IRCCloud to manage an IRC server for you is a reasonable option, but I wouldn’t do it because I think it’s going to be more available, but because I like IRC and believe it provides the functionality I need.
Don't use a 20 year old ircd then. Use something like ratbox or InspIRCd.
> or because you accidentally K-lined 0.0.0.0/32 or because you accidentally filled up the disk with logs because the server’s maintainer was fired and nobody remembers how the system works, or the latest system update borked something
Don't let 14 year olds run your server.
Back when I IRCd regularly (and perhaps this is still the case today), certain servers would get a reputation for splitting more than others, and I think this network (and/or its ircd) decided to mask it without breaking the general format.
I can just see it now: a company's app is dying from all the timeouts to a Slack webhook, but they can't deploy a fix because Slack is down.
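The usual defensive wrapper looks something like this (the webhook URL is a placeholder), so a Slack outage costs you notifications rather than the app:

```python
import logging
import requests

WEBHOOK_URL = "https://hooks.slack.com/services/T000/B000/XXXXXXXX"  # placeholder

def notify_slack(text):
    """Best-effort notification: never let a Slack outage propagate into the calling code."""
    try:
        requests.post(WEBHOOK_URL, json={"text": text}, timeout=2)
    except requests.RequestException:
        logging.warning("Slack webhook unreachable; dropping notification: %s", text)
```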
I keep getting automated push notifications from our bots (but I still can't connect to the app myself).
We use it for things like relaying user-flagged messages to our support team and reminding us when scheduled content has been automatically released.
They might as well have written:
- nope but maybe at some point a yep
- still nope
>We're happy to report that workspaces should be able to connect again, as we've isolated the problem. Some folks may need to refresh (Ctrl + R or Cmd + R). If you're still experiencing issues, please drop us a line
Hilariously, their "uptime in the last 30 days" still shows 100%.
IBM, Oracle and many large companies use it because 100,000+ participants in one workspace is quite unmanageable.
Think channel namespacing whilst unifying user provisioning and enabling DM and MPDM across the entire company. Users can have access to one or many namespaces; they sign in once and it populates all enabled workspaces into that user's client.
You can share channels between workspaces within Enterprise Grid fairly trivially (although this now works between Slack tenancies owned by different companies too!)
Still runs on the same infrastructure in AWS as other Slack customers though.
From a policy perspective you can push down settings to all Workspaces in your SEG, and define whether you “centrally control” or “delegate to Workspace owners” on a setting-by-setting basis.
- I am in UK
- I had similar problems last night (around 2AM GMT) but status.slack.com was all green, and my colleagues in the US seemed to be using it okay
- Currently it's completely down on desktop for me (waiting to reconnect...)
- Connecting through a US VPN does not resolve the problem on Desktop, even though my US colleagues are using it on Desktop successfully right now
- Mobile works for receiving and sending messages, but there is a delay
Anyone else seeing symptoms like this?
> We have no new information to share just yet, but we're continuing our efforts. Your patience is truly appreciated.
Or Atlassian's Stride is really great too: https://www.stride.com/
If your org happens to be part of the Microsoft Office 365 ecosystem, there's Microsoft Teams. All of the products support bots, GIFs, and emojis. I personally think Google Chat and Stride are much faster than Slack too. I haven't tried Microsoft Teams yet.
/me throws shoe at slack failed with the error "ASSocket: timed out reading 4 bytes from adminserver-3wvr:10443"
We've switched to Google Hangouts as an ad-hoc workaround.
The protocol itself is federated, so you can communicate with other Matrix users from your self-hosted instance. There are also bridges to IRC, XMPP, even Slack.
They also have a Slack alternative, called Stride.
There’s stuff like Rocket Chat, and I think Dropbox released one as OSS a while back resulting from an acquihire, but the name escapes me now.
At least it happened before we migrated.
I love IRC and XMPP. I'd love to run one of those, or some new service (Matrix?), at work. However, my time is arguably better spent doing anything _other_ than maintaining such services, and the same goes for most engineers at most companies, sadly.
Side factor: the mobile clients for IRC and XMPP almost universally suck, at least on Android. I imagine if those problems had been solved in a reliable way, more companies may consider them (assuming the allocation of engineering resources problem isn't a problem).
- Scheduled electrical maintenance that facilities manager failed to disclose (even though they knew about it for weeks).
- Emergency power-down because two of the four air conditioners failed at the same time.
- Someone accidentally powered off the VM.
I'd much rather have an hour long outage here and there than incur the cost of defending against these circumstances (and still have it go down for some new unforeseen reason).
how is that self-hosting when you don't control the hypervisor in this case?
it usually implies that you at least have some sort of control: either having a real server somewhere (with a UPS and stuff) or at home, where you know when the power is out.
while what you are doing is technically self-hosting, I would have changed the VM provider after the first incident like you described.
most people are much better off letting dedicated teams of tens of people take care of the hosted version for them.