A terrible, horrible, no-good, very bad day at Slack (slack.engineering)
631 points by ceohockey60 10 months ago | 270 comments

TL;DR: First, a performance bug was caught during rollout and rolled back within a few minutes. However, this triggered their web-app auto-scaling to ramp up to more instances than a hard limit they had. This in turn triggered a bug in how they update the list of hosts in their load balancer, causing it to miss new instances and eventually go stale. After 8 hours, the only real remaining instances in the list were the few oldest ones. So when they then scaled the number of instances back down, those old instances were the first to be shut down, causing the outage: the instances that should have taken over were not listed in the stale load balancer host list.
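The chain above can be sketched as a toy simulation (all names, and the simplified slot-table behavior, are illustrative — this is not Slack's actual code):

```python
# Toy reproduction of the failure mode: a sync tool with a fixed number
# of "slots" stops updating the LB's host list once the slots fill up,
# so after hours of churn only the oldest instances remain listed;
# scaling down then terminates exactly those instances.
class StaleLoadBalancer:
    def __init__(self, slots):
        self.slots = slots
        self.hosts = []          # what the LB believes is serving

    def sync(self, live_hosts):
        # BUG (simplified): tries to place new hosts before freeing
        # dead ones, and bails out entirely when no free slot exists.
        new = [h for h in live_hosts if h not in self.hosts]
        if new and len(self.hosts) + len(new) > self.slots:
            return False         # "crashed": host list goes stale
        self.hosts = [h for h in self.hosts if h in live_hosts] + new
        return True

lb = StaleLoadBalancer(slots=4)
live = ["i-1", "i-2", "i-3", "i-4"]    # oldest first
lb.sync(live)

# Autoscaling churn: new instances appear, sync starts failing.
live = ["i-1", "i-2", "i-5", "i-6", "i-7"]
assert not lb.sync(live)               # list is stale from here on

# Scale-down terminates the oldest instances first -- the only
# ones the LB still knows about.
live = ["i-5", "i-6", "i-7"]
serving = [h for h in lb.hosts if h in live]
assert serving == []                   # outage: every listed host is dead
```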

It almost seems like no one group understands the system as a whole, so when one part fails, no one has a clear idea of the domino effects that can happen. I'm guessing this is the result of really complex systems interactions.

Which is what always happens when the company aims to build an "engineering playground" full of microservices and other moving parts without a clear technical justification and balancing the pros & cons and why I personally don't like working on such projects - it makes me feel uneasy not having a good understanding of the entire system.

To be fair to Slack, at their scale, lots of moving parts might make sense, but I see a lot of companies (including startups with very few customers) going down the microservices route and exposing themselves to such a risk when there is no major upside beyond giving engineers lots of toys to play with and slapping the "microservices" and related buzzwords on their careers page.

I think you are drawing the wrong conclusions here... Microservices is not the boogyman here. It more likely has to do with speed of development, developer turnover and a plethora of other things that result in insufficient knowledge transfer.

Microservices (like just about anything) can be implemented well or poorly. There's a reason we have sophisticated orchestration solutions like Kubernetes... it exists to tame large-scale deployments with sensible failover processes.

The benefits you get are services that can be scaled independently, deployments that only affect isolated pieces of code, horizontal scaling, Dockerized environments, etc. All of these advantages should exist in well-designed systems, but systems that have been executed hastily will likely have critical problems crop up at some point.

I think that, all other things considered equal, a microservices architecture has more moving parts and is more complex to learn than a monolith. The same applies to Kubernetes: it's one extra layer of abstraction and moving parts that you need to understand and take into account when developing or making changes to an existing system.

I am not saying that microservices are a problem for Slack (at their scale they can make sense). I am expressing my overall concern about smaller companies going down the same route when their scale or the problem being solved doesn't justify it: they end up having to deal with the (self-inflicted) problems of a massively distributed system with no major upside. That is also why I personally feel uneasy working on systems where I don't have a full overview of how they work and their potential failure modes.

When it comes to the benefits of microservices I am not sure whether those are all worthwhile considering the overhead and extra complexity of development on a microservices architecture.

In a monolithic application, most data and functions you might need are just a function call away and you typically have one or a handful of databases to interact with, often abstracted away by an ORM. In a microservices architecture, you suddenly need to worry about serialization, authentication and communication between services (and its failure modes, etc) and might require coordinated changes across several services, each of which might use a different language/framework and deployment process.

In terms of getting started, it has always been easier for me to work on a monolith where the codebase makes up for bad/no documentation because my IDE can resolve the majority of the symbols and allows me to see where the data I need lives and where it's being used. In a microservices architecture all of that goes out the window and you need to do a lot more manual "discovery" work searching through the documentation (if there is documentation, which is not a given) and manually figuring out the RPC calls because IDEs typically can't resolve cross-service communications.

Running a monolithic application locally is a lot easier than a microservices architecture. For the former you can typically get away with just a database and cache server all running natively. The latter pretty much imposes a container-based stack where you are now running 10 databases, caches, reverse proxies and everything involved around service discovery, which adds yet another layer of abstraction and makes you spend more time on this useless plumbing than actually getting work done and delivering business value.

Kubernetes is for scaling; the microservice complexity is about the spaghetti dependency graph between services, the inconsistency effects and so on. Believe me, you'd better have an infrastructure/development process that justifies them.

I like your statement but I don't think it has to do with the speed of development either.

In a monolithic architecture, the devs that deal with it have to deal with the program as a whole. So if something doesn't work, it's their problem. Whereas in a microservice architecture, it can be easy to spin up a service and not know the systems that integrate with it.

The problem here is with documentation and understanding of architecture. It's just the nature of the beast that the monolith dev knows how things communicate within the monolithic program, because he needs to know in order to do his job. In this instance the problem isn't with microservices, it's with the execution. And that execution is a very easy trap to fall into with microservices.

The same thing can happen in a monolith. Make a change, run tests, and make sure your new feature works, while breaking some untested behavior you didn't know about in some other part of the monolith.

Obviously he means boogeyman. Even on your page, it suggests another spelling. What a useless comment.

It's a load balancer bug. That's it. Everything else functioned totally normally from what I can see. I think people are blowing this out of proportion, saying everything is so complex or it's all microservices, oh no!

This exact bug could have bitten a large monolithic app running on a VM.

No, it's not a load balancer bug. It was a bug in the home-built program for managing config load-ins TO the load balancer. The load balancer did exactly what it was instructed to do--but the Slack-developed app for helping the load balancer deal with their overly complicated backend discovery problems failed to keep the load balancers updated with accurate information on the backend servers available for serving.

My sentiments exactly.

Feels like you've just described every single bug ever.

Even at the start I'm sure Grace Hopper didn't anticipate a computer with a moth in it.


Woods' Theorem: As the complexity of a system increases, the accuracy of any single agent's own model of that system decreases rapidly.

from: https://snafucatchers.github.io/

Possibly, but I think it's equally if not more likely a result of "agile" and teams not communicating due to product owners and the like not wanting other teams to work together etc.

Microservices are really cool™

This is far from a microservice problem. I work on a medium-sized monolithic Rails app, and I regularly have issues due to not understanding the whole situation. Even when working in a small part of the app I know super well, sometimes other devs change things, so the way it works in my mind is not how it works anymore.

TL;DR meta point is: "The reason that we haven’t been doing any significant work on this HAProxy stack is that we’re moving towards Envoy Proxy for all of our ingress load-balancing"

The legacy system supporting Slack in production was heavily resource-constrained as they were moving to a new fancy system. Slack admits here that the legacy system likely wasn't getting the attention it needed and lo-and-behold it started failing in mysterious ways.

Organizational failure by not properly calculating all the risks caused by rotating out part of their load-balancing system. They probably should've asked for more budget here to keep their existing system functional as they slowly transitioned to their new system.

They admit that COVID caused all their systems to become stressed; they probably had appropriately budgeted for the transition to Envoy when they asked management (probably pre-COVID). The team likely was never meant to support both the load they're now seeing during COVID and the transition to a new system.

Either way during any transition, there's a period where you must support both systems at full capacity until the legacy system can be gracefully decommissioned.

I really feel this comment. Especially during transition it’s hard to see the “to-be legacy” system as worthy of the effort because the new stuff will be here so soon! Until it isn’t because life happens and you’re left with a really shaky platform. These are tough investment decisions especially when resource constrained. As a rule, perhaps ensuring that current state platforms are secure, as you say, before attempting a migration is the best way to go, but one person’s critical work is another person’s “too much insurance.”

That's an excellent TL;DR, thank you!

Postmortems should start with a summary paragraph like the above, and then go into story and full details below.

BLUF style–Bottom Line Up Front. Often used by the military for field action reports.

The internal Google postmortem format does require a 1-2 sentence summary. Absolutely vital for being able to learn from mistakes. No one is going to parse walls of text when browsing through post-mortems.

Here is an example of such a postmortem format if people wonder: https://landing.google.com/sre/sre-book/chapters/postmortem/

This. The linked post is unreadable.


I used to work for an Independent System Operator (ISO)[0] and we used to have a sometimes painful process for rolling out any changes to production. It's been a while and I don't remember it all, but it went something like this:

1. Fill out a change request (CR) form and print it.

2. Have it signed by your manager and the managers of every system it touched, including business owners.

3. Attend the 2x a week meeting and explain your CR. In this meeting, explain what was happening, why, who authorized it, what to do if it failed, what to do if it initially worked but failed later (e.g, on a weekend).

4. Hope your CR passes the vote.

5. Implement your roll-out plan.

This is a robust process. Where it breaks down--I felt--is when you need to fix a typo on the public-facing website that's managed by a CMS.

0 - https://www.eesi.org/files/070913_Jay_Caspary.pdf

Thanks! They should include this TL;DR in their post!

Seems the root cause is that instead of using an established system like Kubernetes, they rolled their own and didn't engineer and test it as well as such an endeavor would have required.

"I'm still not understanding why it's so hard to display the birthday date on the settings page. Why can't we get this done this quarter?"

Look, I'm sorry, we've been over this. It's the design of our back-end. First there's this thing called the Bingo service. See, Bingo knows everyone's name-o, so we get the user's ID out of there. And from Bingo, we can call Papaya and MBS (Magic Baby Service) to get that user ID and turn it into a user session token. We can validate those with LNMOP. And then once we have that we can finally pull the user's info down from Raccoon.

Reference: https://www.youtube.com/watch?v=y8OnoxKotPQ

I revisit this video every now and then.

It's all I could think of while reading the article. The spiraling-down into complexity and service names had me smiling all the way. The "we should have predicted this" in the last paragraph sealed the deal. And the HN comments saying they should have known better and used ZingBing from the start had me rolling.

"You think you know what it takes to tell the user it's their birthday?!"

This one always cuts too close to home


Can't wait for BallmerCon this year. The XLOOKUP meta is going to spice things up.


Complete with indefatigable acronyms and insane naming conventions.

This guy engineerings.

I had trouble getting through this article because my internal monologue was screaming "Envoy and xDS wouldn't have this problem". But that's exactly what they decided ;) HAProxy is a little behind the state of the art on "hey I could just ask some server where the backends are", and it shows in this case. (The "slots" are particularly alarming, as is having to restart when backends come and go.)

xDS lets you give your frontend proxy a complete view of your whole system -- where the other proxies are (region/AZ/machine) and where the backends are, and how many of those. It can then make very good load balancing decisions -- preferring backends in the AZ that the frontend proxy is in, but intelligently spilling over if some other AZ is missing a frontend proxy or has fewer backends. And it emits metrics for each load balancing decision, so you can detect problems or unexpected balancing decisions before it results in an outage. https://www.envoyproxy.io/docs/envoy/latest/intro/arch_overv...

I also like the other features that Envoy has -- it can start distributed traces, it gives every request a unique ID so you can correlate application and frontend proxy logs, it has a ton of counters/metrics for everything it does, and it can pick apart HTTP to balance requests (rather than TCP connections) between backends. It can also retry failed requests, so that users don't see transient errors (especially during rollouts). And its retry logic is smart, so that if your requests are failing because a shared backend is down (i.e. your database blew up), it breaks the circuit for a period of time and lets your app potentially recover.

The result is a good experience for end users sending you traffic, and extreme visibility into every incoming request. Mysterious problems become easy to debug just by looking at a dashboard, or perhaps by clicking into your tracing UI in the worst case.

The disadvantage is that it doesn't really support any service discovery other than DNS out of the box. I had to write github.com/jrockway/ekglue to use Kubernetes service discovery to map services to Envoy's "clusters" (upstreams/backends), but I'm glad I did because it works beautifully. Envoy can take advantage of everything that Kubernetes knows about the service, which results in less config to write and a more robust application. (For example, it knows about backends that Kubernetes considers unready -- if all your backends are unready, Envoy will "panic" and try sending them traffic anyway. This can result in less downtime if your readiness check is broken or there's a lot of churn during a big rollout.)

Isn't Ambassador doing the same thing?

Btw not sure if you read till the end, they are actually in the process of migrating to Envoy.

Ambassador is an Envoy control plane: it uses Envoy to do the data proxying, but it handles setting Envoy up.

So yes, it is :)

Super interesting post. Following blog links, the timeline in https://slack.engineering/all-hands-on-deck-91d6986c3ee also offers a look at the play by play.

However, as far as I can read it, they have somewhat different views on the root cause?

"Soon, it became clear we had stale HAProxy configuration files, as a result of linting errors preventing re-rendering of the configuration."


"The program which synced the host list generated by consul template with the HAProxy server state had a bug. It always attempted to find a slot for new webapp instances before it freed slots taken up by old webapp instances that were no longer running. This program began to fail and exit early because it was unable to find any empty slots, meaning that the running HAProxy instances weren’t getting their state updated. As the day passed and the webapp autoscaling group scaled up and down, the list of backends in the HAProxy state became more and more stale."

Maybe a combination of the two?

The way they are doing things: HAProxy is configured with a fixed number of slots. This effectively acts as a maximum limit, so it should be enough for the running instances plus newer instances coming up at any time due to autoscaling.

They have a tool listening to applications starting and shutting down. It's adjusting the configuration live while running to remove shut down instances (free a slot) and put in newer instances (find a free slot and reconfigure).

From the explanation on that day, there were more instances than usual due to high load. It seems the tool was looking for a free slot at some point and found none and crashed.

I'd say it's an issue with capacity planning, because they didn't plan enough slots for their infra under high load, and an issue with the tool, because it shouldn't fail silently when out of slots.
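The ordering bug described in the post can be sketched like this (function and host names are hypothetical; the point is only the order of operations):

```python
# The sync program looked for a free slot for each new instance BEFORE
# releasing the slots of instances that had gone away, so a full slot
# table made it exit early even though slots were logically reclaimable.
def sync_buggy(slots, live):
    for host in live:
        if host not in slots:
            if None not in slots:
                raise RuntimeError("no empty slot")  # early exit, state goes stale
            slots[slots.index(None)] = host
    for i, host in enumerate(slots):       # never reached when the table is full
        if host is not None and host not in live:
            slots[i] = None

def sync_fixed(slots, live):
    # Free the slots of departed instances first, then place new ones.
    for i, host in enumerate(slots):
        if host is not None and host not in live:
            slots[i] = None
    for host in live:
        if host not in slots:
            slots[slots.index(None)] = host

slots = ["a", "b", "c"]                    # table is full
try:
    sync_buggy(list(slots), ["b", "c", "d"])
    crashed = False
except RuntimeError:
    crashed = True
assert crashed                             # buggy order fails on a full table

fixed = list(slots)
sync_fixed(fixed, ["b", "c", "d"])
assert set(fixed) == {"b", "c", "d"}       # "a" freed, "d" placed
```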

"640K ought to be enough for anybody!"

Eventually, seemingly sane assumptions become anachronistic laughing points. (Even if they're apocryphal...)

This is not a sane assumption, and he never said it, it's a myth.

I disagree, specifically depending on when that assumption might have been made. My first computer came with 1KB of RAM, expandable to 16KB. I have no doubt its designers, and most of their peers at the time, made similar assumptions. My second computer had interesting "bank switching" to circumvent the 16-bit address limitations of its 8-bit CPU, which could only address 64KB, and managed to mostly usefully have 128KB of RAM in it. I suspect its designers would have also happily made an assumption about 640KB being "enough for anyone".

(Also, maybe you should look up the definition of "apocryphal"? I know he never said it, and strongly alluded to that, and didn't attribute it to Bill for that reason...)

> they have somewhat different views on the root cause?

It sounds like an issue with naming the failure pattern rather than understanding it. The root cause was equivalent to a memory leak in their custom auto scaling process; machine instances were not being freed (an “instance leak”). The fixed resource limit was self-induced by a hard-coded ratio between the number of proxy servers to web servers.

Historically, the fixed ratio never reached a point where the “instance leak” caused failures but on one specific “Terrible, Horrible, No-Good, Very Bad Day” it failed badly.

Honestly it’s a bit tough for me to parse, but the way I’m reading it,

1. Stale configs led to an overabundance of web apps, and then

2. Old instances of the web app couldn’t be removed because of the consul-template bug.

so, yes, a combination (in sequence) of the two.

Hard for me to be sure because I’m by no means knowledgeable on this stuff.

Even easier way to understand what happened:

- slots full

- to update slots with a new host you need an empty slot

- hosts went away but updating config was impossible -> errors because config referenced non-existing hosts

Agree but one more:

- monitoring was broken, so we didn't learn about it until it was too late

Real root cause was a poorly written home-built tool to manage HAProxy configs. The tool did not handle the slots being full and crashed. Respawn, rinse, and repeat. The HAProxy config got stale, and when their automated tools removed servers, they started with the machines HAProxy actually knew about, and then the service died.

> ” The reason that we haven’t been doing any significant work on this HAProxy stack is that we’re moving towards Envoy Proxy ”

I truly don’t understand the cycle of churn. Once most of the edge cases and bugs of HAProxy have been found, the right decision is not to migrate to completely unknown territory again. No project is a silver bullet, and changing stacks after you find the bugs makes for a terrible return on your bug-hunting investments.

I think if you read this as they’re moving to Envoy because of this incident then you’ve misread.

But it also sounds like Envoy and HAProxy have fundamentally different approaches to service discovery:



And if there is no path forward to resolve the bugs? How many rough edges and bugs can you tolerate before a system is no longer worth investing in, especially when another newer system does not suffer from any of those bugs? To say there is no tipping point where the cost of continuing to use a system which no longer fits your needs exceeds the cost of migrating to a system which does (and will continue to) fit your needs ignores some amount of common sense.

Great writeup. It's cool that they were able to figure it out as quickly as they did, all things considered.

If I were brought in as a consultant on this, my first question would be: why are you using a fleet of HAProxies instead of the ALB? I'm not saying that's a bad choice, but I'd want to know why that choice was made.

The second question I would ask is what kind of Chaos Engineering they are doing. Are they doing major traffic failover tests? Rapid upscaling tests? Random terminations?

Those are probably the first two things I'd want to solve.

In my experience, ALB scaling is opaque and constantly lags behind herds of traffic. This is very noticeable if you're running services that experience sudden spikes: there's a perceptible lag in ALB scaling.

That said though, it does do its job for the most part.

I hope all of these questions would be asked only after everything was working again!

Of course. :) These are things I would ask during a post-mortem.

Interesting bit:

> The program which synced the host list generated by consul template with the HAProxy server state had a bug. It always attempted to find a slot for new webapp instances before it freed slots taken up by old webapp instances that were no longer running. This program began to fail and exit early because it was unable to find any empty slots, meaning that the running HAProxy instances weren’t getting their state updated

I've just been bitten by this too:

  The broken monitoring hadn’t been noticed partly because this system ‘just worked’ for a long time, and didn’t require any change.
Any experience on how to deal with it? Who watches the watchers?

There's a reason the military does drills.

Do them regularly, and keep your playbook of mock failure scenarios up to date with good coverage of all your systems. It's especially critical for disaster recovery (a DR plan that's never tested isn't worth the paper it's written on).

Consider going one step further and randomly injecting artificial failures into production shards, so handling them becomes a regular affair. When you build out monitoring, that might be a good time to think about the kind of stimulus that would be effective at exercising those monitors/metrics/alerts (think unit testing analogy). You can automate an evil adversary bot, or have humans do Red team / Blue team challenges. Yes, you're limited how much you can break without too severely impacting production, but if you've designed enough redundancy you should be able to achieve reasonable coverage. The more you can engineer simulated failures into your regular workflow, the less of a big deal it'll be when real ones occur.

Instead of watching the watchers, engage them with stimulus that keeps them sharp and prevents them from getting bored.
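The drill idea above can be reduced to a minimal failure-injection loop (entirely illustrative; a real chaos tool kills actual processes or instances rather than toggling flags):

```python
# Minimal failure-injection sketch: repeatedly kill a random replica
# and assert the cluster still serves, which exercises both failover
# and the alerting around it -- the "stimulus" for the watchers.
import random

class Cluster:
    def __init__(self, n):
        self.replicas = {f"r{i}": True for i in range(n)}

    def kill(self, name):
        self.replicas[name] = False

    def heal(self):
        for name in self.replicas:
            self.replicas[name] = True

    def can_serve(self):
        return any(self.replicas.values())

random.seed(0)
cluster = Cluster(3)
for _ in range(100):                       # regular drills, not a one-off
    victim = random.choice(list(cluster.replicas))
    cluster.kill(victim)
    assert cluster.can_serve(), "drill found a total outage"
    cluster.heal()                         # the recovery step of the drill
```

With a single replica, the same loop fails on the first iteration, which is exactly the kind of redundancy gap a drill is meant to surface.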

>Do them regularly, and keep your playbook of mock failure scenarios up to date with good coverage of all your systems.

All those man hours cost money. Is it worth it? Depends on how much downtime costs you.

Most companies don't know how much downtime costs them and from their behaviour seem to treat it either as $0/hr or $infinite/hr. To be fair, it's difficult to estimate correctly. Apart from the direct loss of income there is also reputation loss, possibly reduced signups and increased churn, etc. It's not at all constant either, one outage per 10 years will give you a radically different reputation than one outage per week.

It's also working out which outages matter the most. A 10-day outage is not equivalent to ten 1-day outages; the latter has a multitude of important reputational risks - but the former is quite possibly enough to end the business completely.

Depends on how you value the potential outcomes. A while ago, I think it was Gitlab that didn't check their backups and they were a dud.

That caused quite a few problems...How much is it worth to them after the fact? I'd say a fair bit.

But hindsight...

If you run a fairly discretionary web app, like perhaps a game where downtime will cost you lost revenue for the time period but that's it, it's probably not worth it.

If you are running a communications platform like Slack or Gmail, it's hella worth it.

Chaos Engineering comes into play here. Deliberately break your platform to see if everything works as expected.

But how, exactly?

Sure, the chaos monkey could kill haproxy-server-state-management but that wouldn't uncover the bug in question — it'd just demonstrate that without it running HAProxy's view of the world goes stale, which anyone would expect. Triggering the bug would require reducing the number of HAProxy slots below the number of webapps running for many hours. This is clearly something chaos engineering could do, but IMO it's highly unlikely anyone would think to do this. If they thought of this, they would also have thought about adding tests that caught such an issue long before the code went into production.

In my experience chaos engineering is often only as good as the amount of thought put into the things it does. Killing processes here and there can be useful but it often won't expose the kind of when-the-stars-align issues that take down infrastructures.

It looks like a classic lack of monitoring, as the article says. Alerting on webapps > slots, early exits, or differing views of the number of webapps up would have likely caught this.

> Sure, the chaos monkey could kill haproxy-server-state-management but that wouldn't uncover the bug in question

No it won't. But it would uncover their missing alerts for a critical platform component. Their issue was exacerbated by the fact that state-management kept failing for nearly 12 hours and no one noticed.

Maybe, though again I would expect an alert for the process being dead would also not uncover this particular issue. It's possible someone would notice that there was insufficient alerting then go back to add something which would have caught this but it's far from certain. OTOH it's also possible that at this stage in the game, when the code and systems are EOL, that something like chaos testing is the last chance to catch this problem.

I'm not totally against chaos testing. I just haven't seen it done well and think it's actually pretty hard to pull off (particularly the non-technical aspect of convincing people it's okay to let this thing go mad). I'd love to see how effective it was within Netflix.

> Chaos Engineering

...or just "availability testing" or simply "testing".

Yet, testing systemic failure modes at scale is way more tricky than shutting down some VMs or some network devices.

For example: saturating the uplink bandwidth on a whole datacenter.

To add to that, Netflix has open sourced a tool called Chaos Monkey which randomly terminates production instances so that you can be sure that your overall system is resilient to instance failures.

I don't think that would have ever provoked this error mode, though.

It would have. Their haproxy-server-state-management also comes under the scope of chaos testing. It was failing, but they never set up alerts for it. So they could have seen that their chaos engine killed this process, but they never got notified.

It might have prevented this failure by killing stale HAProxy instances...

To catch this particular issue, they would have had to overprovision webapp instances beyond their N*M limit, take down a subset of the originally provisioned webapp instances and reprovision new ones in order to have observed this particular issue with their HAProxy setup.

It seems like it would take a superhuman level of foresight to catch this scenario for chaos engineering.

This same sort of thing tends to creep into backup systems that work for a long time without incident, if the backups are never restored. The more reliable the systems they back up, the bigger the chance that a restore will fail when you need it most. So test your backups.
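A restore drill can be automated; here's a minimal sketch using SQLite as a stand-in for whatever datastore is being backed up:

```python
# Automated restore drill: back up a SQLite database, open the backup
# as a fresh connection, and verify the restored copy actually answers
# queries -- the step that gets skipped when backups "just work".
import os
import sqlite3
import tempfile

src = sqlite3.connect(":memory:")
src.execute("CREATE TABLE users (id INTEGER, name TEXT)")
src.execute("INSERT INTO users VALUES (1, 'alice'), (2, 'bob')")
src.commit()

backup_path = os.path.join(tempfile.mkdtemp(), "backup.db")
dst = sqlite3.connect(backup_path)
src.backup(dst)                           # take the backup
dst.close()

restored = sqlite3.connect(backup_path)   # the drill: actually restore it
rows = restored.execute("SELECT count(*) FROM users").fetchone()[0]
assert rows == 2, "restore drill failed: backup is not usable"
```

The point isn't the tooling — it's that the verification runs on a schedule, so a dud backup is discovered before you need it.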

Yep. I once worked at a company that had backups of all the data. Until they needed a backup and discovered they were not readable. Months of work were gone.

This has happened to me.

Long story. Incompetent (and dishonest) IT person. In fact, I was attacked for “harassing” the IT person.

Nothing was done about it, until the HR DB got borked, and there was no backup.

DR and backup are profoundly unpopular topics. They tend to be expensive, and difficult to test. They also presuppose a Very Bad Thing happening, which no one wants to think about.

I have scars.

I now have multiple layers of backup, and keep an eye on them. I can’t always test every aspect, and have to have faith in cloud providers, but I make sure that, even if we have a tornado, I would still be able to recover the important bits.

there's a joke in the data protection industry.

Backups always work, it's the restores that have problems

Would you consider it evil to use production backups for the QA step in CI/CD? It would kill two birds with one stone: continuously verifying the backups, and ensuring the new code works on real-world data.

We don't have any personal information in our production database, but even if we did, as long as the QA environment is thoroughly prevented from interacting with the outside world, it can't hurt to use production data, right?

Big risk. If your privacy-sensitive information is stored on infra other than the main servers, a path to get that information to your QA servers should not even exist.

Many more people who should not have access to that data, in bulk or otherwise, suddenly have access, which does an end run around all your - hopefully - carefully crafted security.

Keep in mind that what you think isn't privacy sensitive may very well be radioactive when combined with other data.

If you have a procedure for deriving QA data from production, change it to derive from the backup.

Using it directly is full of dangers, not only of leaking information, but also of corrupting your backups. (Otherwise, why are you testing anything?) And deriving test data from production looks like a good thing to me, but make sure to restrict the access to the test environment and mask your data.

You should not use production data for QA, period. Use generated data.

If you really must use real data make sure that (1) you have explicit direction to do this from those higher up in the food chain so that if there ever is a breach of your test systems you don't end up holding the bag and (2) anonymize a copy of the data before it leaves the production system. That way you minimize the risk. But better: simply don't do it, generate your test data.

Make sure you restrict access to the production environment, too. I once worked at a company where the "dev" office was VPN'd into production at all times. We actually didn't have a QA system for most of the time I was there. Anyway, a developer ran his server with the prod config for a couple of days before anyone noticed that requests were being processed elsewhere.

I think it is a big issue to use anything containing PII in anything other than the production environment. For example, do your QA logs have the same kind of management, access control and lifetime as production? If they don't, you might end up logging PII in a way you should not, and could even land you in legal trouble for things like GDPR.

Google goes over this scenario in their SRE books. You literally plan to take down your service so you never have 100% uptime, and literally step through a real failure and recovery.

In that process you would go, "Ok, the service is failed and I don't know why. How does the monitoring look?", and then you'd notice it was broken. This is best performed by the person on the team with the least experience, so they ask the most questions, which reveals more.

What's really funny is when the recovery you expected to work doesn't work, and then you have bigger problems... But at least you planned for it :)

What is, according to you, and the SRE book, a good time during the day/week/year to do these real test outages? How much downtime could be ok?

For, let's say, a b2b saas

They notify the customers first? Like, "we'll a little bit sabotage our, well, your, servers this weekend, to find out if they fail and shutdown completely and cannot start again" :-)

It doesn't have to be a customer impact to bring down a service or parts of a service (if it's designed correctly). But even if it was a customer impact, you can do it at the same time you'd schedule regular maintenance (there's always something that needs maintenance, eventually) and throw up a maintenance page.

It's up to the business to define SLAs and SLOs that the customer will be satisfied with.

> do it at the same time you'd schedule regular maintenance ... SLA ... SLO

I'll do that (some day), thanks

If you use something like CloudWatch, you can set an alarm for “insufficient data” for when a monitor stops reporting.

I don't think it would have been enough for this exact situation. The solution you are providing is for cases where x==nil but in their case there was never an x, so you cannot do x==nil checks unless you deliberately set up x.

Not quite. Cloudwatch provides for both behaviors. You can treat periods with no datapoints as “breaching” the threshold criteria. You can also independently alert for “insufficient data” where the source is unavailable, doesn't contain enough data to evaluate, etc.

In the past I did both. Always emit a 0 datapoint every period + treat missing datapoints as breaching, to discover if an application wasn't consistently emitting metrics. In addition, a lower-severity Insufficient Data alert was used to discover/validate when a metric stream literally didn't exist (normally through misconfiguration of metric & alarm dimensions).
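In alarm terms, the combination looks something like this (metric, namespace, and alarm names are hypothetical; the dict mirrors the parameters you'd pass to boto3's `cloudwatch.put_metric_alarm(**alarm)`):

```python
# Sketch of the belt-and-suspenders pattern above: the app emits a 0
# datapoint every period, and the alarm treats missing data as
# "breaching", so a silent metric stream itself fires the alarm.
# All names here are hypothetical.
alarm = {
    "AlarmName": "webapp-heartbeat-missing",
    "Namespace": "MyApp",
    "MetricName": "Heartbeat",        # app emits 0 every period
    "Statistic": "Sum",
    "Period": 60,                     # seconds per evaluation period
    "EvaluationPeriods": 3,
    "Threshold": 1,
    "ComparisonOperator": "LessThanThreshold",
    "TreatMissingData": "breaching",  # no datapoints => alarm fires
}
```

A second, lower-severity alarm on the INSUFFICIENT_DATA state covers the "metric never existed" misconfiguration case.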

Docs: https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitori...

Yep, and Grafana allows alerting on No Data states.

One approach is to have two different monitoring checks that could catch the same root cause.

In this case, one check could be for max age of actually deployed load balancer config, the other could be for new webapp instances getting traffic.

It's unlikely that both monitoring checks fail to trigger, and if only one of them triggers, one should investigate why the other didn't trigger.

Of course, that requires some decent amount of awareness from the on-call engineer (and/or an explicit step for that in the playbook); far from perfect.
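As a sketch of the idea (the check names and threshold are illustrative, not Slack's actual monitors):

```python
# Two independent checks that should both catch the same root cause:
# a stale load balancer config and new instances not receiving traffic.
# If exactly one fires, the sibling monitor itself may be broken, so
# raise a third alert prompting a human to investigate it.
def evaluate(config_age_seconds, new_instances_receiving_traffic,
             max_config_age=300):
    alerts = []
    if config_age_seconds > max_config_age:
        alerts.append("stale-lb-config")
    if not new_instances_receiving_traffic:
        alerts.append("new-instances-no-traffic")
    if len(alerts) == 1:
        alerts.append("investigate-sibling-monitor")
    return alerts
```

Encoding the "why didn't the other one trigger?" step as its own alert takes some of the burden off the on-call engineer's awareness.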

Chaos engineering.

You have to randomly simulate failure cases of all shapes and sizes to be prepared for failure cases of all shapes and sizes.

You don't know if your plan will survive contact until it's survived contact.


> Who watches the watchers?

You would need multiple watchers who watch one another. Of course, if all watchers die at the same time, you're out of luck.

This is only part of the solution though, because in order to write and test effective alerting rules, you need to repeatedly and frequently test them with either real or simulated failures, and it can get expensive.

We use Google Cloud Alerting to watch that our alerting works.

I would have used some Heartbeatservice but thats not feasible in our environment.

This is what end to end and integration testing is for.

You would set up an integration test to trigger the state that results in this alert.

haha, curious what the actual context behind that statement is. One thing I'd ask them is: does anyone on that team understand how that system works? Also, what's the reason it was broken?

> It’s worth noting that HAProxy can integrate with Consul’s DNS interface, but this adds lag due to the DNS TTL, it limits the ability to use Consul tags, and managing very large DNS responses often seems to lead to hitting painful edge-cases and bugs.

I was surprised how they dismissed HAProxy integration with Consul using SRV DNS records. Can anyone confirm the problems they highlight?

It seems like their service that broke would not be needed if they went the DNS route.

Pre 2.0 there were a few bugs with SRV discovery, maybe they adopted early and got bit? Just an anecdote but we've been using it since 1.9 without issue. Massively different scales though.

Pre-k8s and before SRV support we used consul-template in prod as well, but it always scared me; seemed like too many moving pieces for what should've been a simple system.

I asked internally and figured out the gotcha that bit us: the default DNS payload size is 512 bytes, which is enough for a few backend hosts but for sure not 12 or 30. The limit is 8 KB, which probably wouldn't work for whatever Slack is doing.


Because DNS records come back in random order for each response, those truncated dns responses caused the backend slots to constantly rotate between different pod instances. Haproxy was graceful about the rotations, but it showed up as suddenly very strange latency / perf numbers when a backend was scaled up to say 10 instances from the normal 3
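Back-of-the-envelope on why 512 bytes truncates so aggressively (the per-record size is an assumption; real answers vary with name lengths and compression):

```python
# Rough estimate of how many SRV records fit in one DNS response.
# All byte counts are assumptions for illustration.
HEADER = 12        # fixed DNS header
QUESTION = 40      # one question section, medium-length name
PER_RECORD = 50    # one SRV answer with a compressed target name

def records_fitting(payload_limit):
    return (payload_limit - HEADER - QUESTION) // PER_RECORD
```

Under these assumptions, `records_fitting(512)` lands in the single digits, matching "a few backend hosts but not 12 or 30", while the 8 KB EDNS0-extended limit fits well over a hundred, which is still a hard ceiling at Slack-like scale.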

Dynamic DNS support has only been in HAProxy since 1.8, maybe two years old give or take. Slack's infrastructure must be older than that, so they improvised.

There were a few bugs in the first implementations; it should be good now. Slack doesn't mention their scale, but I can imagine some UDP/DNS edge cases if there are hundreds of instances behind one domain.

Seems to me that most of the problems came from the sheer scale Slack operates at. A single instance of a self-hosted chat application wouldn't require any of the load balancing infrastructure. But SaaS is really convenient. I wonder if there's a market for "local cloud" companies that operate out of your city and offer hosting of popular open source projects. The complexity would be much lower, and hopefully the reliability higher. Plus you get the benefit of lower latency. 90% of the people in my team's slack channels live within 5 miles of each other.

A single instance of a self-hosted chat application would also result in unbounded hours investigating some obscure bug that affects 1 in 100,000 deployments, etc; while the developers don't know because they don't have enough ability to remotely dig in and investigate.

Self hosting is not a panacea and I would not think it would be more reliable.

There's also limited reason why you really want your servers in your city; 5ms of latency savings isn't it. Economies of scale in large datacentres with good network uplinks and centralized reliability teams matter more.

Autoscaling is hard. Never ever use one that you don't thoroughly understand.

An autoscaler that keeps chugging when deploys aren't green is outright dangerous.
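One way to make the scale-down path fail safe rather than dangerous is a guard along these lines (a sketch; the field names are illustrative, not any particular autoscaler's API):

```python
# Sketch: refuse a scale-down unless the instances that would survive
# are confirmed healthy AND actually receiving traffic. In the Slack
# incident, the stale LB config meant the replacements weren't taking
# traffic, so a guard like this would hold steady and page a human.
def safe_to_scale_down(instances, target_count):
    live = [i for i in instances
            if i["healthy"] and i["receiving_traffic"]]
    return len(live) >= target_count
```

The point is that "deploys aren't green" (or here, "new instances aren't serving") becomes a hard precondition for destructive actions, not just a dashboard signal.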

I think this one is more of a service discovery bug than auto-scaling.

Perhaps you know more than is in this blog post, but it sounds more like a rather standard load balancer.

Deployment broke. Yeah, that probably should have been caught. Even if it didn't, monitoring should have caught the stale load balancer config. It didn't, for some reason unknown to us.

These things happen. Things break. The autoscaler then proceeded to kill customer traffic. That was the part that worked as designed, so another design would have avoided escalating the situation (if you forgive some armchair engineering here).

Maybe what was missing was testing high load and autoscaling to 5x the traffic on, say, a holiday (when Slack's customers don't work)

> One of the incident’s effects was a significant scale-up of our main webapp tier.

Sorry, I’m not very familiar with the terminology here; what is the “main webapp tier”?

Apps these days might have several groups of services. For a simple case you might have a web tier serving http requests from customers, and you might have a Worker/background task tier. They usually scale independently.

The things that serve the Slack GUI in the browser.

Another insight for me was coding the original system to have N slots. This should’ve been a red flag—why is there an arbitrary constraint there? Why allow the system to have a fixed limit there?

If you choose to go down the slots road, then you need discovery and alerting for reaching slot limits, which means monitoring and tracking slot usage.

What are the differences between using HAProxy or Envoy and using the cloud load balancers of AWS or Google Cloud?

Cloud load balancers can be sneakily expensive. A few months ago, we spent a few weeks replacing an ELB with naive client-side load balancing via round robin, which saves us > 200k/year. ELBs charge per byte transmitted, which seems reasonable, but can end up really expensive.

They charge per byte and per request if I remember well, which can be really expensive for serving both small API call and large files.

Another limitation is that the ELB only works for AWS-to-AWS instances in the same location. Gotta use something else for geographic load balancing and for other datacenters.

> client side load balancing

In the browser? Or a mobile app?

They send 1 api req to server 1, then 1 to server 2 and so on? What about any session cookies maybe tied to a specific server?

Presumably round-robin DNS. A DNS response would only return a handful of servers, of which the client will itself only pick one at random for the duration of the session.

Now this approach has drawbacks (DNS responses are cached, and the DNS record picked initially by the client will typically be cached until the app/browser is restarted) but if they are acceptable to you then it's an easy, proven solution.

Hmm I'd guess they have DNS cnames like api1.x.com and api2 and 3, 4

And then the client picks one, and if that server is offline, picks another

Seems as simple as DNS based? And works with broken server(s)

Except you don't need that because you can just return all four IP addresses for one record, e.g. api.x.com
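For a non-browser client, the fallback over all resolved addresses is just a loop (a sketch; browser JS can't enumerate DNS answers itself, as discussed below):

```python
import socket

# Sketch of naive client-side failover: resolve every address behind
# one hostname and try each until a connection succeeds.
def connect_any(host, port, timeout=2.0):
    last_err = None
    for *_, addr in socket.getaddrinfo(host, port,
                                       type=socket.SOCK_STREAM):
        try:
            return socket.create_connection(addr[:2], timeout=timeout)
        except OSError as err:
            last_err = err  # this server is down; try the next one
    raise last_err or OSError(f"no addresses for {host!r}")
```
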

I think if such a DNS/IP-based round-robin server is down, or replies with a 500 error, the client won't try another server

Unless there's a way to get all ip addrs in js? By custom client code that queries the DNS system?

DNS resolution is handled by the DNS server and the browser. JS isn't involved, it's just telling the browser to connect to a certain hostname and the browser itself decides which IP to map it to (based on its DNS cache).

If the DNS server is down the website wouldn't load at all, but this is an acceptable trade-off considering DNS is a very simple system (not many things can go wrong) and servers can be redundant.

There's a misunderstanding. That reply is to me off topic, I knew about those DNS things already

Thanks anyway for replying

One of the reasons Envoy was built was because ELB/ALBs have opaque observability and fail in ways you can't control.

I've found that the cloud load balancers lag behind the state of the art in features and that their assumptions and configurations can be pretty brittle.

I haven't used Amazon's ALB, but with the legacy ELB, they can't speak ALPN. So that means, if you use their load balancer to terminate TLS, you can't use HTTP/2. Their automatic certificate renewal silently broke for us as well; whereas using cert-manager to renew Let's Encrypt certificates continues to work perfectly wherever I use it. (At the very least, cert-manager produces a lot of logs, and Envoy produces a lot of metrics. So when it does break, you know what to fix. With the ELB, we had to pray to the Amazon gods that someone would fix our stuff on Saturday morning when we noticed the breakage. They did! But I don't like the dependency.)

I have also used Amazon's Network Load Balancer with EKS. It interacts very weirdly. The IP address that the load balancer takes on changes with the pods that back the service. The way the changes happen is that the NLB updates a DNS record with a 5 minute TTL. So you have a worst case rollout latency of 5 minutes, and there is no mechanism in Kubernetes to keep the old pods alive until all cached DNS records have expired. The result is, by default, 5 minutes of downtime every time you update a deployment. Less than ideal! For that reason, I stuck with ELB pointing to Envoy that terminated TLS and did all complicated routing.

The ALB wouldn't have these problems. It's just a HTTP/2 server that you can configure using their proprietary and untestable language. It has some weak integration with the Kubernetes Ingress type, so in the simplest of simple cases you can avoid their configuration and use a generic thing. But Ingress misses a lot of things that you want to do with HTTP, so in my opinion it causes more problems than it solves. (The integration is weak too. You can serve your domain on Route 53, but if you add an Ingress rule for "foo.example.com", it's not going to create the DNS record for you. It's very minimum-viable-product. You will be writing custom code on top of it, or be doing a lot of manual work. All in all, going to scale to a large organization poorly unless you write a tool to manage it, in which case you might as well write a tool to configure Envoy or whatever.)

In general, I am exceedingly disappointed by Layer 3 load balancers. For someone that only serves HTTPS, it is completely pointless. You should be able to tell browsers, via DNS, where all of your backends are and what algorithm they should use to select one (if 503, try another one, if connect fails, try another one, etc.) But... browsers can't do that, so you have to pretend that you only have one IP address and make that IP address highly available. Google does very well with their Maglev-based VIPs. Amazon is much less impressive, with one IP address per AZ and a hope and a prayer that the browser does the right thing when one AZ blows up. Since AZs rarely blow up, you'll never really know what happens when it does. (Chrome handles it OK.)

I agree with you that we still have some ways to go with getting LB right, especially WRT to K8S. I think one of the problems is that it seems like every different app is a snowflake with different requirements, so all of these libraries try to be the jack of all trades, leaving the mastery to custom scripts (if it's even obtainable).

For instance: https://github.com/kubernetes-sigs/aws-alb-ingress-controlle...

Also, and you probably already know about this, but it's true that ingress won't create the record automatically for you - but external-dns ( https://github.com/kubernetes-sigs/external-dns ) will - with the correct annotations (pretty simple), external-dns will watch for changes to ingress and publish the dns records on R53 (and many other DNS providers) for you. It works really well for us, even when the subdomain is shared with other infrastructure not managed by itself.

AWS weak point is over-reliance on DNS. Someone needs to evangelize the superiority of non-ttl-based updates, such as K8s endpoint watches.

More control and pricing I guess.

Stupid question, and admittedly off topic:

What's with the "terrible, horrible, no-good, very bad" expression I see a lot? It's a reference to something? From googling, it seems to be this [1], but ... why? Why do people reference it?

Usually you reference some work like this because a) the phrase is unusually creative, or b) the work is unusually memorable. Neither is true here.

[1] https://en.wikipedia.org/wiki/Alexander_and_the_Terrible,_Ho...

This book sold well enough to score a TV series, a Disney movie, a musical, and a theater play.

How much more successful do you need it to be? Odds seem very high that a kid growing up in the past 50 years was exposed to this story and phrase

I wasn’t. Or if I was, it obviously didn’t make an impression, and properly so: combining random negative words isn’t creative.

It seems there's some kind of "all your base are belong to us" weird resonance for it. I had my own explanation why it works, but the only proof that's really needed is that it became a meme.

That's just, like, your opinion man.

I immediately recognized the reference. There's no law that says pop culture references need to be to something with X amount of popularity. Chill.

There is a general practice of preferring to repeat stuff that's good, though, or which enhances what it's added to.

A couple points. Just because you personally don't remember or know the book does not mean that it is not memorable. Also, it could be that those who weren't aware of the book used it after seeing it used elsewhere.

It's memorable in that it's a lighthearted way of describing a pretty disastrous event without using expletives, with at least a dash of cultural significance.

I agree it's a lame title.

Congratulations, you're one of the lucky 10,000!

Don't be a sourpuss because you were exposed to a new meme, that's how echo chambers form.

I wasn't exposed to a new meme, I've seen it a hundred times before; the question is why people are so dazzled by it that they keep repeating it when it's not really clever and it doesn't enhance what they're adding it to. It's just a clumsy, extended way of cobbling together negative words to say "very bad".

Those who have read it to insatiable toddlers --- sometimes 5 or 10 readings in a row --- get the reference.

I think they think it's cute and they're attempting to trivialize their loss.

Because sometimes things go wrong and you just want to move to Australia.

Pretty fantastic case study in the perils of complex systems. Is there any place where these types of post-mortems are collected? Could be a very valuable resource for systems engineers.

I dunno, it sounds more like they just got caught with their pants down not finishing a migration to some new shiny.

Also looks like they do blue-green but don't confirm the replacements are live before considering the greens the new blue.

Scalability is always such a spur of the moment implementation at a startup. This seems to be cruft left over from that startup phase. Would a scalability audit have caught it? Tough to say as Slack came from that build fast and break things era.

Scale fast and break things

Discussion during the outage: https://news.ycombinator.com/item?id=23161623

I bet if this story was told as an animation on YouTube it would be popular.

The title is an allusion to a popular children’s book[1]. I’m assuming that an automated algorithm pulled the “very;” hopefully the mods will consider restoring it.

[1] https://en.wikipedia.org/wiki/Alexander_and_the_Terrible,_Ho...

HN automatically mangles titles in various ways I’m not fond of, dropping words that might or might not be significant, fiddling with capitalisation, &c., but the submitter can go back and edit the title back to what it was supposed to be, and the one time I’ve done that the system didn’t mangulate it again.

That's by design, because the software is obviously imperfect. It does more good than harm, though, so we keep it.

Ok, we've re-veried the title above.

It was better than expected, IMO.

That was the book that made me want to move to Australia!

Or Timbuktu.

Well, did you end up moving to Australia?

This is one of the biggest arguments I see for serverless (AWS Lambda + DynamoDB) or at least managed PaaS systems (Google App Engine, Heroku with RDS or CloudSQL). These systems may seem to cost more for some workload curves (or might even be cheaper for your curve), but the difference is worth it because you're paying for specialized 24/7 dev-ops teams whose only job is to keep these systems running smoothly, and by definition they're already familiar with running workloads orders of magnitude bigger than yours. Even then the platforms might be cheaper because you're only paying for a fraction of the dev-ops team's salaries, but you get their full benefit.

I work for a PaaS.

- The ideal fit for any hosting PaaS is a company who has a large hosting and infra footprint but for whom the technology is _not_ the core competency of the business. Slack is very much better off running their own systems with their own people.

- As someone who deals with customers every day I can tell you that yes - we know our platform specifically and how the internet works generally better than almost 100% of our customers, but we do not know _your application_ at all.

- Many problems in The Cloud are the result of application developers not understanding that there are performance differences between localhost and The Cloud, specifically around IO, or that there are performance differences between the Cloud at 1x and The Cloud at 10x with everything. Systems can run smoothly and applications can still kill themselves because of the interaction between the two.

So this is where 12factor comes in - unless the application is operating at lower than layer 7 (and with some of the newer offerings even layer 3) there’s not much technology centric stuff going on. Everything Slack is doing is happening at the HTTP / gRPC / Websocket level, and it’s hard to make the case for self managed hosting.

The idea of requests going to an application server on a TCP or HTTP connection with the application server being able to access a database or datastore is now common enough for all the PaaS providers to have abstractions for it, with the load balancing and database/data stores being managed. If customers aren’t happy with the auto scaling logic that can be overridden, but it seems like everything else is pretty rock solid.

Slack’s USP isn’t to reinvent load balancing, distributed configuration or database management, so if I was running it and had a clean slate a PaaS seems like a better bet.

If I had to guess, I'd say that Slack's infrastructure footprint is probably as big if not bigger than any popular PaaS's.

Snapchat runs on App Engine (and other stuff runs on App Engine too) so App Engine is almost certainly bigger than Slack just based on that.

That seems... Questionable.

Question on how they measure - I'm in 7 slack workspaces. Do I count 7 times?

Well, I have to create a new account every time I join a new group, so I'd say yes.

If two of my workspaces are tied to the same enterprise account, they should count as the same, but who knows how they measure. My base assumption is "in a way giving nicer numbers"

Slack had 12m DAU at the end of last year.

I/O is especially problematic. I've seen people very, very confused about the totally predictable failures of their system components due to burst exhaustion on EBS.

>>"but we do not know _your application_ at all."

You're missing the point. A PaaS, or serverless service doesn't need to know your application. That's the whole point. They're just API calls, and they need to succeed with consistently low latency.

> consistently low latency

That's (a) just one objective and (b) too vaguely stated. What's an acceptable p95 latency for a FaaS? For a database write? For a message queue? The answer is a very big "it depends". And why are we only talking about latency?

There are tradeoffs literally everywhere you look in this space, and not knowing enough about the application's performance, reliability, and efficiency goals can sometimes be a real hindrance to being good at running that application.

> That's the whole point.

Until something falls over and you haven't deployed any new code and all the vendor's systems are green. Part of the reason you went with a PaaS in the first place is that you didn't want to manage the infra yourself, you just want to ship application code.

This has worked great so far and so you've stopped thinking about the infrastructure at all. Disk and network I/O still exist though, you've just been incentivized to stop thinking about them and so all you know is that the vendor sucks, when in reality it's the application's fault.

Somebody's gotta roll up their sleeves and see what's going on in there though, so hopefully your PaaS vendor is as cool as we are :)

You're telling someone who works for a PaaS company what a PaaS does and doesn't do. I think maybe they know.

More broadly, almost any application these days is "just API calls". The issue is which API calls, how many, with what frequency and where. That essentially is the application, and a PaaS employee doesn't magically know that stuff instinctively.

You can't assume that; these costs, specifically for the serverless stack at AWS, scale up quite opaquely. You can put billing alerts in place, but once you tie your infrastructure to one specific serverless vendor, even if you identify a harmful cost scale, you can't easily mitigate it.

I am not saying serverless is expensive, all I am advocating is extensive planning and preparing before adopting any particular serverless solution. Once you give the green light to adopt a specific cloud solution, you tie yourself to that cloud, and that can turn out to be a bad idea in the long run.

Lambda is a great tool when used right. It can take big workloads without costing too much. But in itself it does very little. It still needs to integrate with something to be triggered, and if you go with AWS API Gateway that can cost you a pretty penny. The load balancer also incurs costs that are difficult to predict and is not as flexible as other load-balancing tools, so sometimes you might still need to provision your own service discovery and load distribution layers, just like they had to do at Slack.

Serverless is nice, solves a lot of issues, and gives smaller teams a shot they otherwise wouldn't have if they had to manage everything by themselves. But cloud costs can be opaque, cloud implementation can be very complex, cloud solutions can be too rigid sometimes, and tying your product to one vendor can be detrimental in the long run.

Not just costs! AWS services are opaque performance minefields, too.

If you need reliable low latency -- which is probably core to the slack experience, or any GUI in a competitive space, really -- lambda is not a good option. They just don't share your priorities. AWS support will happily waste a lot of your time chasing Just One More Trick to mitigate the problem, though.

Lock-in is one of the worst reasons to avoid serverless. If you're really putting in the effort to avoid lock-in, you're wasting engineering time that could be better spent on your product. AWS (or GCP, or whatever cloud provider you choose) has a number of really great products that ultimately save you time and help you get to market faster and better.

It doesn't matter that you're locked in, with slightly higher-than-wanted costs, if you've failed due to poor priorities.

Become successful, then worry about removing your lock-in if you actually need it (and you probably won't).

I'd disagree pretty strongly. You may evaluate the tradeoffs and decide to go all-in on a single cloud (and you should do that analysis!), but it's far from a given. You're not wasting engineering time if the costs get out of hand and bankrupt you. Managed services may accelerate your time to market, but they do so at the expense of lock-in and your bill.

Context: I work at a startup that benefits enormously by avoiding AWS/GCP (for most cases) and renting cheap dedicated servers. It is context-dependent; our exact business doesn't benefit much from managed services and really needs big servers.

Just because you're using managed services doesn't mean it's going to be considerably more expensive, when you consider labor costs, and general reliability.

Your service will likely be more reliable if you use DynamoDB or AuroraDB. Your service will probably be more reliable if you build it in a way that assumes nodes will die at any point, will automatically come back in, and can scale up/down. It'll likely be more reliable if you use SQS rather than your own message bus (and let's be honest, it'll probably be cheaper too).

Yes, you should always evaluate the costs, but reliability and the amount of time you're going to spend maintaining something is something that somehow always gets left out of these evaluations.

(Prices based on https://calculator.aws/#/createCalculator and https://www.hetzner.com/dedicated-rootserver?country=us)

An AuroraDB db.r5.xlarge with 10TB of storage, reserved instances 1Y term but no up-front, costs 1,301.40 USD per month.

Take a Hetzner AX161 with 4x3.84 TB SATA SSD, using RAIDZ for 11.52TB usable storage (and 4 times the RAM), at €297.00 per month... so 335.88 USD per month.

That's a difference of 965 $/mo = 11,580 $/yr. If you have 10 of these, they'll pay for a full time sysadmin. Now, that's leaving out a lot of details (bandwith costs and application hosting come to mind), and assumes truly massive databases. On the other hand, as that sysadmin, I promise our databases don't take anything like my full attention, and you really should have some sort of sysadmin/ops team anyways (please do not make devs run your AWS infrastructure; it will end in tears for everyone). Every time this argument comes up, people do mention reliability and time spent on maintenance, but... it's really not bad. Hardware doesn't actually fail that much, postgres isn't that complicated to configure, OS patches aren't that hard to apply. Your mileage will vary, but sometimes it's just not worth using AWS. (And sometimes, it really is; if we didn't need to run oversized databases, I'd push us to use AWS in a heartbeat)

100% this. Cloud vendor lock-in is absolutely a relevant factor for consideration -- especially when the non-cloud alternatives are ~10x cheaper.

At my company we made the decision to stick primarily with managed dedicated servers over AWS in our very early days. Now we're a decent size a few years later (25 employees) and the cost savings we're realizing are tremendous. We did the math and found that if we had gone with AWS in the early days then we would now conservatively be paying an extra $165,000 on our hosting bill annually.

We still use AWS for some specialized services (e.g., Lex) but the bulk of our stack runs on gear that we now colocate for a fraction of the cost.

You are forgetting to price in some minor features that Aurora provides:

- Aurora's storage is spread across three availability zones.
- Backups.
- Automatic failover.
- No need to configure anything, it just works.

If your time is free, and you don't actually need anything resembling high availability for the data in the database, then that's a good price comparison. I'm not arguing that managed databases makes sense for everybody, but if you're doing a price comparison then at least factor in multi-site redundancy for the data?

> You are forgetting to price in some minor features that Aurora provides

That's true and fair, although in both directions; skimming the docs it looks like aurora prices include 2 replicas? But backups aren't free (to store), bandwidth isn't free, and iops aren't free. Also, my difficulty in figuring out a fair pricing comparison highlights another point: a dedicated server has a fixed price. Other than more servers for more instances/replicas, you're never going to pay more, and even then it's a simple "adding another replica will increase our costs to X*(N+1) per month", not a "scaling out will add X to our costs, but if we use more I/O than expected we'll add Y to our costs, and exporting data will cost Z in bandwidth".

This matches my experiences entirely. By running our own hardware (mix of Hetzner and colocation) we go from 20% of our revenue spent on server costs to less than 5%. Well worth any potential headaches, which are almost always overstated.

Yeah but why pay someone to look after databases when I could get them do something else that'll have way more value. Using managed services just reduces how much money I have to spend upfront.

It depends™ on your (human and computer) workloads, but in my experience you need ops people to keep everything working smoothly, and once you have them it's easy to have them do this stuff without taking away from other areas. YMMV.

Getting to market faster is fine for an early startup, less so for a public company which needs to justify costs to its shareholders.

Not in the scenario I described. You implemented a solution using lambda and api gateway. Hundreds of microservices later you may realise you could save considerable money going unmanaged. By that time your lock in is costing you an unhealthy amount of money and there is no easy roadmap to mitigate it.

Again, everything must be planned beforehand. Savings could also be marginal, but it could also be significant. In big enterprises, where billing is north of a couple of millions of dollars, every percentile you can save is justifiable.

Not every company is a startup, dude.

> because you're paying for specialized 24/7 dev-ops teams whose only job is to keep these systems running smoothly, and by definition they're already familiar with running workloads orders of magnitude bigger than yours

This is based on faith — there might, or might not be a specialized 24/7 devops team who runs these things better than you.

My rational mind has trouble accepting things based on faith, which is also why I don't trust RDS: I don't know of any way to run a distributed SQL database without data loss (neither does Jepsen), so why would I expect RDS to do this correctly?

Using those services does provide a warm and fuzzy feeling, though.

The specialized 24/7 devops team (if it's there) also has a few thousand other customers instead of being there just for you. They might have other priorities at any given moment. It's not like AWS or GCP are renowned for the quality of their customer service.

This is actually a plus point if you ask me. These few thousand customers are all operating on the same racks as me - in an environment like AWS or GCP there's no special part of the datacenter reserved for fancy customers. The billion-dollar customers all run their VMs on the same rack as me. So whatever work the ops teams do to keep things reliable benefits me as much as the biggest customers.

In my experience the ops team will be constantly changing things and upgrading to keep all those customers happy, and they don't really care about your personal risk tolerance. They probably screw up a smaller proportion of the changes, but they make an order of magnitude more, and you can't predict what impact it will eventually have.

Slack's problem though had nothing to do with racks going down.

I'm not convinced Amazon's team is immune from the sort of complex failure mode described here. I'll bet there's people with equivalent sorts of stories about edge cases in service interactions (either their own set of Lambda services or the AWS ones behind them, or more likely both) leading to a similar unexpected failure cascade.

> I'm not convinced Amazon's team is immune from the sort of complex failure mode

You're being way too kind. Not only is AWS not immune, their autoscalers are often absurdly primitive. Like, hourly cron job doubling / halving within narrow safety rails primitive, where it's not merely possible to find a load that trips it up, it's all but inevitable.

This varies by service, but they always project an image of their infrastructure being rather smart, and in the cases where I've been able to make an informed guess about what's actually going on, it's usually wildly inconsistent with the marketing. They don't warn you about the stinkers and even on services with good autoscaling and no true incompatibility between AWS's hidden choices and your needs, your scaling journey will involve periodic downtime as you trip over hidden built in limits and have to beg support to raise them. Sometimes you get curiously high resistance to this, leading to the impression that these aren't so much "safety rails" as hardcoded choices.
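The "hourly cron job doubling / halving within narrow safety rails" pattern described above can be sketched in a few lines. This is a hypothetical illustration of the claimed behavior, not actual AWS code; all names and thresholds are made up:

```python
# Hedged sketch of a naive hourly autoscaler: double on high utilization,
# halve on low utilization, clamped to hard "safety rail" limits.
def next_capacity(current, utilization, lo=2, hi=64,
                  scale_up_at=0.8, scale_down_at=0.3):
    """Called once an hour with current instance count and avg utilization."""
    if utilization >= scale_up_at:
        desired = current * 2
    elif utilization <= scale_down_at:
        desired = current // 2
    else:
        desired = current
    return max(lo, min(hi, desired))
```

Any load that grows faster than hourly doubling can follow, or that pins you against the `hi` rail, trips this up by construction, which is the point being made above.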

Oh, and just last week we managed to completely wedge a service. The combination of a low limit on in-flight processes, two hung processes, immutability on running processes, and delete functionality being predicated on proper termination led to a situation where an AWS service became completely unusable for days while we begged support to log in and clear the hung process. Naturally, this isn't going to count as downtime on any reliability charts, even though it's a known problem and definitely looked a lot like downtime on our end.

We're a small (<10) team with modest needs. AWS lets us do some crazy awesome things, but it really bugs me how reliably they over-promise and under-deliver.

> You're being way too kind. Not only is AWS not immune, their autoscalers are often absurdly primitive.

Yep. Very much so. Mostly because I don't have enough personal Lambda-specific warstories to feel confident badmouthing it in the context of this discussion thread. But the bits of AWS I do use are certainly not all rainbows and roses...

I have one app/platform I run that basically sits at a few requests an hour for 11 months of the year, then ramps up to well over 100,000 requests a minute between 8am and 11pm for 14 days. Classic ELB (back in the day) needed quite a lot of preemptive poking and fake generated traffic to be able to ramp up capacity fast enough for the beginning of each day (aELB is somewhat better but still needs juggling). We never even got close to getting autoscaling working nicely on the web and app server plane to let it loose in prod with real credit card billing at risk, we just add significantly over provisioned spot instances for our best estimates of yearly growth (and app behaviour changes) for the two weeks instead, and cautiously babysit things for the duration.

It's nice we can do that. It'd be nicer if I didn't have to keep explaining to suits and C*Os why they can't boast on the golf course that they have an autoscaling backend...
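The workaround described above (over-provisioning a fixed capacity for a known busy window rather than trusting reactive autoscaling) can be sketched roughly like this; the dates, hours, and instance counts are invented for illustration:

```python
# Sketch: schedule-based capacity for a predictable annual spike,
# instead of reactive autoscaling that can't ramp up fast enough.
from datetime import datetime, time, date

# Hypothetical values: a 14-day event with traffic from 08:00 to 23:00.
EVENT_START = date(2024, 11, 1)
EVENT_END = date(2024, 11, 14)
BASELINE, PEAK = 2, 120  # instance counts; PEAK deliberately over-provisioned

def desired_instances(now: datetime) -> int:
    """Return the pre-planned instance count for this point in time."""
    in_event = EVENT_START <= now.date() <= EVENT_END
    in_window = time(8, 0) <= now.time() <= time(23, 0)
    return PEAK if (in_event and in_window) else BASELINE
```

The trade-off is exactly the one the commenter names: you pay for idle headroom during the window, in exchange for not betting real billing traffic on an autoscaler's reaction time.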

Hey, let's not be hasty, it sounds like you've built a bespoke autoscaling backend that intelligently predicts future usage and dynamically allocates compute resources to match customer needs.

AWS is pretty good whenever we've needed them. Google? Probably not.

Google swears gcloud support is good, of course, but I've never actually used it -- but then again, I've never actually needed it.

Meanwhile, I need AWS support constantly because their entire platform is a gigantic social experiment in minimum viable products. How crusty are people willing to tolerate? Evidently: very, very crusty.

We don't use most of the platform but the stuff we do use is rock solid (VPC, EC2, RDS, Route53, S3 etc). I do sometimes wonder what the point is for half the stuff they release.

What I do know about GCP is they had a production bug in their tooling that was breaking everything for literally 100s of customers and they never even bothered replying to the bug report on their support forums. That experience along with the general modus operandi of shutting things down that don't further their surveillance capitalism business model means I won't be trying them again.

> I do sometimes wonder what the point is for half the stuff they release.

I'm convinced they mvp every possible idea because it makes their platform more sticky. The more services you use, the harder it is to leave them for something better. The problem with that is you get 100 half dead zombie services and it's really hard to know which services are actually supported and which aren't.

It's the Amazon equivalent of Google creating 10 different messenger applications only Amazon never kills an old service, they just let it rot forever.

It almost seems to me (as an outsider) that Amazon has a policy that any tool they build internally or for a specific customer also has to be offered to the public on AWS. Like their managed satellite service or any number of things that make you say “I’ll bet some critical AWS thing was the first customer of this hyper-specific product”.

AFAIK this is pretty spot on. I don't know if it's "policy" or what, but based on talking with AWS account managers, AWS does not often (or ever) create custom AWS services/products for individual customers. They do however let customers submit feature requests, and then let customers lean heavily on AWS to implement those feature requests in the public AWS services.

I think this is what leads to a lot of MVP-type services. A large customer clamors for some individual niche feature, AWS implements a MVP version of it, and then the team moves on to the next feature request (which might be for a completely different service, leaving the MVP in a perpetual MVP state).

Vice versa for us. Google always has troubles and outages and AWS worked fine. Not sure why.

I launched a large AAA game backed by a hybrid (GCE+Physical) infrastructure and I would swear by their support.

Yes, we paid for enterprise support, but it's especially good. They even contributed code to third-party open source projects to solve one of our bugs.

I hear often about google's support being terrible, but the enterprise support on google cloud is definitely an exception.

F.D: Aside from giving a talk at google stockholm once, I am not affiliated in any way.

Could you describe what you mean by a distributed SQL database?

I don’t think RDS would generally fit that concept as I understand it. Aurora’s data store possibly but you choose to use that specifically.

Though they do have good availability, they are certainly not infallible and have been down for hours or even days. And, when that happens, all you can do is pray they get to it soon. You have no control over when they will make riskier changes or how fast they will be able to respond. If they fail to respect the SLA, the maximum they are going to do is giving your money back for your services.

Obviously, there are good aspects of outsourcing devops/admin work. It's a tradeoff, as most things. If you are a struggling startup it's difficult to justify the cost of owning a bunch of hardware and hiring expensive infra people to manage it. However, Facebook is probably better off owning their own infrastructure.

This is an underrated point. The problem with AWS and those massive services is that they are "alive" and continuously evolving. In most cases it's fine and you don't notice, but when it goes wrong it can affect you and you have no control over when that is likely to happen.

In contrast with in-house infrastructure, you can make your stack as simple or as complex as you'd like depending on your needs (a lot of projects can get away with a handful of physical machines all configured manually, no Terraform/Kubernetes/etc) and you control when you make drastic changes that risk breaking things so you can plan them during a time when downtime would be the least damaging to your business.

FWIW, that time AWS had a massive massive failure it was actually really convenient being on AWS as so much of the Internet was down at the same time that pretty much no one was upset at me for my one service also being offline.

I think it depends on your competitors and what your software does. If your competitors are still running their services but not you, you're bound to lose some customers. Also, regardless of that, you can easily be losing money while the service is down. For instance, if you have a food delivery app and it's down, while your competitor is up, people will just use the other app and you're going to lose money while AWS is down. Even worse, they may actually end up liking the competitor's app better, and you can even lose customers that way.

However, if you're looking from the point of view of an employee trying to justify to your boss why the service is down. I'm sure they'll be more understanding that's not your fault in this case. It's a risk they decided to take.

What? You can totally run disturbed SQL without data loss. Did you mean to imply something in particular?

CAP theorem strongly suggests if you optimise against data loss you have to trade off something else which is probably important too...

'Disturbed' SQL confused me, had to look up the thread to see Distributed SQL. And yes, I agree, you certainly can run distributed without data loss, depending on the platform design and your 'transaction level'.

You think you can run something like Slack without intimate knowledge of all the moving parts?

The only thing that 24/7 crack devops team ensures in a situation like this is that you continue to generate billable workload, even if it's just millions of little serverless pods spinning in a circle from some configuration mishap.

> you get their full benefit

and their full downside.

what, pray tell, downside?

they operate at a scale (since you mentioned it) much larger than yours. you don't benefit from scale beyond what you need, you "only" benefit from the SLA.

you don't get to tell them what to do, to set their priorities on features vs bugs vs performance vs meaningless metric of the day.

you don't get to interact consistently with the same person or same set of staff, to understand their foibles and to nudge effectively.

you don't get to decide what features are critical to you and cannot, ever, ever be cut no matter how otherwise impactful they are on the environment.

you don't get to set the timetable for "events".

it's not the absolute no-brainer you are making it out to be.

that said, i agree that the value is solidly there for those in the fat part of the bell curve.

CloudSQL just took down our DB for the hell of it once. Must've cycled the node or something.

Replace "serverless" with "managed service" and I agree. But serverless doesn't give you any magical reliability improvement over anything else, because your NFRs dictate the implementation that will produce the most reliable product.

The design of a wooden table will inform what tools will produce the best version of that table. It may not be the newest power tool; it may end up being chisels, hand planes, winding sticks, a kerf saw and a mallet. If somebody maintains your tools for you they'll stay reliably sharp, but that doesn't lead to a good table unless you pick the right tools and use them the right way.

I prefer to refer to NFRs as Operational Requirements. Cloud vendors are entirely in the job of satisfying operational requirements.

I’ve anecdotally experienced success doing this; the only downside is cost. In many cases you end up throwing money/hardware at a problem, but having used PaaS for high scale apps where money was no issue was like life on easy mode.

Why was the “very” removed? It sounds so much better like that.

After I used Discord in different contexts for months now (and Slack for years), I can't understand why someone willingly chooses Slack.

It's the Atlassian of chat tools. Horrible performance and bad usability.

As someone who uses Discord, I can tell you it goes down too. My favorite part is looking at their status page, seeing that API response time is exceedingly high, and getting no updates from the team about whether or not they're fixing it.

I'm not saying Discord is perfect or "always up", it's just that their client UI is better structured and more responsive in most cases.

I am on multiple OSS Discord servers with thousands of users, and it works just fine most of the time.

I am on multiple Slack servers with just 10-20 users and it is unbearably slow.

I admined a multi-hundred user Slack org and now am part of a multi-thousand user slack org and I've never experienced (or heard about) the slowness that you're describing. Do you have a feel for whether it was client-side or server-side?

More client-side.

Server-side scaling was only an issue with OSS communities and for hackathons, where >50k people were online.

I think Discord's Achilles' heel for enterprise is its UI unfortunately.

There's no way that my company would adopt a platform so "fun" in the way Discord tries to be. Animated characters, a logo which looks like a gamepad, etc...

It sucks, because I use Slack for work and mostly Discord for personal use (mostly dev communities for different companies), and Discord is far and away a better experience. If Discord provided a "cleaned-up" offering for orgs like mine (finance), I think it would be an instant hit.

I log on to Slack at work and see pages of emoji, animated gifs, and animated reactions. What’s the difference?

Discord is actually actively trying to become less gamer-focused, but it's unclear if this will make them as enterprise as Slack or Teams is.


Yeah the performance of Discord is much better but it looks and feels like a product for gamers.

Slack has a more mainstream look and feel which is probably one of the reasons it's preferred by companies.

What are some open source discord servers you are a part of? I would like to know about some open source communities if you do not mind.

I'm on the React and Rust servers.

I use Zulip for the day to day (it's amazing, I can't recommend it enough), but sometimes use Slack because some open source communities use it, and I'm always amazed at how damn slow it is. I can consistently out-type it, it's terrible.

I guess it was great when it started out, but they're slowly boiling the frog, who is us.

Zulip won my last bake-off for chat systems. Integration was easy and the topic method of providing threads was amazing. The only feature it was missing was federation. In the XMPP world, you could communicate with users on other XMPP instances. With Zulip, you can only communicate with local users. Do you know if this is still the case?

As other folks have mentioned, Zulip has a number of cross-server integrations with both the Zulip protocol and other protocols like XMPP. There's a few we document here as well as Matterbridge:

* https://zulipchat.com/integrations/communication * https://github.com/42wim/matterbridge

We'll eventually add a more native way of connecting a stream between two Zulip servers; we just want to be sure we do that right; federation done sloppily is asking for a lot of spam/abuse problems down the line.

(I'm the Zulip lead developer)

Zulip has a cross server bridge using a bot that you might be interested in:


Yes, I doubt federation will ever be added.

That's one thing that makes me quite sad, to see open source communities adopt a closed chat tool. I know that it's easy and zero-maintenance for them, but I can't help but feel there are better options.

There certainly are, Zulip is OSS and so is Matrix.

Well, Hipchat would be the literal Atlassian of chat tools.

Ah yes, they have their own thing now :D

They've actually been shutting it down for a while after Slack bought it from them to do just that.

HipChat was sold from Atlassian to Slack.

Oh .... okay?

Because it's the Atlassian of chat tools. It's the lowest common denominator.

Discord also doesn't let you set message retention that I'm aware of which is an immediate nonstarter.

End up in one lawsuit where the other party demands a fishing expedition and you'll be real happy that you've got retention limited to 90 days by policy and in practice.

If it's legal to set this to 90 days, why can't a firm set it to one or zero days?

You could, and I'm sure some places have a business case for this.

Most need stuff going back at least a little bit. If I was talking to you about something on Friday and wanted to reference the conversation Monday I'd be real mad if the convo was already deleted.

I think I'm confused... there are complaints above that Discord doesn't save transcripts. That seems equivalent to zero-day retention? I haven't used Discord so I'm probably misunderstanding something.

If I scroll back in discord I can go back to the first message I ever sent to the servers I hang out on.

In slack you can configure a retention period for files, messages, or both. Anything older than that gets wiped.

Saved chat history is the big one. Discord is not suitable for business comms without it.

Other than that:

- shared channels between workspaces

- threads

- private messages within a workspace, instead of globally

- decoupled accounts from workspaces, so I can use my personal email and work email associated slack workspaces at the same time

- much tighter integration with third party tools eg zoom, webex, etc

Discord is great for casual chats with friends or open source communities. Insufficient for business.

How come? Me phoning or talking with a customer/co-worker directly is also "suitable for business".

I think this is why we often get that "this call may be recorded for training purposes" message while waiting in the call center queue? Even if they're not actually keeping all those recordings, they can pass the legal test by saying that they do.

Chat history is pretty important for your HR and general counsel if there's instances of misconduct or abuse. Discord doesn't support the same level of chat history/logging that businesses (of all sizes) care about.

I know.

I just don't understand why it's important for text based communication, but not voice.

Threaded comments. I tried using Discord with a group of 4 people for real job and we missed threaded comments.

Also the ability to draw on screen while screen sharing. So simple yet so useful.


I personally didn't like threads and this is the first time someone mentions them favorably.

Funny that I found it easy to see your comment in HackerNews because it was separated in ... a thread!

Imagine how HackerNews, Reddit or any other discussion board would be without threading. Now, if you see a #general Slack room of a 100 person company, you'll quickly see that it would be a mess without threading. That's what happens in Discord.

Now that even iMessage supports threaded discussions (in iOS 14), I think it would be great if Discord added this too. They’re very nice when there are multiple overlapping conversations and you don’t want to break off into two separate channels.

Since you’re a skeptic, I’d love to hear your thoughts on my reply to xtracto!

I missed them in discord just yesterday

Shameless plug for my side project: XpanXn.com

The central value proposition is its reimagined UI, making threaded replies more visible and more interactive by separating conversation topics (threads) into different columns.

It’s currently just an MVP, but I expect to at least improve the UI and maybe rearchitect the system in the near future.

Any feedback is greatly appreciated!

We have been using self-hosted Mattermost for years and I can highly recommend it. It never crashes, performance is great, and upgrading it is very easy to do.

The big requirements for enterprise use are: single-sign on, compliance with regulations (e.g. a multinational company needs to know that you are in compliance with the laws around data retention, data locale, etc. for their own data, in every country that it operates in), and API support for eDiscovery.

Because Discord is very gaming oriented. But yes, Discord is great.

Actually, they’re pivoting to be a social platform for non-gamers, too. Just yesterday I got a notification that they’re pushing for more non-gamers to create their communities on Discord.

Oh no poor HipChat. (Atlassian's chat tool)
