
A terrible, horrible, no-good, very bad day at Slack - ceohockey60
https://slack.engineering/a-terrible-horrible-no-good-very-bad-day-at-slack-dfe05b485f82
======
m12k
TL;DR: First, a performance bug was caught during rollout and rolled back within a few minutes. However, the incident triggered their web-app auto-scaling to ramp up to more instances than a hard limit they had. This in turn triggered a bug in how they update the list of hosts in their load balancer, so the list stopped being updated with new instances and eventually went stale. After 8 hours, the only live instances remaining in the list were the few oldest ones, so when they then scaled the number of instances back down, these old instances were the first to be shut down, causing the outage: the instances that should have taken over were not listed in the stale load-balancer host list.

~~~
vondur
It almost seems like no one group understands the system as a whole, so when one part fails, no one has a clear idea of the domino effects that can follow. I'm guessing this is the result of really complex system interactions.

~~~
Nextgrid
Which is what always happens when a company aims to build an "engineering playground" full of microservices and other moving parts without a clear technical justification or any weighing of the pros & cons. It's why I personally don't like working on such projects: it makes me feel uneasy not to have a good understanding of the entire system.

To be fair to Slack, at their scale, lots of moving parts might make sense,
but I see a lot of companies (including startups with very few customers)
going down the microservices route and exposing themselves to such a risk when
there is no major upside beyond giving engineers lots of toys to play with and
slapping the "microservices" and related buzzwords on their careers page.

~~~
bstar77
I think you are drawing the wrong conclusions here... Microservices are not the boogeyman. This more likely has to do with speed of development, developer turnover, and a plethora of other things that result in insufficient knowledge transfer.

Microservices (like just about anything) can be implemented well or poorly. There's a reason we have sophisticated orchestration solutions like Kubernetes... they exist to tame large-scale deployments with sensible failover processes.

The benefits you get are services that can be scaled independently, deployments that only affect isolated pieces of code, horizontal scaling, Dockerized environments, etc. All of these advantages should exist in well-designed systems, but systems that have been executed hastily will likely have critical problems crop up at some point.

~~~
Wheaties466
I like your statement, but I don't think it has to do with the speed of development either.

In a monolithic architecture, the devs that deal with it have to deal with the program as a whole. So if something doesn't work, it's their problem. Whereas in a microservice architecture it can be easy to spin up a service and not know the systems that integrate with it.

The problem here is with documentation and understanding of the architecture. It's just the nature of the beast that the monolith dev knows how things communicate with the monolithic program, because he needs to know in order to do his job. In this instance the problem isn't with microservices, it's with the execution. And that execution is a very easy trap to fall into with microservices.

~~~
thebean11
The same thing can happen in a monolith. Make a change, run tests, and make
sure your new feature works, while breaking some untested behavior you didn't
know about in some other part of the monolith.

------
matiasfernandez
"I'm still not understanding why it's so hard to display the birthday date on
the settings page. Why can't we get this done this quarter?"

Look, I'm sorry, we've been over this. It's the design of our back-end. First there's this thing called the Bingo service. See, Bingo knows everyone's name-o, so we get the user's ID out of there. And from Bingo, we can call Papaya and MBS (Magic Baby Service) to take that user ID and turn it into a user session token. We can validate those with LNMOP. And then once we have that, we can finally pull the user's info down from Raccoon.

~~~
fireflux_
Reference:
[https://www.youtube.com/watch?v=y8OnoxKotPQ](https://www.youtube.com/watch?v=y8OnoxKotPQ)

I revisit this video every now and then.

~~~
celim307
This one always cuts too close to home

[https://youtu.be/_o7qjN3KF8U](https://youtu.be/_o7qjN3KF8U)

~~~
matiasfernandez
Can't wait for BallmerCon this year. The XLOOKUP meta is going to spice things
up.

[https://www.youtube.com/watch?v=xubbVvKbUfY](https://www.youtube.com/watch?v=xubbVvKbUfY)

------
jrockway
I had trouble getting through this article because my internal monologue was
screaming "Envoy and xDS wouldn't have this problem". But that's exactly what
they decided ;) HAProxy is a little behind the state of the art on "hey I
could just ask some server where the backends are", and it shows in this case.
(The "slots" are particularly alarming, as is having to restart when backends
come and go.)

xDS lets you give your frontend proxy a complete view of your whole system --
where the other proxies are (region/AZ/machine) and where the backends are,
and how many of those. It can then make very good load balancing decisions --
preferring backends in the AZ that the frontend proxy is in, but intelligently
spilling over if some other AZ is missing a frontend proxy or has fewer
backends. And it emits metrics for each load balancing decision, so you can
detect problems or unexpected balancing decisions before it results in an
outage.
[https://www.envoyproxy.io/docs/envoy/latest/intro/arch_overv...](https://www.envoyproxy.io/docs/envoy/latest/intro/arch_overview/upstream/load_balancing/zone_aware)
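
As a rough illustration of the idea, here's a toy Python sketch of zone-aware balancing (my simplification, not Envoy's actual algorithm; the zone names and the fair-share heuristic are made up):

    import random

    def pick_backend(local_zone, backends_by_zone):
        total = sum(len(b) for b in backends_by_zone.values())
        local = backends_by_zone.get(local_zone, [])
        fair_share = total / len(backends_by_zone)  # even split across AZs
        # Stay local with probability proportional to how well-stocked the
        # local AZ is; otherwise route to a random cross-zone backend.
        if local and random.random() < min(1.0, len(local) / fair_share):
            return random.choice(local)
        remote = [b for z, bs in backends_by_zone.items()
                  if z != local_zone for b in bs]
        return random.choice(remote or local)

    backends = {"us-east-1a": ["a1", "a2"], "us-east-1b": ["b1"],
                "us-east-1c": ["c1", "c2", "c3"]}
    print(pick_backend("us-east-1b", backends))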

I also like the other features that Envoy has -- it can start distributed traces, it gives every request a unique ID so you can correlate application and frontend proxy logs, it has a ton of counters/metrics for everything it does, and it can pick apart HTTP to balance requests (rather than TCP connections) between backends. It can also retry failed requests, so that users don't see transient errors (especially during rollouts). And its retry logic is smart: if your requests are failing because a shared backend is down (e.g. your database blew up), it breaks the circuit for a period of time and lets your app potentially recover.

The result is a good experience for end users sending you traffic, and extreme
visibility into every incoming request. Mysterious problems become easy to
debug just by looking at a dashboard, or perhaps by clicking into your tracing
UI in the worst case.

The disadvantage is that it doesn't really support any service discovery other
than DNS out of the box. I had to write github.com/jrockway/ekglue to use
Kubernetes service discovery to map services to Envoy's "clusters"
(upstreams/backends), but I'm glad I did because it works beautifully. Envoy
can take advantage of everything that Kubernetes knows about the service,
which results in less config to write and a more robust application. (For
example, it knows about backends that Kubernetes considers unready -- if all
your backends are unready, Envoy will "panic" and try sending them traffic
anyway. This can result in less downtime if your readiness check is broken or
there's a lot of churn during a big rollout.)
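
The panic behaviour is simple enough to sketch. A rough Python rendition of the rule (Envoy's default panic threshold is 50%; the function and names here are mine):

    import random

    PANIC_THRESHOLD = 0.5

    def choose(backends):  # backends: list of (name, healthy) pairs
        healthy = [name for name, ok in backends if ok]
        if len(healthy) < PANIC_THRESHOLD * len(backends):
            # Panic mode: if "everything" looks unhealthy, assume the
            # health checks are lying and balance across all hosts anyway.
            return random.choice([name for name, _ in backends])
        return random.choice(healthy)

    pool = [("pod-1", False), ("pod-2", False), ("pod-3", True)]
    print(choose(pool))  # 1/3 healthy < 50%: any pod may be picked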

~~~
harpratap
Isn't Ambassador doing the same thing?

Btw, not sure if you read till the end, but they are actually in the process of migrating to Envoy.

~~~
edude03
Ambassador is an Envoy control plane, as in it uses Envoy to do the actual data proxying, but it handles setting Envoy up.

So yes, it is :)

------
nik_0_0
Super interesting post. Following blog links, the timeline in [https://slack.engineering/all-hands-on-deck-91d6986c3ee](https://slack.engineering/all-hands-on-deck-91d6986c3ee) also offers a look at the play-by-play.

However, as far as I can read it, they have somewhat different views on the
root cause?

"Soon, it became clear we had stale HAProxy configuration files, as a result
of linting errors preventing re-rendering of the configuration."

vs.

"The program which synced the host list generated by consul template with the
HAProxy server state had a bug. It always attempted to find a slot for new
webapp instances before it freed slots taken up by old webapp instances that
were no longer running. This program began to fail and exit early because it
was unable to find any empty slots, meaning that the running HAProxy instances
weren’t getting their state updated. As the day passed and the webapp
autoscaling group scaled up and down, the list of backends in the HAProxy
state became more and more stale."

Maybe a combination of the two?

~~~
ketzo
Honestly it’s a bit tough for me to parse, but the way I’m reading it:

1. Stale configs led to an overabundance of web apps, and _then_

2. Old instances of the web app couldn’t be removed because of the consul-template bug.

So, yes, a combination (in sequence) of the two.

Hard for me to be sure because I’m by no means knowledgeable on this stuff.

~~~
aeyes
Even easier way to understand what happened:

- slots full

- to update slots with a new host you need an empty slot

- hosts went away but updating config was impossible -> errors because config referenced non-existing hosts

~~~
folkhack
Agree, but one more:

- monitoring was broken, so we didn't learn about it until it was too late

------
nikanj
> ” The reason that we haven’t been doing any significant work on this HAProxy
> stack is that we’re moving towards Envoy Proxy ”

I truly don’t understand the cycle of churn. Once most of the edge cases and bugs of HAProxy have been found, the right decision is not to migrate to completely unknown territory again. No project is a silver bullet, and changing stacks after you find the bugs makes for a terrible return on your bug-hunting investment.

~~~
brown9-2
I think if you read this as they’re moving to Envoy _because_ of this incident
then you’ve misread.

But it also sounds like Envoy and HAProxy have fundamentally different
approaches to service discovery:

[https://twitter.com/mattklein123/status/1277729102271676416?...](https://twitter.com/mattklein123/status/1277729102271676416?s=21)

[https://twitter.com/mattklein123/status/1278114953497436161?...](https://twitter.com/mattklein123/status/1278114953497436161?s=21)

------
jedberg
Great writeup. It's cool that they were able to figure it out as quickly as
they did, all things considered.

If I were brought in as a consultant on this, my first question would be: why
are you using a fleet of HAProxies instead of the ALB? I'm not saying that's a
bad choice, but I'd want to know why that choice was made.

The second question I would ask is what kind of Chaos Engineering they are
doing. Are they doing major traffic failover tests? Rapid upscaling tests?
Random terminations?

Those are probably the first two things I'd want to solve.

~~~
rhizome
I hope all of these questions would be asked only after everything was working
again!

~~~
jedberg
Of course. :) These are things I would ask during a post-mortem.

------
fredsted
Interesting bit:

> The program which synced the host list generated by consul template with the
> HAProxy server state had a bug. It always attempted to find a slot for new
> webapp instances before it freed slots taken up by old webapp instances that
> were no longer running. This program began to fail and exit early because it
> was unable to find any empty slots, meaning that the running HAProxy
> instances weren’t getting their state updated
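
That failure mode is easy to reproduce in miniature. A hypothetical Python reconstruction from the description above (not Slack's actual code): allocating slots for new hosts before freeing those of departed hosts wedges the sync as soon as the table fills, and every later run exits before freeing anything.

    # slots: fixed-size table mapping slot index -> hostname (or None)
    def sync_buggy(slots, live_hosts):
        current = {h for h in slots if h is not None}
        for host in live_hosts - current:          # 1. place new hosts first
            try:
                slots[slots.index(None)] = host    # raises if table is full
            except ValueError:
                raise SystemExit("no empty slot")  # early exit: nothing freed
        for i, h in enumerate(slots):              # 2. ...then free dead ones
            if h is not None and h not in live_hosts:
                slots[i] = None

    def sync_fixed(slots, live_hosts):
        for i, h in enumerate(slots):              # free departed hosts FIRST
            if h is not None and h not in live_hosts:
                slots[i] = None
        current = {h for h in slots if h is not None}
        for host in live_hosts - current:          # ...then place new hosts
            slots[slots.index(None)] = host

    slots = ["web-1", "web-2"]                     # table already full
    sync_fixed(slots, {"web-2", "web-3"})
    print(slots)                                   # ['web-3', 'web-2']
    sync_buggy(["web-1", "web-2"], {"web-2", "web-3"})  # exits early, stale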

------
hyperman1
I've just been bitten by this too:

    The broken monitoring hadn’t been noticed partly because this system ‘just worked’ for a long time, and didn’t require any change.

Any experience on how to deal with it? Who watches the watchers?

~~~
harpratap
Chaos Engineering comes into play here. Deliberately break your platform to
see if everything works as expected.

~~~
cranekam
But how, exactly?

Sure, the chaos monkey could kill haproxy-server-state-management, but that wouldn't uncover the bug in question — it'd just demonstrate that without it running, HAProxy's view of the world goes stale, which anyone would expect. Triggering the bug would require reducing the number of HAProxy slots below the number of webapps running for many hours. This is clearly something chaos engineering _could_ do, but IMO it's highly unlikely anyone would think to do it. If they had thought of this, they would also have thought about adding tests that caught such an issue long before the code went into production.

In my experience chaos engineering is often only as good as the amount of
thought put into the things it does. Killing processes here and there _can_ be
useful but it often won't expose the kind of when-the-stars-align issues that
take down infrastructures.

It looks like a classic lack of monitoring, as the article says. Alerting on
webapps > slots, early exits, or differing views of the number of webapps up
would have likely caught this.
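
Those checks are cheap to express. Something like this hypothetical sketch (the thresholds and the plumbing for gathering the counts are entirely stack-dependent):

    def sanity_alerts(slot_count, webapp_count, sync_exit_code,
                      haproxy_backend_count, tolerance=5):
        alerts = []
        if webapp_count > slot_count:
            alerts.append("more webapps than HAProxy slots")
        if sync_exit_code != 0:
            alerts.append("state-sync program exiting early")
        if abs(haproxy_backend_count - webapp_count) > tolerance:
            alerts.append("HAProxy and autoscaler disagree on backend count")
        return alerts

    # e.g. 120 webapps, 100 slots, sync failing, HAProxy only sees 70:
    print(sanity_alerts(100, 120, 1, 70))  # all three alerts fire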

~~~
harpratap
> Sure, the chaos monkey could kill haproxy-server-state-management but that
> wouldn't uncover the bug in question

No it won't. But it would uncover their missing alerts for a critical platform
component. Their issue was exacerbated by the fact that state-management kept
failing for nearly 12 hours and no one noticed.

~~~
cranekam
Maybe, though again I would expect that an alert for the process being dead would also not have uncovered this particular issue. It's possible someone would notice that there was insufficient alerting and then go back to add something which would have caught this, but it's far from certain. OTOH it's also possible that at this stage in the game, when the code and systems are EOL, something like chaos testing is the last chance to catch this problem.

I'm not totally against chaos testing. I just haven't seen it done well and
think it's actually pretty hard to pull off (particularly the non-technical
aspect of convincing people it's okay to let this thing go mad). I'd love to
see how effective it was within Netflix.

------
MapleWalnut
> It’s worth noting that HAProxy can integrate with Consul’s DNS interface,
> but this adds lag due to the DNS TTL, it limits the ability to use Consul
> tags, and managing very large DNS responses often seems to lead to hitting
> painful edge-cases and bugs.

I was surprised by how they dismissed HAProxy integration with Consul using SRV DNS records. Can anyone confirm the problems they highlight?

It seems like the service of theirs that broke would not be needed if they went the DNS route.

~~~
blyry
Pre-2.0 there were a few bugs with SRV discovery; maybe they adopted early and got bit? Just an anecdote, but we've been using it since 1.9 without issue. Massively different scales, though.

Pre-k8s and before SRV support, we used consul-template in prod as well, but it always scared me; it seemed like too many moving pieces for what should've been a simple system.

~~~
blyry
I asked internally and figured out the gotcha that bit us: the default DNS payload size is 512 bytes, which is enough for a few backend hosts, but for sure not 12 or 30. The limit is 8KB, which probably wouldn't work for whatever Slack is doing.

[https://cbonte.github.io/haproxy-dconv/2.1/configuration.htm...](https://cbonte.github.io/haproxy-dconv/2.1/configuration.html#5.3.2-accepted_payload_size)

Because DNS records come back in random order for each response, those truncated DNS responses caused the backend slots to constantly rotate between different pod instances. HAProxy was graceful about the rotations, but it showed up as suddenly very strange latency/perf numbers when a backend was scaled up to, say, 10 instances from the normal 3.
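
Some back-of-the-envelope arithmetic on why 512 bytes runs out so fast (a sketch; the per-record size is a rough guess, and real responses vary with name compression and any A records riding along in the additional section):

    HEADER = 12                 # fixed DNS header
    QUESTION = 30               # e.g. "_http._tcp.web.service.consul"
    PER_SRV = 2 + 10 + 6 + 40   # name ptr + RR header + SRV fields + target

    for limit in (512, 8192):   # default UDP payload vs HAProxy's 8KB max
        fits = (limit - HEADER - QUESTION) // PER_SRV
        print(f"{limit:>5} bytes -> ~{fits} SRV records")
    # 512 bytes fits only ~8 backends; 8KB stretches to ~140.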

------
anderspitman
Seems to me that most of the problems came from the sheer scale Slack operates at. A single instance of a self-hosted chat application wouldn't require any of the load balancing infrastructure. But SaaS is really convenient. I wonder if there's a market for "local cloud" companies that operate out of your city and offer hosting of popular open source projects. The complexity would be much lower, and hopefully the reliability higher. Plus you get the benefit of lower latency. 90% of the people in my team's Slack channels live within 5 miles of each other.

~~~
dannyw
A single instance of a self-hosted chat application would also result in unbounded hours investigating some obscure bug that affects 1 in 100,000 deployments, etc., while the developers are in the dark because they don't have the ability to remotely dig in and investigate.

Self-hosting is not a panacea, and I would not expect it to be more reliable.

There's also little reason to really want your servers in your city; 5ms of latency savings isn't it. Economies of scale in large datacentres with good network uplinks and centralized reliability teams matter more.

------
xorcist
Autoscaling is hard. Never ever use one that you don't thoroughly understand.

An autoscaler that keeps chugging when deploys aren't green is outright
dangerous.

~~~
harpratap
I think this one is more of a service discovery bug than an auto-scaling one.

~~~
xorcist
Perhaps you know more than is in this blog post, but it sounds more like a
rather standard load balancer.

Deployment broke. Yeah, that probably should have been caught. Even if it wasn't, monitoring should have caught the stale load balancer config. It didn't, for some reason unknown to us.

These things happen. Things break. The autoscaler then proceeded to kill
customer traffic. That was the part that worked as designed, so another design
would have avoided escalating the situation (if you forgive some armchair
engineering here).

~~~
cutemonster
Maybe what was missing was testing high load and autoscaling to 5x the traffic on, say, a holiday (when Slack's customers aren't working).

------
saagarjha
> One of the incident’s effects was a significant scale-up of our main webapp
> tier.

Sorry, I’m not very familiar with the terminology here; what is the “main
webapp tier”?

~~~
jsmeaton
Apps these days might have several groups of services. For a simple case, you might have a web tier serving HTTP requests from customers, and a worker/background-task tier. They usually scale independently.

------
astral303
Another insight for me was coding the original system to have N slots. That should've been a red flag: why is there an arbitrary constraint there? Why allow the system to have a fixed limit at all?

If you choose to go down the slots road, then you need alerting and discovery for reaching slot limits, which means monitoring and tracking them and setting up alerts.

------
vecio
What are the differences between using HAProxy or Envoy and using the cloud load balancers from AWS or Google Cloud?

~~~
ublaze
Cloud load balancers can be sneakily expensive. A few months ago, we spent a few weeks replacing an ELB with naive client-side load balancing via round robin, which saves us >$200k/year. ELBs charge per byte transmitted, which seems reasonable, but can end up really expensive.

~~~
cutemonster
> client side load balancing

In the browser? Or a mobile app?

They send 1 api req to server 1, then 1 to server 2 and so on? What about any
session cookies maybe tied to a specific server?

~~~
Nextgrid
Presumably round-robin DNS. A DNS response would only return a handful of
servers, of which the client will itself only pick one at random for the
duration of the session.

Now this approach has drawbacks (DNS responses are cached, and the DNS record
picked initially by the client will typically be cached until the app/browser
is restarted) but if they are acceptable to you then it's an easy, proven
solution.
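
For a native (non-browser) client, the naive version with failover is only a few lines. A Python sketch, using plain HTTP for simplicity (with TLS you'd also need to send SNI and validate the certificate against the hostname rather than the bare IP):

    import random, socket, urllib.request

    def fetch(host, path="/"):
        # Resolve every A record, then try addresses in random order
        # until one answers. urllib errors subclass OSError, so a dead
        # host and a 5xx response both fall through to the next address.
        addrs = list({ai[4][0] for ai in
                      socket.getaddrinfo(host, 80, socket.AF_INET)})
        random.shuffle(addrs)
        for addr in addrs:
            try:
                req = urllib.request.Request("http://%s%s" % (addr, path),
                                             headers={"Host": host})
                return urllib.request.urlopen(req, timeout=5)
            except OSError:
                continue
        raise ConnectionError("all %d addresses failed" % len(addrs))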

~~~
cutemonster
Hmm, I'd guess they have DNS CNAMEs like api1.x.com and api2, 3, 4.

And then the client picks one, and if that server is offline, picks another.

Seems as simple as the DNS-based approach? And it works with broken server(s).

~~~
boring_twenties
Except you don't need that because you can just return all four IP addresses
for one record, e.g. api.x.com

~~~
cutemonster
I think if such a DNS/IP-based round-robin server is down, or replies with a 500 error, the client won't try another server.

Unless there's a way to get all the IP addrs in JS? By custom client code that queries the DNS system?

~~~
Nextgrid
DNS resolution is handled by the DNS server and the browser. JS isn't
involved, it's just telling the browser to connect to a certain hostname and
the browser itself decides which IP to map it to (based on its DNS cache).

If the DNS server is down the website wouldn't load at all, but this is an
acceptable trade-off considering DNS is a very simple system (not many things
can go wrong) and servers can be redundant.

~~~
cutemonster
There's a misunderstanding. That reply is to me off topic, I knew about those
DNS things already

Thanks anyway for replying

------
SilasX
Stupid question, and admittedly off topic:

What's with the "terrible, horrible, no-good, very bad" expression I see a lot? Is it a reference to something? From googling, it seems to be this [1], but ... _why_? Why do people reference it?

Usually you reference some work like this because a) the phrase is unusually
creative, or b) the work is unusually memorable. Neither is true here.

[1]
[https://en.wikipedia.org/wiki/Alexander_and_the_Terrible,_Ho...](https://en.wikipedia.org/wiki/Alexander_and_the_Terrible,_Horrible,_No_Good,_Very_Bad_Day)

~~~
nikanj
This book sold well enough to score a TV series, a Disney movie, a musical, and a theater play.

How much more successful do you need it to be? The odds seem very high that a kid growing up in the past 50 years was exposed to this story and phrase.

~~~
SilasX
I wasn’t. Or if I was, it obviously didn’t make an impression, and properly
so: combining random negative words isn’t creative.

~~~
narag
It seems there's some kind of "all your base are belong to us" weird resonance
for it. I had my own explanation why it works, but the only proof that's
really needed is that it became a meme.

------
realtalk_sp
Pretty fantastic case study in the perils of complex systems. Is there any
place where these types of post-mortems are collected? Could be a very
valuable resource for systems engineers.

~~~
marksomnian
Here's a decent list: [https://github.com/danluu/post-mortems](https://github.com/danluu/post-mortems)

------
sfpoet
Scalability is always such a spur-of-the-moment implementation at a startup. This seems to be cruft left over from that startup phase. Would a scalability audit have caught it? Tough to say, as Slack came from that build-fast-and-break-things era.

~~~
cutemonster
Scale fast and break things

------
dpix
Discussion during the outage:
[https://news.ycombinator.com/item?id=23161623](https://news.ycombinator.com/item?id=23161623)

------
wayanon
I bet if this story was told as an animation on YouTube it would be popular.

------
kd5bjo
The title is an allusion to a popular children’s book[1]. I’m assuming that an automated algorithm pulled the “very”; hopefully the mods will consider restoring it.

[1]
[https://en.wikipedia.org/wiki/Alexander_and_the_Terrible,_Ho...](https://en.wikipedia.org/wiki/Alexander_and_the_Terrible,_Horrible,_No_Good,_Very_Bad_Day)

~~~
chrismorgan
HN automatically mangles titles in various ways I’m not fond of (dropping words that might or might not be significant, fiddling with capitalisation, _&c._), but the submitter can go back and edit the title back to what it was supposed to be, and the one time I’ve done that the system didn’t mangulate it again.

~~~
dang
That's by design, because the software is obviously imperfect. It does more
good than harm, though, so we keep it.

------
sudhirj
This is one of the biggest arguments I see for serverless (AWS Lambda +
DynamoDB) or at least managed PaaS systems (Google App Engine, Heroku with RDS
or CloudSQL). These systems may seem to cost more for some workload curves (or
might even be cheaper for your curve), but the difference is worth it because
you're paying for specialized 24/7 dev-ops teams whose only job is to keep
these systems running smoothly, and by definition they're already familiar
with running workloads orders of magnitude bigger than yours. Even then the
platforms might be cheaper because you're only paying for a fraction of the
dev-ops team's salaries, but you get their full benefit.

~~~
jwr
> because you're paying for specialized 24/7 dev-ops teams whose only job is
> to keep these systems running smoothly, and by definition they're already
> familiar with running workloads orders of magnitude bigger than yours

This is based on faith — there might, or might not, be a specialized 24/7 devops team who runs these things better than you.

My rational mind has trouble accepting things based on faith, which is also
why I don't trust RDS: I don't know of any way to run a distributed SQL
database without data loss (neither does Jepsen), so why would I expect RDS to
do this correctly?

Using those services does provide a warm and fuzzy feeling, though.

~~~
WJW
The specialized 24/7 devops team (if it's there) also has a few thousand other
customers instead of being there just for you. They might have other
priorities at any given moment. It's not like AWS or GCP are renowned for the
quality of their customer service.

~~~
sudhirj
This is actually a plus point if you ask me. These few thousand customers are
all operating on the same racks as me - in an environment like AWS or GCP
there's no special part of the datacenter reserved for fancy customers. The
billion-dollar customers all run their VMs on the same rack as me. So whatever
work the ops teams do to keep things reliable benefits me as much as the
biggest customers.

~~~
bigiain
Slack's problem, though, had nothing to do with racks going down.

I'm not convinced Amazon's team is immune from the sort of complex failure mode described here. I'll bet there are people with equivalent sorts of stories about edge cases in service interactions (either their own set of Lambda services or the AWS ones behind them, or more likely both) leading to a similar unexpected failure cascade.

~~~
jjoonathan
> I'm not convinced Amazon's team is immune from the sort of complex failure
> mode

You're being way too kind. Not only is AWS not immune, their autoscalers are
often absurdly primitive. Like, hourly cron job doubling / halving within
narrow safety rails primitive, where it's not merely possible to find a load
that trips it up, it's all but inevitable.

This varies by service, but they always project an image of their
infrastructure being rather smart, and in the cases where I've been able to
make an informed guess about what's actually going on, it's usually wildly
inconsistent with the marketing. They don't warn you about the stinkers and
even on services with good autoscaling and no true incompatibility between
AWS's hidden choices and your needs, your scaling journey will involve
periodic downtime as you trip over hidden built in limits and have to beg
support to raise them. Sometimes you get curiously high resistance to this,
leading to the impression that these aren't so much "safety rails" as
hardcoded choices.

Oh, and just last week we managed to completely wedge a service. The
combination of a low limit on in-flight processes, two hung processes,
immutability on running processes, and delete functionality being predicated
on proper termination led to a situation where an AWS service became
completely unusable for days while we begged support to log in and clear the
hung process. Naturally, this isn't going to count as downtime on any
reliability charts, even though it's a known problem and definitely looked a
lot like downtime on our end.

We're a small (<10) team with modest needs. AWS lets us do some crazy awesome things, but it really bugs me how reliably they over-promise and under-deliver.

~~~
bigiain
> You're being way too kind. Not only is AWS not immune, their autoscalers are
> often absurdly primitive.

Yep. Very much so. Mostly because I don't have enough personal Lambda-specific war stories to feel confident badmouthing it in the context of this discussion thread. But the bits of AWS I do use are certainly not all rainbows and roses...

I have one app/platform I run that basically sits at a few requests an hour for 11 months of the year, then ramps up to well over 100,000 requests a minute between 8am and 11pm for 14 days. Classic ELB (back in the day) needed quite a lot of preemptive poking and fake generated traffic to be able to ramp up capacity fast enough for the beginning of each day (ALB is somewhat better but still needs juggling). We never even got close to getting autoscaling working nicely on the web and app server plane to let it loose in prod with real credit card billing at risk, so we just add significantly over-provisioned spot instances for our best estimates of yearly growth (and app behaviour changes) for the two weeks instead, and cautiously babysit things for the duration.

It's nice that we can do that. It'd be nicer if I didn't have to keep explaining to suits and C*Os why they can't boast on the golf course that they have an autoscaling backend...

~~~
jjoonathan
Hey, let's not be hasty, it sounds like you've built a bespoke autoscaling
backend that intelligently predicts future usage and dynamically allocates
compute resources to match customer needs.

------
ulisesrmzroche
Why was the “very” removed? It sounds so much better like that.

~~~
dang
HN's debaiting software took it out. We put it back.

[https://news.ycombinator.com/item?id=23756384](https://news.ycombinator.com/item?id=23756384)

[https://quoteinvestigator.com/2012/08/29/substitute-damn/](https://quoteinvestigator.com/2012/08/29/substitute-damn/)

------
k__
After using Discord in different contexts for months now (and Slack for years), I can't understand why anyone willingly chooses Slack.

It's the Atlassian of chat tools: horrible performance and bad usability.

~~~
StavrosK
I use Zulip for the day to day (it's amazing, I can't recommend it enough),
but sometimes use Slack because some open source communities use it, and I'm
always amazed at how damn slow it is. I can consistently out-type it, it's
terrible.

I guess it was great when it started out, but they're slowly boiling the frog,
who is us.

~~~
mikecoles
Zulip won my last bake-off for chat systems. Integration was easy and the
topic method of providing threads was amazing. The only feature it was missing
was federation. In the XMPP world, you could communicate with users on other
XMPP instances. With Zulip, you can only communicate with local users. Do you
know if this is still the case?

~~~
tabbott
As other folks have mentioned, Zulip has a number of cross-server integrations with both the Zulip protocol and other protocols like XMPP. There are a few we document here, as well as Matterbridge:

* [https://zulipchat.com/integrations/communication](https://zulipchat.com/integrations/communication)

* [https://github.com/42wim/matterbridge](https://github.com/42wim/matterbridge)

We'll eventually add a more native way of connecting a stream between two Zulip servers; we just want to be sure we do it right. Federation done sloppily is asking for a lot of spam/abuse problems down the line.

(I'm the Zulip lead developer)

------
Pxtl
Not being involved in this kind of scaled-up devops stuff, my read of this article was "HAProxy sounds _awful_"... but they noted in the article that:

1) they're migrating to better software, and 2) newer versions of HAProxy are _also_ better software.

But yeah, the first half, where they discuss "we don't do X, Y, or Z clean, idempotent way of doing config because HAProxy doesn't perform right if you do it that way", sounded painfully familiar from a bunch of awful tools that I've used.

~~~
kevmo314
I don't think that's fair. HAProxy is good and reliable. It's just old for the
use cases we need today. Even the article has this opinion:

> While HAProxy has served us well and reliably for many years, it also has
> some operational sharp edges

Saying it sounds awful is like saying Vanilla JS sounds awful because it
doesn't scale to the demanding webapp use cases of today.

And, as you mentioned, it's really compounded by the fact that they're using
an outdated version of HAProxy, but that doesn't make it awful software.

~~~
Pxtl
> Vanilla JS sounds awful

... it kinda is?

~~~
kevmo314
I don't think so. Just because something isn't a swiss army knife doesn't make
it awful in my book.

