Hacker News new | past | comments | ask | show | jobs | submit login
Slack’s migration to a cellular architecture (slack.engineering)
395 points by serial_dev on Aug 26, 2023 | hide | past | favorite | 237 comments



Their siloing strategy, which I'll roughly describe as resolving a request within a single AZ, is a good way to keep operations and monitoring simple.

A past team of mine managed services in a similar fashion. We had a couple (usually 2-4) single AZ clusters with a thin (Envoy) layer to balance traffic between clusters.

We could detect incidents in a single cluster by comparing metrics across clusters. Mitigation was easy: we could drain a cluster in under a minute, redirecting traffic to the other ones. Most traffic was intra-AZ, so it was fast and there were no cross-AZ networking fees.
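The drain mechanism described above can be sketched in a few lines. This is a hypothetical illustration of weight-based routing, not Envoy's actual config schema; all names are made up:

```python
import random

# A thin routing layer holds a weight per single-AZ cluster;
# draining a misbehaving cluster is just zeroing its weight.
weights = {"cluster-a": 100, "cluster-b": 100, "cluster-c": 100}

def pick_cluster(weights):
    """Weighted random choice among clusters that still accept traffic."""
    live = {c: w for c, w in weights.items() if w > 0}
    r = random.uniform(0, sum(live.values()))
    for cluster, w in live.items():
        r -= w
        if r <= 0:
            return cluster
    return next(iter(live))  # guard against float rounding

def drain(cluster):
    """Mitigation: stop sending new requests to this cluster."""
    weights[cluster] = 0

drain("cluster-b")
# Traffic immediately redistributes to the remaining clusters.
assert all(pick_cluster(weights) != "cluster-b" for _ in range(1000))
```

Because the decision lives in one routing layer rather than DNS, the drain takes effect as soon as the weights propagate, which is how sub-minute mitigation is possible.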

The downside is that most services were running in several clusters, so there was redundancy in compute, caches, etc.

When we talked to people outside the company, e.g. solution architects from our cloud provider, they would be surprised at our architecture and immediately suggest multi-region clusters. I would joke that our single AZ clusters were a feature, not a bug.

Nice to see other folks having success with a similar architecture!


Yeah, I talked with a business that used a similar architecture for the same reasons. It can be really effective in multi-tenant apps where each customer's data is fully independent and private. They also used multiple Amazon organizational accounts as a security partition. It made a few things more difficult, but they felt the peace of mind was worth it.


My company has a pretty unique strategy where we have separate AWS accounts for each unit within the company. Each unit gets a prod and non-prod account.

We have ~150 accounts, so roughly 75 different departments, with some having few resources and others having a lot.

It's complex, but provides a lot of nice security primitives. We have an overarching administrative account, but that doesn't get used (and lots of alarm bells go off when it is).


I assume you were using AWS? I know some of the AZs of other cloud providers (Azure? Oracle? Google?) are not fully siloed. They might have independent power and networking, but be in the same physical location.

I'm mentioning this for other people to be aware as one can easily make the assumption that an AZ is the same concept on all clouds, which is not true and painful to realise.


Azure's zones are "physically separate", but it's unclear whether zones could be in the same building. Especially since they don't guarantee distance between zones - they just aim for 300 mi (483 km)


Actually I assumed AWS did it the same way as the others - I thought maybe they are in another building on a campus but I didn’t think that should be a factor in planning and that I should use regions for geographic redundancy anyway.


Afaik AWS AZs are physically separate. I think some maps exist. Around here, there are 3 AZs and they're multi-building campuses about 10-20 miles apart, situated in suburbs outside the city freeway belt.


Yeah I definitely thought they were physically separate - I thought that an AZ might span multiple data centres too (could be wrong) and I thought they had to be at least 10km apart.


Thanks for highlighting this. Indeed, not all CSPs are the same.


It sounds like you didn't have persistent data, and were only offering compute? If there's no need for a coherent master view accessible/writable from all the clusters, there would be no reason to use a multi-region cluster whatsoever.


We did. But the persisted data didn't live inside those ephemeral compute clusters.


So your data store was still multi AZ? I’m a little confused how you’d serve the same user’s data consistently from multiple silos. Do you pin users to one AZ?


Yeah, keep stateful stuff and stateless stuff separate; separate clusters, network spaces, cloud accounts, likely a mix of all that.

Clearly define boundaries and acceptable behavior within boundaries.

Set up telemetry and observability to monitor for threshold violations.

Simple. Right?


i mean you could also just spin up a reeeeeally big compute node and just do it all there.

fewer things to monitor. fewer things that can fail.

just log in from time to time to update packages.

see, cloud doesn’t have to be complex.


until you get a "hardware failure" note from the cloud provider. Or the person updating packages makes a typo and messes up a system.

sure, use "pet" computers for experiments and dev... but having production be "cattle" makes your life so much less stressful.


I think everyone internalizes pets vs cattle a little differently, but I think people often don't realize that if you have cattle, then you now have a cattle ranch. I.e. you often create a new concept to manage your cattle, and you often run that like it's a pet.

E.g. you want to think about your machines/containers like cattle, so you put them into a kubernetes cluster, which has become your new pet. If all your infra fits on one machine, it's way easier to have that as your pet the same way it's easier to have a dog than run a livestock operation.


In cloud environments, it's pretty standard practice to automate cluster provisioning. If your clusters are pets, you're not doing it quite right.

> If all your infra fits on one machine, it's way easier to have that as your pet

What stops you from automating the provisioning of that?

These are two orthogonal issues. Clusters are used to manage many machines. If you need only one machine, you don't need a cluster. Either way, the provisioning of single nodes and clusters should be automated. Whether you have pets or cattle is not related to how many machines you have.


Having a few computers doesn't imply a "pets" approach to managing them.


more like a zoo, right?


I'm not a cloud guy, but if you're going to put everything in one region, what's stopping you from shoving everything into a bunch of containers on a t2.2xlarge instance (or equivalent), and adding like one CloudWatch alarm (or equivalent) that reboots the instance if it stops responding?


At least three obvious reasons might be:

(1) Raw scale might mean you just plain can't fit everything on a single t2.2xlarge.

(2) Different services (containers) might have different performance profiles, so you may want a few different types of machines around.

(3) You probably still want N+2 redundancy even within your single AZ, so this scheme should at least be upgraded to three t2.2xlarge boxes. ;)
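The N+2 sizing in (3) can be put in numbers. This is a generic sketch under my own naming (the function is not any real API): size for peak load, then add two spares so the fleet survives one box failing while another is out for a deploy.

```python
import math

def n_plus_2(peak_load, per_box_capacity):
    """N+2 sizing: enough boxes for peak load, plus two spares
    (one can fail while another is down for a deploy)."""
    n = math.ceil(peak_load / per_box_capacity)
    return n + 2

# Even when the whole load fits on one t2.2xlarge-sized box,
# N+2 says to run three of them.
assert n_plus_2(peak_load=1, per_box_capacity=1) == 3
```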


will you settle for one big container?


Hey you know what; you do you.

I’m an EE by education. It’s electron state, silly leaky abstraction Stan’ing to my head.

Different babble for allocation of memory and algorithmic manipulation of the values stored within.

Correctness is important when it comes to results being mapped to human consumption, and even then the subset of parameters to be rigorous with can be made subjective. Personally, I lean on a subset that includes biological health and well-being, and deprioritize religiosity.


It took me some time to realize that Cloud Solution Architects are also just slightly more technical salespeople in disguise whose only mission is upselling you into more dependency. Same thing with their PR: every CxO these days says they need "multi-cloud", whatever that means; the costs are usually enormous while complexity rises, with questionable benefit.

I did the math for our own stack after a setback month in client revenue, and decided to put all our servers into a single AZ in a single region. The only multi-AZ, multi-region services are our backups. Surviving bad machines happens often enough that it's priced in via Kubernetes, but losing a whole AZ is a freak accident that's just SO rare that, calculating real business risk, it seemed apt to pretend it just doesn't happen (sorry, Google Cloud Paris customers).

Call me reckless, but I haven't looked back since, and it saves us thousands of dollars in cross-AZ fees per month alone.


Yeah, for many businesses it probably isn't necessary to have crazy short RTO and RPO. Just restore the most recent backup in a new region and point at the cloud provider outage report...


I think there's this general problem with cloud deployments I'm seeing happen more and more:

People building this huge Multi-AZ, Hyper-Redundant, Multi-National, infinitely scaling Cloud Solution for something that requires a single VM and a database.

Most companies just don't need that level of scale and would be better off building something smaller; when you actually do scale, you rewrite it with the profits made from the smaller solution.

Of course there are many companies that do require something large but you should seriously consider if something smaller will do first.

I think solutions like a 100% Cloudflare Workers-based backend can sidestep this a little, but usually that's not possible or even the right thing in every situation.


The downside of single-AZ clusters is capacity. If you need to drastically scale up, the compute might not be available in a single AZ.


Even though each cluster was single AZ the whole system wasn't, so we weren't bound by the capacity of a single AZ.

Most of the situations where we needed to drastically scale up were known ahead of time as well (e.g. campaign from customer), and we would preallocate instances or even more clusters.

I may be forcing my memory, but if I'm not mistaken, our autoscaling was set up in a way that the system could handle sudden load increases of ~50% without noticeable disruption. Spikes bigger than that could lead to increased latency and/or error rate.


That's another way of saying your typical utilization ratio is 66%. Which is on the low side honestly.
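The arithmetic behind that figure is worth making explicit. A back-of-envelope sketch (my own function name, nothing official): if a fleet can absorb a given load spike without scaling, its steady-state utilization is the reciprocal of one plus that tolerance.

```python
def utilization(spike_tolerance):
    """Steady-state utilization of a fleet sized to absorb a load
    increase of `spike_tolerance` (0.5 = +50%) without scaling:
    1 / (1 + spike_tolerance)."""
    return 1 / (1 + spike_tolerance)

assert round(utilization(0.50), 2) == 0.67  # +50% tolerance -> ~66% utilization
assert round(utilization(0.25), 2) == 0.80  # +25% tolerance -> 80% utilization
```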

That said, it's a trade off between efficiency and load spike tolerance. I trust that the trade off is made with informed decision.


66% isn’t low utilization. You’re always going to have micro spikes, and you never want to clip, so keeping some headroom around feels smart.

Unless you co-mingle online and offline (batch) traffic on same hosts, flat response times and high utilization aren’t compatible.


High utilization means high variability and low resiliency and the last k-percentage of utilization causes highly non-linear effects.


Smart budgeted retries can help smooth those peaks.
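One common shape for budgeted retries (a generic sketch, not any particular library's API) caps retries at a fraction of recently observed requests, so a failure spike can't amplify into a retry storm:

```python
class RetryBudget:
    """Allow retries only up to `ratio` of observed requests."""
    def __init__(self, ratio=0.1):
        self.ratio = ratio      # max retries as a fraction of requests
        self.requests = 0
        self.retries = 0

    def record_request(self):
        self.requests += 1

    def can_retry(self):
        """Spend budget if available; deny the retry otherwise."""
        if self.retries < self.ratio * self.requests:
            self.retries += 1
            return True
        return False

budget = RetryBudget(ratio=0.1)
for _ in range(100):
    budget.record_request()
allowed = sum(budget.can_retry() for _ in range(50))
assert allowed == 10  # only 10% of the 100 observed requests may retry
```

Under overload, plain exponential backoff still lets every client retry at least once; a shared budget like this bounds the *total* extra load.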


While I agree there are always going to be micro spikes and keeping some headroom is smart, 33% may be too much headroom for all but the most latency-sensitive RPC services. Personally I aim for only 20% headroom.


> That said, it's a trade off between efficiency and load spike tolerance. I trust that the trade off is made with informed decision.

I don't think that relatively low utilization rates are the scenario that requires an "informed decision". The only tradeoff in low-utilization scenarios is cost, which might be outright cheaper and irrelevant once you do the math on the tradeoffs of using reserved instances vs the cost of scaling up with on-demand instances.

You need to make a damn good case to chronically underprovision your system and expect to autoscale your way into nickel-and-dime savings.


Indeed, this is the main problem I run into. We have to scale up capacity before traffic can be redirected, or we briefly double the scope of the outage. That involves multiple layers of capacity bringup: the ASG brings up new nodes, then the HPA brings up the new pods.


If you have enough scale that this could be a problem, cookie-cutter more, smaller AZs so any one outage is a smaller numerator of lost capacity over the denominator of total scale.

Worth noting that requiring teams to use 3 AZs is a good idea because you get "n" shaped patterns instead of mirror shaped patterns, which have very different characteristics for resilience and continuity.


If there's uncorrelated load you can also run on your hosts, then you can share their spare capacity, with the hope that they don't spike at the same time.

AWS does that with their lambda arch to reduce waste.


Maybe, but the cost accounting is already a nightmare.


"A single Slack API request from a user (for example, loading messages in a channel) may fan out into hundreds of RPCs to service backends, each of which must complete to return a correct response to the user."

Not being a dick here but is this not a fairly obvious flaw?

I mean, why not keep a structured "message log" of all channels for all time?

For every write the system updates the message log.

I am guessing and making assumptions I know.


I imagine the base messages are in a single store. But then you have reactions, attachments, gifs, user profiles, and probably hundreds of custom integrations/plugins.

Having worked on other messaging apps, I can say these are usually separated because their performance/scalability requirements are different.


XMMP was extensible to support all this in the early 2000s. Slack reinvented simple services in the most obtuse way. I have to use Slack and I sideline quarterback all the ways things could have been better every day.


XMPP

Agree with this point of view. Aside from the Jabber/XMPP Cisco legal thing, there's just no technical answer for why on earth Slack did not use XMPP under the hood.


What’s even more interesting is … WhatsApp is XMPP/ejabberd based.

Slack would have known about WhatsApp architecture because it was widely talked about pre-FB acquisition (2014).

And Slack was founded in 2013.


Probably it's the result of the hysterical tech decision process happening in some companies: "we need fancy tech and NOT THAT XML".

Sometimes it's for OKRs and power/politics balance between departments and teams in an enterprise. "If an existing tech like XMPP is used then no serious development could be needed" fear (which is not really true). It can lead (and leads) to a huge waste of resources and overspending.

But it's a bit similar to building a luxury house. Not because an owner needs it. But because he can afford it.


Also version history (edits), threads, and links to content in other channels (sharing a message).


> When companies create this microservices bog and then, when any problem comes up, they say, “distributed systems are hard” it reminds me of when my toddler throws food on the floor then says, “look, big mess”

https://x.com/telmudic/status/1684479894406025216


This brings back memories: we spec'd an open distributed operating system called Metal Cell and built an implementation called Cell-OS. It was inspired by the "Datacenter as a Computer" paper, but built with open-source tech.

We had it running across bare metal, AWS, and Azure, and one of the key aspects was that it handled persistent workloads for big data, including distributed databases.

Kubernetes was just getting built when we started and was supposed to be a Mesos scheduler initially.

I assumed Kubernetes would get all the pieces in and make things easier, but I still miss the whole paradigm we had almost 10 years ago.

This is retro now :)

https://github.com/cell-os/metal-cell

https://github.com/cell-os/cell-os


Can someone ELI5 the difference between using AWS availability zone affinity and then simply dropping the downed AZ at the topmost routing point?

Wouldn't that be the same thing, with the obvious caveat that you aren't using the routing technology Slack is using? (We don't; we use vanilla AWS offerings.)


They decided to use every routing tool available at least once in their setup, so they can't do this. But there is no explanation in the blog about why they use so many platforms and so many routing tools. Sounds to me like they got themselves into a mess and decided to continue on that path.


Somewhere, an engineering "leader" is going to point to this blog post and then say, "Well, that's how Slack did it!" and promptly copy this overwrought system.


I'm not sure if you're being serious, but in any case: this will happen, as it always does, inevitably.


The warning statement becomes the how-to guide.


Cells are not about guarding against AZ failure, but about partitioning the production infra to protect against bad deploys and configuration changes. Every AZ is split into many different cells.


So, guarding against human errors / process failures, and not hardware failures?


Isn't that exactly what they are doing? They keep requests within an AZ, and instead of using DNS at the first hop into the AZ, they use Envoy to control traffic shaping and make the initial decision about whether traffic needs to be routed away.


You're doing it right.


Isn't that exactly what they are doing? Keeping requests within an AZ and using global DNS at the first hop into the AZ.


So they run everything in AWS USE1? That doesn't seem very redundant, but then I guess if the whole of USE1 goes down Slack won't be the only service that will be affected.


AWS also uses Slack internally, so add that to the list of shit that can hit the fan if us-east-1/IAD goes down.


Don’t they also use Chime? It wouldn’t be a single point of failure.


To contribute to the tangled ball of messaging: Slack also uses the Chime SDK to handle huddles.


Wonder why huddles sound better compared to chime?


Lots of teams use Slack as well. Oddly enough, I didn't mind Chime as an end-user, but 6 years ago their API features were somewhat lacking.


Huh, I’m surprised they’re not all in on Chime.


It was all on Chime until the Pandemic. Then they moved to Slack.


But then everybody trying to recover from USE1 outage can't use Slack to coordinate the recovery ...


the "whole" of USE1 very rarely goes down [0], because unlike other cloud providers, Amazon's availability zones are actually independent and decoupled, and if you're running on EC2 in a zonal way it's highly unlikely an outage will affect multiple zones.

[0] There are of course exceptions that come once every few years, but most instances people can think of in terms of widespread outages is one specific service going down in a region, creating a cascade of other dependencies. e.g. Lambda or Kinesis going down and impacting some other higher-level service, say, Translate.


AZs are oftentimes buildings right next to each other on the same street. People who think this is a great failure domain for your entire business are deeply misguided. All it takes is a hurricane, a truck hitting a pole, a fire, or any number of extremely common situations, and infra will be wiped off the map. Build stuff to be properly multi-region.


Is this true of AWS? I haven't read that Wikileaks location document in a while, but I seem to recall in official docs the AZs being placed far enough away from each other to make sure a natural disaster won't kill a whole region (different flood plains, etc). Of course, you go to Ashburn and all the buildings are really close to each other.


> AZs are buildings often times right next to each other on the same street.

Not at AWS: https://aws.amazon.com/about-aws/global-infrastructure/regio...

> An Availability Zone (AZ) is one or more discrete data centers with redundant power, networking, and connectivity in an AWS Region. AZs give customers the ability to operate production applications and databases that are more highly available, fault tolerant, and scalable than would be possible from a single data center. All AZs in an AWS Region are interconnected with high-bandwidth, low-latency networking, over fully redundant, dedicated metro fiber providing high-throughput, low-latency networking between AZs. All traffic between AZs is encrypted. The network performance is sufficient to accomplish synchronous replication between AZs. AZs make partitioning applications for high availability easy. If an application is partitioned across AZs, companies are better isolated and protected from issues such as power outages, lightning strikes, tornadoes, earthquakes, and more. AZs are physically separated by a meaningful distance, many kilometers, from any other AZ, although all are within 100 km (60 miles) of each other.

This is unique compared to Microsoft and Google (a single flood taking out multiple AZs? Uh oh: https://www.theregister.com/2023/04/26/google_cloud_outage/)

Sure, a massive earthquake or a nuclear strike could probably take out several.


GCP's concept of Regions and Zones is different from AWS's. For the same level of physical isolation as an AWS AZ, you have to use different GCP Regions.

https://cloud.google.com/compute/docs/regions-zones


That link says "Google designs zones to minimize the risk of correlated failures caused by physical infrastructure outages like power, cooling, or networking. Thus, if a zone becomes unavailable, you can transfer traffic to another zone in the same region to keep your services running."

Which is clearly false if a single flood can take out an entire region.


> This is unique compared to Microsoft

Azure doesn't guarantee distance between zones, but they aim for 483km. So they can provide better isolation at the cost of higher inter-AZ latency. Depends on the region - you'd need an internal contact (and/or NDA?) to get approx numbers


When AWS is in immediately adjacent buildings, it’s for the same AZ.



Yes. To put it a bit bluntly, you are using a very generic Google search and being blind to nuance.

us-east-1 does have more problems than other regions due to a variety of reasons, but it rarely (i.e., once every few years) goes down as a whole. As long as you're in several AZs within us-east-1, the impact of most outages should not take you down completely. In the context of the comment you are replying to, your Google search links are lazy and fail to see the big picture.


thanks - jfyi this was from my personal experience spanning a decade with USE1, but again maybe my experience is out of date, so thanks for the update.

(p.s. the use of a google search vs direct results wasn't "lazy" - it's to allow readers to do their own research vs pasting one result and then getting accused of bias)


Isn’t the point of the article that they don’t? And it describes how they implemented region drains to traffic shift between the different regions.

edit: Hmm, or maybe not? I still sometimes confuse AWS terminology. Perhaps it is all in us-east-1, just in different availability zones (buildings?)


If I understand it correctly they have an edge network for ingress traffic but host all of their core services in a single AWS region (USE1) in multiple availability zones there.


>edit: Hmm or maybe not? I still sometimes confuse aws terminology. Perhaps it is all in us-east—1, just in different availability zones (buildings?)

Correct, us-east-1 has several AZs, with names like us-east-1a, us-east-1b, etc. IIRC us-east-1 has six of them now.


How can such an architecture function with respect to user data? If the DB instance primary handling your shard is in AZ-1 and AZ-1 gets drained, how can your writes continue to be serviced?


Usually in distributed strongly consistent and durable systems, data is not considered committed until it has been persisted in multiple replicas.

So if one goes down nothing is lost, but capacity and durability are degraded.
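A minimal sketch of that commit rule (hypothetical replica counts, not Slack's actual data store): a write only acknowledges once a write quorum of replicas has persisted it, so losing one AZ's replica loses no committed data.

```python
def committed(acks, write_quorum):
    """A write is durable once `write_quorum` replicas persisted it."""
    return acks >= write_quorum

N, W, R = 3, 2, 2  # 3 replicas (say, one per AZ), quorums of 2
assert committed(acks=2, write_quorum=W)      # two AZs persisted it: durable
assert not committed(acks=1, write_quorum=W)  # only the local AZ: not yet
# Because W + R > N, any read quorum overlaps any write quorum,
# so reads always observe every committed write.
```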


That makes sense on its own, but doesn’t it mean that there are lots of network requests happening between silos all the time? It doesn’t seem very siloed.

Or is this some lower-level service that “doesn’t count” somehow?


It's siloed in the sense that if one is down, the others are not affected, as long as enough other replicas are healthy to keep the quorum.

You always need cross-AZ traffic, otherwise your data is single homed (which we used to call "your data doesn't exist").


I think you are right; probably some service like Kinesis or Kafka is keeping them in sync, and that "doesn't count".


Multiple tiers of redundancy. There is usually redundancy within the AZ and then a following copy in another AZ. Usually at least four copies exist for a tenant.


Seems to be the same collection of services deployed in different AZs with a load balancer? The trick would be how data is replicated across the instances, which I'm guessing is some sort of event publishing or even backup sources of truth? It says that will come in the next article and surely that's the more interesting part than the load balancing...

Also explains to me why new features would take a while to roll out if you are cautiously updating instances/AZs one by one.


Is Slack still written in Hack/PHP?


Yes — see my recent article https://slack.engineering/hakana-taking-hack-seriously/

We use a few languages to serve client requests, but by far the biggest codebase is written in Hack, which runs inside an interpreter called HHVM that’s also used at Facebook.


I noticed that the hack blog (https://hhvm.com/blog/) basically stopped posting updates at the end of 2022. As downstream users of hacklang development, have you folks noticed a change in development pace or ambition within the hack development team?


I too am super curious about this.

Plus, it seems telling that Threads was developed in Python - not Hack.

(I’m aware IG is Python & it’s the same team)


You answered yourself there: Hack is still very widely used inside Meta, just less so in IG.


If anything, from what I've heard, Hack is slowly taking over IG and WhatsApp. But it's an incredibly large codebase to move.


Kinda makes sense you would use PHP, even though I'm sure many people are shocked by it. PHP was pretty much born in a web context. The language was created with servers and request/response in mind and it shows.


I really like the writing style in that article:

> PHP makes it really easy to make a dynamically-rendered website. PHP also makes it really easy to create an utterly insecure dynamically-rendered website.


PHP has some excellent ideas that other languages can't replicate, while at the same time having terrible ideas that other languages don't have to think about. Overall a huge fan of modern PHP, thanks for this writeup.


Which excellent ideas does it have that other languages can't replicate?


Perhaps more precisely: the de facto Apache-as-runtime + PHP model simplifies a ton of things. Namely, your request state is created and destroyed within the context of a single process, and you don't have to reason about shared state with other in-flight requests (unless you explicitly choose to go that route). It makes some bad programming patterns workable, because your state doesn't linger over a long-running period. Deploys are also super fast: you just swap the application code on disk and it'll get picked up on the next request (in-flight requests keep processing with the old version, IIRC). It's productive, if not necessarily pretty. Also, it has a type system now!

As a related thought, a lot of the modern serverless stuff feels like it's reinventing the ideas of Apache + PHP, or perhaps CGI?
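The shared-nothing property can be sketched in a few lines. This is a toy handler in the spirit of the Apache+PHP (or CGI) model; purely illustrative, not anyone's production code:

```python
# All request state lives in locals created per call and is discarded
# when the handler returns, like a PHP script's globals per request.
def handle_request(path, params):
    state = {"path": path, "params": dict(params), "output": []}
    state["output"].append(f"hello from {path}")
    return "\n".join(state["output"])
    # `state` is garbage as soon as we return: no lingering shared
    # state, nothing to reason about across in-flight requests.

assert handle_request("/index", {}) == "hello from /index"
```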


Hi Matt

Thanks for Psalm!

Curious, if Slack was built today from ground up - what tech stack do you think should/would be used?


That’s a simple question that’s hard to answer.

A slightly different question that’s a bit easier to answer: “if I could wave a magic wand and X million lines of code were instantly rewritten and all developers were instantly trained on that language”.

There the choice would be limited to languages that have similar or faster perf characteristics to Hack, without sacrificing developer productivity.

Rust is out of the question (compile times for hundreds of devs would instantly sap productivity). PHP, Ruby, Node and Python are too slow — for the moment at least.

So it would be either Hack or Go. I don’t know enough about JVM languages to know whether they would be a good fit.


Thank you for being brave enough not to suggest Rust.


I like your question way better than mine :)

Some follow-up …

A. Isn't PHP on par perf-wise with Hack these days? Re: the "PHP is too slow" comment.

B. Have you ever looked into ngx-php? Its perf looks impressive, though you lose the benefit of statelessness.

https://github.com/rryqszq4/ngx-php

https://www.techempower.com/benchmarks/#section=data-r21


> isn’t PHP on par perf wise to Hack these days?

No. But I don't have any numbers, because it's been years since the two languages were directly comparable on anything but a teeny tiny example program.

Facebook gets big cost savings from a 1% improvement in performance, so they make sure that performance is as good as it can possibly be. They have a team of engineers working on the problem.

PHP doesn't have any engineers working on performance full-time — it's impossible for the language to compete there. Hack has also removed a bunch of PHP constructs (e.g. magic methods) that are a drain on performance, so there's no way to close the gap.

But that should in no way make you choose Hack over PHP. Apart from anything else, the delta won't matter for 99.9% of websites.


Yes, Hack is for Google or FB scale stuff. But to be honest, Slack is probably up there also, so it makes sense


PHP served by an nginx (never saw the abbreviation NGX to be honest) server is standard procedure in PHP land.

Other alternatives are Apache, Caddy, and more…


Not erlang?


But Discord uses Rust to improve performance bottlenecks in OTP ;)


from the article:

>Slack does not share a common codebase or even runtime; services in the user-facing request path are written in Hack, Go, Java, and C++.


Man what a mess. Meanwhile, everyone else can extend a library used by their common services in a common language trivially.


Meh. As long as you've got a good, typed interface for passing messages between them and for having a common understanding of (and versioning system for) key data structures, that's fine for this sort of thing where it's largely processing streams of small messages and events.

… but it’s probably JSON and some JSON-Schema-based “now you have two problems” junk instead of what I described. In which case, yeah, ew, gross. Unless they’ve made some unusually good choices.


There are tons of approaches to aligning on service contracts for JSON-based API calls. There are also frameworks like gRPC which help make contracts explicit. Neither is really uncommon.
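One lightweight way to make such a contract explicit (a sketch under my own naming; real systems often reach for protobuf/gRPC or JSON Schema instead) is to define the message shape once and validate it at the service boundary:

```python
from dataclasses import dataclass

# The message type and field names here are hypothetical examples.
@dataclass
class MessagePosted:
    version: int
    channel_id: str
    user_id: str
    text: str

def parse_message(payload: dict) -> MessagePosted:
    """Reject payloads that don't match the agreed contract."""
    try:
        return MessagePosted(**payload)  # missing/extra keys -> TypeError
    except TypeError as exc:
        raise ValueError(f"contract violation: {exc}") from None

msg = parse_message(
    {"version": 1, "channel_id": "C1", "user_id": "U1", "text": "hi"}
)
assert msg.channel_id == "C1"
```

The point is less the mechanism than that both sides of a cross-language boundary agree on one schema definition and fail loudly on drift.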


What are some of those approaches? Are there formal methods and/or tools for doing this?


Almost everyone embraced polyglotism and microservices together.


Let me guess, they should rewrite everything in Javascript?


Woe is us if they actually did.


Nah, Excel. /s


Is "cellular architecture" hipsterspeak for microservices?


Not exactly, IMO; it's more about the orchestration of microservices w.r.t. quasi-physical boundaries.


Nice write-up!

    If no new requests from users are arriving in a siloed AZ, internal services in that AZ will naturally quiesce as they have no new work to do.

Not necessarily because, due to some bug, there may be resource-hungry jobs running indefinitely. (Slack's engineers must have considered this; I am just nitpicking this particular part of the text.)


If you replace "because" with "if", your comment makes more sense. "If" there are such bugs, you are right, but such bugs might not exist.


Delighted to be part of this conversation on cell-based architecture. As the author of the cell-based reference architecture https://github.com/wso2/reference-architecture/blob/master/r..., I'm here to share insights on this exciting approach.

Cell-based architecture introduces modular 'cells' into software systems, each with distinct APIs. This design fosters loose coupling and scalability – key for today's dynamic software landscape. Particularly, for those intrigued by microservices, cells align seamlessly with the independent, scalable components that power microservices architectures.

Curious to dive deeper? If you're keen to explore the nitty-gritty technicalities, I invite you to check out the architecture paper https://github.com/wso2/reference-architecture/blob/master/r... for an in-depth understanding. Let's kick-start this dialogue on the potential of cell-based architecture and its impact on modern software design. Feel free to join the conversation!


This seems written by AI? And as such it comes across as not genuine


It's from WSO2, a company selling enterprise middleware. Of course, the paper would read like something you bring to a sales meeting, not a tech talk.


Interesting feedback, I'll see how I can address this perception of a 'sales pitch' during the next revision. The intention was to define a vendor and technology-neutral reference architecture to the community.


No, it is not written by AI :). You can look at the Reference Implementations section of the spec to find who is using the spec https://github.com/wso2/reference-architecture/blob/master/r...


I'm familiar with the cell architecture promoted by the WSO2 papers (and other resources). I like it and I've used it in client projects.

However, this article uses "cell" in a completely different way. It is not the cell-based architecture that you are promoting here without reading the article.


In the CBA paper, a 'Cell' is an architecture construct you can use in the design stage and take through the development and then to the deployment. So, it addresses both application and deployment architecture. Each cell has a boundary, a cell gateway, and components inside the cell.


Thanks for the follow-up and I apologize for my snark. It appears my knowledge was out-of-date and I stand corrected. I'll need to brush up on the current docs and research.


No problem, it's my pleasure.


If each AZ is siloed, then how can different AZs serve the same user/workspace?


their backend being on 2G explains a lot of other stuff about their software


I appreciate the clear explanation of the problem and the solution, which (as is so often the case) seems fairly simple or obvious in retrospect.

Semi-related tangent: sometime around mid-2016, I came across a tool that helped visualize requests in near real-time, and showed what it "looks" like (ie, flow slows to trickle in service A during draining, while it ramps up in service B)... there was a really compelling demo, but I never bookmarked it and can't seem to find it. IIRC its name was a single word. Maybe someone reading this will know what I'm talking about... ?


Vizceral


Neat.

> If a graph of nodes and edges with data about traffic volume is provided, it will render a traffic graph animating the connection volume between nodes.

How would one go about providing such a graph? :)
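From memory of the Vizceral README (so field names may have drifted), the expected input is a JSON traffic graph along these lines, typically regenerated every few seconds from load-balancer or service-mesh metrics:

```json
{
  "renderer": "global",
  "name": "edge",
  "nodes": [
    { "name": "INTERNET", "renderer": "region" },
    { "name": "service-a", "renderer": "region" },
    { "name": "service-b", "renderer": "region" }
  ],
  "connections": [
    { "source": "INTERNET", "target": "service-a",
      "metrics": { "normal": 5000, "danger": 10 } },
    { "source": "INTERNET", "target": "service-b",
      "metrics": { "normal": 200, "danger": 0 } }
  ]
}
```

The "draining" effect in the demo is just this document re-rendered as one connection's volume ramps down while another's ramps up.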


YES! Thank you!! :)

This kind of shared communal knowledge is one of many reasons I'm very grateful for the HN community.


Thanks to the EU, Microsoft Teams replaced Slack. GDPR makes it way too difficult to work with multiple software vendors, so companies usually choose products from the absolute minimum number of vendors (even if there are better options). Also, Slack asks too much money for what it does.


They got themselves into a mess here:

> This turns out to have a lot of complexity lurking within. Slack does not share a common codebase or even runtime; services in the user-facing request path are written in Hack, Go, Java, and C++. This would necessitate a separate implementation in each language.

This sounds crazy. I've seen several products where there is a core stack (e.g. Java) and then surrounding tools, analytics etc in Python, R and others. But why would you create such a mess for your primary user request path?

Sure, they're not "just a chat app" they have video, file sharing etc included and a lot of integrations. But still this sounds like a company that had too much money and too little sense while growing rapidly.


"The right language for each job" was one of the heavy advertising points for microservices. Might still be too some extent, even.


The problem is most engineers don't understand the "job".

They see the job as a strictly technical problem looking for the best technical solution. They don't look up and see how that problem fits into the larger organization.

They think things like "I can make a microservice that encodes PDFs 10x faster by using Rust" and give an estimate based on that, never thinking about how we're going to need to hire 2 more Rust devs to keep that running, and we could have delivered twice as quickly if I had used our default Python stack and now our "10x faster" doesn't matter because that feature is old news.

Microservices are such an unfortunate concept because they attract the people least suited to use them: If your team can't handle a monolith, you shouldn't even be looking up what a microservice is.


It even pains me to see they're suffering from so many own goals. And it's unfortunately reflected in the poor experience using the Slack client. Not to mention the multiple deprecated bot/integration APIs with such bad feature parity between all the different ways to integrate your own tooling into Slack.


What do you mean? Slack is one of the most responsive and reliable tools I touch every day.


How slow are the rest of your tools? The Slack client probably performs worse today than it did a few years ago. It has the laggiest interface of any of my tools, you can watch your CPU spike to 60-80% just switching channels. Just do it right now, open up htop/top/atop/Activity Monitor - whatever you want, and just switch channels. Laugh as the Slack client wastes a universe's worth of time just... rendering a DOM with plain text. It is genuinely pathetic how bad the client is.


I hope this is satire. Slack is one of the slowest work tools I've ever used. Every interaction and click visibly lags.

It's a sad state of the world that almost every application now is written in Javascript and deployed with Electron, and massive memory usage and slow UIs have become accepted as the norm.

Try any IRC client and tell me, with a straight face, that Slack is just as responsive.


Very interesting. I have used Slack for work on a MacBook for years, and I have never once noticed any responsiveness issues or lags at all.


So I only use Firefox and the Slack web client and don't experience any lag. I am surprised so many people use the Slack app over a web tab.


Ok genuine question: what other tools do you use?

Slack won’t seem that bad, until you use something that’s actually good.

The loss of performance on commonplace applications has been a real “boil the frog” situation; we’ve lost so much performance and responsiveness, but it’s happened so gradually that most people don’t notice.


Well, status.slack.com says they're currently having an outage, that has been ongoing for multiple days.


There was a time when this was the case (and electron was the punching bag for critics at the time, iirc) but I don’t think this criticism is fair anymore. Slack is quite responsive and performant these days.


The only way you get Hack on that list of languages is that they had a policy of letting lead engineers starting a project to choose the language at will, and they hired enough lead engineers who previously worked at FB/Meta.


I think that Hack might’ve been on that list earlier than you think. Slack started as a PHP application.


Yeah. If they already had a large php codebase, moving to Hack makes complete sense.


What mess? That sounds like a healthy internal language ecosystem to me. You need at least 2 primary languages to avoid accidental lock-in and maintain good developer diversity. That very paragraph is a great example of how the diversity helped them avoid the trap of plumbing it through their RPCs.


Since when is an "internal language ecosystem" a good idea? Technology in a company like Slack exists to deliver useful features and good performance/stability to users faster than competitors can do it. For an app like theirs it doesn't sound like something that needs several disparate internal platforms that are slowing them down.


How is choosing the right language for a task/team slowing them down?

For large scale, cross cutting initiatives you’ll have some pain. For feature velocity, you’ll see great results. Everything is a trade off.


Standardizing on a single language/toolset is a form of tech debt and has similar up/downsides. Comparable to, for example, how skipping testing will increase your velocity in the short run, but it will hurt it in the long run.


You're suggesting that needing to reimplement the same thing 5 times for every single language in use is a hallmark of a "healthy internal language ecosystem"?


That's the red flag, the thing you are trying to avoid. You don't want to implement things in each language and you always have more than one language even if you standardize (over time). You don't want libraries, you want services. This is why things like Istio are way better than libraries for mesh networking. Using external services for common things keeps you from being locked into a single tech stack and the limitations that entails.


The thing I don't understand about Slack is how the core functionality seems to have continuously degraded since I started using it in ~2015. When I started using it, its core message sending features basically didn't have the issues with delayed messages or failure to send that I had experienced with competitors. Now, I routinely have to reset the app/clear the cache and go through various dances to get files to upload reliably (add the file to a message, wait five or ten seconds, then hit send). It's nice to see these technical write-ups about improving the infrastructure behind Slack, but I'd like to see fewer feature launches and more stability improvements to make the web, desktop and mobile apps feel like reliable software again. (nice to haves would be re-launching the XMPP and IRC bridges)


Not to “works on my machine” you, but I…genuinely do not have these problems. I’ve never heard it from my team either. So we could at the very least say it’s not a widespread global issue.

Even the percentage of nerds that would want IRC or XMPP bridges back would have to be vanishingly small. I’d be annoyed if Slack reimplemented such functionality because it no doubt slows down future development. Slack has a number of mechanics that do not carry across to IRC or XMPP, and they did when they killed the bridges. I’d be annoyed if new features were compromised to increase compatibility with this blatant nerd vanity project.


So, it's workspace and user/device specific: two of the workspaces I interact with regularly have these problems and the problems also show up intermittently for some users and not others. (Anecdotally, my experience is that Matrix/Element used to be annoying compared to the Slack experience and now I mostly prefer it to Slack)

I would be fine with the understanding that the IRC bridge was missing functionality (and it always was). Although threads might make it impossible to implement in a nice way now.

As far as new features go, I don't want any new features in Slack: it worked exactly like I wanted it to seven years ago and the new stuff is nice, but not worth the degradation in user experience.


I’m gonna go to bat for Slack on this one and say the “Later” feature they added recently has completely changed my workflow for the better. It’s so simple, but removes all the cognitive overhead of feeling pressured to deal with specific messages real time, else I forget about them. Now I just throw the message in Later and get back to it when I’m free.


I haven't used Slack in a long time, but isn't this just the normal enshittification cycle that occurs with all Internet products? The founders got a nice exit several years back, I doubt they stuck around at Salesforce for long, so it's natural that the product would deteriorate over time.

Slack IRC bridging in the 2014/2015 era was great. We had a lot of people who spent their whole workday in a terminal window and weren't interested in running a web browser in the background continuously just for a chat room.


> Isn't this just the normal enshittification cycle that occurs with all Internet products?

Yeah, although one can dream that some SaaS company would do things differently


>isn't this just the normal enshittification cycle that occurs with all Internet products?

No! Stop diluting this word.


> No! Stop diluting this word.

Yes, you're right, I'm misusing it.

However, I think that there is a phenomenon that happens to a lot of tech products that is more general than what Doctorow is talking about. There is a certain type of person who is attracted to building a new thing, and there is a different type of person who is attracted to a thing that is already successful. Pioneers and Settlers, as a former colleague of mine described it. In the context of Internet services, pioneers care a lot about attracting users initially so they tend to dwell on every minor detail. Settlers care a lot about stability, so gradual degradation over time (e.g., in performance, in other measures of quality) is tolerable as long as its rate is controllable and well-understood.

I think that Doctorow's thesis is a special case of this where greed is the driving factor behind the gradual erosion of quality.


This is the Cory Doctorow sense of the word, is it not?

(Or, now that I notice your username, maybe you’re making an ironic joke, since complaining about the misuse of the word enshitification is a meme now?)


enshittification of the word enshittification?


They support much much larger workspaces now, and support team to team shared channels, so the problem space is much more complex than 2015.

Not saying they shouldn’t fix their reliability. Every other week it seems like they have an outage with this or that.

The Flickr style commit to production multiple times per day seems to have its limits. Perhaps longer canary and slower rollouts would help.


"For example slack is an incredibly successful product. But it seems like every week I encounter a new bug that makes it completely unusable for me, from taking seconds per character when typing to being completely unable to render messages. (Discord on the other hand has always been reliable and snappy despite, judging by my highly scientific googling, having 1/3rd as many employees. So it's not like chat apps are just intrinsically hard.) And yet slack's technical advice is popular and if I ran across it without having experienced the results myself it would probably seem compelling."

https://www.scattered-thoughts.net/writing/on-bad-advice/


I really wouldn't judge the quality of some company's technical advice based on one person's experience with their UI. For almost any consumer software that gets mentioned here, you will find some people who love it and lots of others with gripes. And for e.g. Slack might have bad product/UI people but very good infra people. Better to look at TFA and judge it on its merits.


The proof is in the pudding, not the recipe blog post.


It isn't one person's experience with the UI. It is everyone's. If you don't think Slack is slow then you have forgotten what "slow" means. It is a chat program. It is incredibly simple. It is not doing anything complicated. We have gigabit internet, CPUs with multi-GHz clocks and high IPC rates, NVMe 4 SSDs that load data from disk almost instantly. It should open in milliseconds, not several seconds. That it ever takes a noticeable amount of time to do anything reveals deep flaws in Slack's engineering culture, because it shows they just don't care about performance at all.

If they had "good infra people" then their program wouldn't sit and spin for seconds, ever.


"cellular architecture"

What? Does Amazon need to push for new sales points, or are they simply making up architectures now?


Cell architecture goes way back, at least 10 years. Tumblr for example.

http://highscalability.com/blog/2012/5/9/cell-architectures....


A big term for a simple design principle indeed.

But their implementation isn’t as grim as what I had initially envisioned when hearing that term. I immediately thought of Smalltalk and the idea of objects sitting next to each other, forming a graph (of no particular structure… just a graph), passing messages to neighbours. Like cells in an organism send hormones and whatnot. That makes for a huge mess that cannot be reasoned about, hence why we instead went with stricter structures like trees for (single) inheritance. That’s much closer to this silo approach, which seems nice and reasonable (although I get the impression considerable complexity was swept under the rug, like global DB consistency; the silos cannot truly be siloed).


Yeah, the idea was present at my bank employer 15 years ago. Drain a DC to do maintenance and load testing. It was called high availability.

This blog is writing about availability zones as if they're a new concept too.


If Amazon does something right, then that's marketing.


ex-AWS here

May be marketing but it is an architecture born out of Amazon's (and AWS's) use of AWS:

- Reliable scalability: How Amazon.com scales in the cloud, https://www.youtube.com/watch?v=QeW9wCB36ck&t=993 (2022)

- How AWS minimizes the blast radius of failures, https://youtu.be/swQbA4zub20 (2018)

For massive enterprise products like Slack that need close to 100% uptime across all their services, cells make sense.


Yeah, that's what microservices were meant to achieve. Suppose the market is saturated with "microservices", so a new term was needed.


Microservices is one reason you need cells. If you haven't, the second talk I linked to might interest you.


Cells, interlinked.


Sounds like it's two birds with one stone.


Why is that an either/or?


Initially read this as: "Slack's Migration to Cellular Automata" and now I'm a little disappointed.


Cellular architecture? They've just rediscovered the art of redundancy systems


Indeed, for 20+ years of distributed data centers (remember AZs are generally separate DCs near a city but on different grids, regions are geographically disparate cities) we called it "shared nothing" architecture pattern.

Here's AWS's 2019 guide for financial services in AWS, where the isolated stack concept is referenced under parallel resiliency section and called "shared nothing":

https://d1.awsstatic.com/Financial%20Services/Resilient%20Ap...


I wish modern architecture writing could go back to being this straightforward:

"There are three dominant themes in building high transaction rate multiprocessor systems, namely shared memory (e.g. Synapse, IBM/AP configurations), shared disk (e.g. VAX/cluster, any multi-ported disk system), and shared nothing (e.g. Tandem, Tolerant). This paper argues that shared nothing is the preferred approach."

https://dsf.berkeley.edu/papers/hpts85-nothing.pdf


But if they call it cellular architecture, it sounds much more exotic than a shared-nothing active/active service!


It's a common pattern in tech. Everything old will be new again.


But kubernetes


To me it seems without the art. The costs will be passed on to the customers. I think there must be good ways to do redundancy without having all services running at full blast in each Availability Zone.

It's a blunt tool, much like PHP. PHP does seem to be a good choice for them, but I wouldn't want to work there. It's all right, there are different ways to do stuff.


Oh hey they now have a new buzzword to sell!


So they used a feature built into a load balancer to gracefully drain traffic from specific availability zones? Odd that a feature found in load balancers from the last 25 years is a blog post worthy thing.


Close but I don't think it's quite 25 years! I added graceful draining to Apache httpd's mod_proxy and mod_proxy_balancer either in 2003 or 2004, and at the time I'm nearly certain it was the first software load balancer to have the feature, and it wasn't available on the hardware load balancers of the time that I had access to ... though I later learned that at least BigIP load balancers had the feature.

At the time, we had healthy debates about whether the feature was useful enough to justify additional complexity, and whether there could be cases where it would backfire. To this day, it's an underused feature. I still regularly run into customers and configurations that cause unnecessary blips to their end-users, so it's nice to see when people dig in and make sure that the next level of networking is working as well as it can.
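The semantics of graceful draining are simple to sketch (this is a toy illustration, not mod_proxy's actual implementation): a draining member receives no new requests, but requests already assigned to it are allowed to finish before it is removed.

```python
import random

class Member:
    def __init__(self, name):
        self.name = name
        self.draining = False
        self.in_flight = 0

class Balancer:
    """Toy balancer: route only to non-draining members, and consider a
    member safe to remove once it is draining with zero in-flight work."""
    def __init__(self, members):
        self.members = members

    def pick(self):
        eligible = [m for m in self.members if not m.draining]
        if not eligible:
            raise RuntimeError("no eligible backends")
        return random.choice(eligible)

    def start(self, member):
        member.in_flight += 1

    def finish(self, member):
        member.in_flight -= 1

    def drain(self, member):
        member.draining = True

    def safe_to_remove(self, member):
        return member.draining and member.in_flight == 0

a, b = Member("a"), Member("b")
lb = Balancer([a, b])

lb.start(a)                      # a request is in flight on member "a"
lb.drain(a)                      # begin draining "a": no new requests land there
assert lb.pick() is b            # all new traffic now goes to "b"
assert not lb.safe_to_remove(a)  # still one in-flight request on "a"
lb.finish(a)
assert lb.safe_to_remove(a)      # "a" can now be removed without a blip
```

The debate the parent mentions is visible even in this sketch: the extra states ("draining", in-flight counts) are exactly the added complexity you trade for not dropping connections mid-flight.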


Microsoft bought Convoy in 1998[0], then incorporated it into NT4 SP6a and Win2k as NLB/WLBS. One of its features was to gracefully remove a server from the cluster after all connections were closed - draining. But a cluster is not the same as an LB.

[0] https://news.microsoft.com/1998/08/24/microsoft-corp-acquire...


I migrated some old BigIP load balancers over to Apache in 2004ish, and extended some of mod_proxy to do some "unholy" things at the time. We also did a lot of direct server return stuff when no load balancer you could buy could handle the amount of traffic statefully. Man, how times have changed, and lessons forgotten.


Well played, HN.


That seems like a shallow dismissal. In a distributed system, making sure that sub requests are handled across distributed nodes within the local AZ, and correctly draining traffic from AZs with partial component service outages, is not as trivial as 'using a feature built in to a load balancer'.


It may be shallow, but architecting for this is not really "advanced, FAANG-only accessible methodology". I'm surprised their services have been as "reliable" as they have been considering such trivial stuff is just now being employed in their architecture.


Half the complaints on here on architecture posts are 'you don't need this kind of stuff unless you're at FAANG scale'. Now we have a write up of something that's accessible to businesses at non-FAAANG scale, and we have the new complaint, that this kind of stuff isn't worthy of FAANG-scale architecture.


Geo traffic distribution, multi regions/AZs with functionality to weight and drain traffic should be used in most SaaS services where a simple failure somewhere could cost users time and lose company money/goodwill. It's not terribly hard nor expensive.
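The weight-and-drain mechanic described above can be sketched in a few lines (cell names here are made up; real systems implement this in DNS or the load balancer): draining a cell is just setting its traffic weight to zero.

```python
import random

# Traffic weights per cell/AZ; draining a cell = setting its weight to 0.
weights = {"use1-az1": 100, "use1-az2": 100, "use1-az3": 100}

def route(weights):
    """Pick a cell with probability proportional to its weight."""
    cells = [c for c, w in weights.items() if w > 0]
    if not cells:
        raise RuntimeError("all cells drained")
    return random.choices(cells, [weights[c] for c in cells])[0]

weights["use1-az2"] = 0          # drain az2 (e.g. a gray failure detected there)
picks = {route(weights) for _ in range(1000)}
assert "use1-az2" not in picks   # no new requests land in the drained cell
```

Gradual shifts (90/10, 50/50, …) fall out of the same mechanism by stepping the weights instead of zeroing them.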


Those are all much looser restrictions than routing traffic consistently to a cell


Route 53 latency based routing -> APIGW or ALB -> Lambda or Step Functions -> DDB Global Table.

No reserved capacity (pay for usage), so it works for bootstrapping startups and provides superior resilience while being extremely simple to set up, with almost zero maintenance or patching (even under the hug of death). I don’t understand settling for less (and taking longer and paying more for it).


How do you think people are going to learn this stuff if not by reading about it from architects who have done it?

This writeup seems like a useful contribution to spreading this knowledge that you think every engineer should, somehow, innately be born with, to those members of the development community who missed out on picking this stuff up in elementary school.


> architecting for this is not really "advanced, FAANG-only accessible methodology"

Sorry - where are you quoting this claim from?


The S in FAANG is for Slack.


My own words, but this is fairly trivial in the context of these massive companies with, presumably, PhDs working on their architecture.


I don't quite know what a PhD-grade architecture would look like, but this seems like a reasonable one to me.


The other bit is separating the service into isolated cells so issues in one don't affect dependent services everywhere like they had experienced before.

But yeah any good SRE could point this out years ago.


Just odd a company worth billions and billions of dollars is just now discovering HA models standard since the 90s. Can expand the Clos network architecture to these distributed service applications too. But judging by Slack's client quality, mature concepts such as those must be new to them.


The linked AWS article specifically explains that it’s not just the typical single load balancer for cross AZ routing. I frankly don’t know where you’re getting that this means that HA is new to them.


Of course this isn't a typical single load balancer for cross AZ - but the general gist of their "new" architecture is first principles level of design. But sure, we can celebrate their minor achievement I guess


Is Slack dead? Unironically. Does it have a future? With Teams etc. coming out, it seems most companies do not want to go the Slack route


If your goal is to monitor your staff and gather metrics on their communication - Teams outdoes Slack and is incomparable. If your goal is to have a platform that enables your employees to communicate with as little friction as possible, I have yet to see anything capable of replacing Slack.

Teams especially, is something I loathe using every day. Everything about the UI and UX gets in the way of what I’m trying to do, rather than assisting in or even enabling it. It’s like it doesn’t want me to communicate - it wants me to react and offer as little useful information as possible.


I went from a company using teams to slack a few years ago. Truly night and day. I have such a visceral hatred for Teams, it actually surprises me how much I can dislike some software that is for messaging. From how it can't copy and paste in and out of chat, to the way it sets laptops on fire, or its horrible ui. I really truly hate that software. Please just use slack or god forbid set up an irc node or something.


Agreed. Teams is already the most painful experience, and it's about to get even worse with the new 2.0 version being deployed.


Teams is doing well because it's often an IT department's simplest choice, but I don't find it's great for users.


The company I work for has a "Hours wasted because Teams sucks" page that gets updated at least weekly.

Eventually the list will grow so large that we could probably attach a 5-figure dollar amount to it, if it hasn't already.


Depending on the size of the company, that value is absolutely insignificant.


Bigcos with robust sales truly can’t afford the organizational-attentional cost of walking across the street to pick up a $10,000 coin.


/s ?


No, that’s really how it is. They leave opportunities to save or make five-figure (and larger) amounts all the time, because it’s not worth the distraction from other activities. And also from straight-up mis-management, but a lot of the time they know exactly what they’re doing, and it’s on purpose, and it’s probably not a mistake.


Why would I choose Slack for my employees when Teams integrates so nicely with everything else in the "stack". Teams is leaps and bounds ahead already, and Slack really lost the boat many years ago.

Speaking of which, I'm going now to buy more Microsoft shares.


I don't think I have ever heard someone favorably compare teams chat with slack before. Even when I worked at a company that used teams for video calls and MS for email and calendar and documents and what not, everyone used slack for chat.

I don't think anyone was sad that slack didn't integrate with the other MS services "stack".


You are choosing teams. What are your employees choosing? In my experience teams is a terrible mess and a company using it would exclude me from working for the company because they very likely don’t give a crap about the day to day experience of the employee.


Maybe because you value your employees being able to copy an image from your chat platform?

(Teams still can't copy images, instead you get a massive base64 block of text iirc)


I don't like that either, but I get it, it breaks the privacy of the sent message as messages have permissions attached to them.

Honestly, bugged me a few times before I just switched to using the snippet tool. I use it all the time anyways and this felt natural.


It's not base64 because of privacy, and using the snipping tool instead of just copying is not natural. Imagine opening a whole other program to copy some text you just highlighted.


> Why would I choose Slack for my employees when Teams integrates so nicely with everything else in the "stack".

Does it really though? In my experience teams has a buggy integration with other things in the stack.

And Teams itself is massively buggy and has been a resource hog for the whole time I've used it.


The boat or the moat?


I started using Slack in 2015 and thought it was a great product.

Since then, they have hit 2 home runs on top of their basic chat functionality:

1. Slack Connect: Being able to share channels between workspaces is simply amazing. Most of our customers are on Slack, and having a Slack connection to them makes it much easier to communicate with them and get their feedback as we improve our product. I don't know any other tool that even comes close to how important Slack Connect has been to product development in my startup.

2. Canvas: They rolled this out last year as notes or something, and I was pretty underwhelmed with the experience at first. But very recently (within the past month I think) they reintroduced this as "canvas" with really tight integration with threads. We have moved all of our planning and synchronization activity to a canvas that we set up every week.

Although these features are not difficult for Google or Microsoft to implement on a purely technical level, their product organizations don't seem to understand the network effects of chat the way that Slack's product organization does.

Slack is certainly not dead today and they are showing the savvy to stay alive well into the future.


My employer buys no Microsoft SaaS service, since we're mostly on Google services, so a stand-alone like Slack works quite well. And nobody uses Google Chat.

And besides that, the UX of Teams is miles behind Slack.


Slack is not good UX in my opinion. It is often hard to see what generated a message notification - so yeah someone called me out, but who? where?. It shows me latest thread as being from last month when I know there have been more recent ones. It doesn't collapse those threads, so 100 reply incident threads dominate that view. Slack doesn't scale well (UX-wise) above say 30 people.


Opinions are valid, for sure. I can tell you that I’m a happy slack user at a company of just over a hundred thousand.

I haven’t regularly used teams in about a year, but I would legitimately consider passing on a job offer where they used it.

In a thread where many folks are talking about using the best tools for a job, teams is never the best tool for any form of digital communication.


Not even GitHub? I believe that's the only MSFT service we have at my <40 people fintech dayjob


Clearly no. Legacy inertia will carry it pretty far, even if literally nobody new tries to sign up for it. Our team is still using Slack and has no plans to migrate away at the moment.


One day Zulip will take over everything. Probably the same year Ubuntu beats Windows as the majority desktop OS.



