To be fair to Slack, at their scale, lots of moving parts might make sense. But I see a lot of companies (including startups with very few customers) going down the microservices route and exposing themselves to this kind of risk when there is no major upside beyond giving engineers lots of toys to play with and slapping "microservices" and related buzzwords on their careers page.
Microservices (like just about anything) can be implemented well or poorly. There's a reason we have sophisticated orchestration solutions like Kubernetes: it exists to tame large-scale deployments with sensible failover processes.
The benefits you get are services that can be scaled independently, deployments that only affect isolated pieces of code, horizontal scaling, Dockerized environments, etc. All of these advantages should exist in well-designed systems, but systems that have been executed hastily will likely have critical problems crop up at some point.
I am not saying that microservices are a problem for Slack (at their scale they can make sense). I am expressing my overall concern about smaller companies going down the same route when their scale or the problem being solved doesn't justify it, so they end up dealing with the (self-inflicted) problems of a massively distributed system with no major upside. I am also expressing my personal opinion of why I feel uneasy working on systems where I don't have a full overview of how they work and their potential failure modes.
When it comes to the benefits of microservices, I am not sure they are all worthwhile given the overhead and extra complexity of developing on a microservices architecture.
In a monolithic application, most data and functions you might need are just a function call away, and you typically have one or a handful of databases to interact with, often abstracted away by an ORM. In a microservices architecture, you suddenly need to worry about serialization, authentication, and communication between services (and their failure modes), and you might require coordinated changes across several services, each of which might use a different language/framework and deployment process.
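To make that concrete, here's a rough sketch (Python; the service endpoint and field names are made up, and a dict stands in for the ORM) of everything the second version has to care about that the first one doesn't:

```python
import json
import time
import urllib.request

USERS = {42: {"id": 42, "name": "alice"}}  # stand-in for an ORM-backed table

def get_user_monolith(user_id):
    # Monolith: the data is one in-process call away.
    return USERS[user_id]

def get_user_microservice(user_id, token, retries=3):
    # Microservice: the same lookup now involves auth, serialization,
    # timeouts, retries, and a remote failure mode for each of them.
    req = urllib.request.Request(
        f"http://user-service.internal/users/{user_id}",  # hypothetical endpoint
        headers={"Authorization": f"Bearer {token}"},
    )
    for attempt in range(retries):
        try:
            with urllib.request.urlopen(req, timeout=2.0) as resp:
                return json.loads(resp.read())  # deserialize the wire format
        except OSError:  # URLError and socket timeouts are both OSErrors
            time.sleep(2 ** attempt)  # back off and retry
    raise RuntimeError(f"user-service unreachable after {retries} attempts")
```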
In terms of getting started, it has always been easier for me to work on a monolith, where the codebase makes up for bad or missing documentation: my IDE can resolve the majority of the symbols and lets me see where the data I need lives and where it's being used. In a microservices architecture, all of that goes out the window. You need to do a lot more manual "discovery" work, searching through the documentation (if there is documentation, which is not a given) and manually figuring out the RPC calls, because IDEs typically can't resolve cross-service communications.
Running a monolithic application locally is a lot easier than a microservices architecture. For the former, you can typically get away with just a database and a cache server, all running natively. The latter pretty much imposes a container-based stack where you are now running ten databases, caches, reverse proxies, and everything involved in service discovery, which adds yet another layer of abstraction and makes you spend more time on this useless plumbing than on actually getting work done and delivering business value.
In a monolithic architecture, the devs have to deal with the program as a whole, so if something doesn't work, it's their problem. Whereas in a microservice architecture, it can be easy to spin up a service and not know the systems that integrate with it.
The problem here is with documentation and understanding of architecture. It's just the nature of the beast that the monolith dev knows how things communicate with the monolithic program, because he needs to know in order to do his job. In this instance the problem isn't with microservices, it's with the execution. And that execution is a very easy trap to fall into with microservices.
This exact bug could have bitten a large monolithic app running on a VM.
The legacy system supporting Slack in production was heavily resource-constrained as they were moving to a fancy new system. Slack admits here that the legacy system likely wasn't getting the attention it needed, and lo and behold, it started failing in mysterious ways.
This was an organizational failure: they didn't properly calculate the risks of rotating out part of their load-balancing system. They probably should've asked for more budget to keep the existing system functional as they slowly transitioned to the new one.
They admit that COVID caused all their systems to become stressed; they had probably budgeted appropriately for the transition to Envoy when they asked management (probably pre-COVID). The team likely was never meant to support both the load they're now seeing during COVID and the transition to a new system.
Either way during any transition, there's a period where you must support both systems at full capacity until the legacy system can be gracefully decommissioned.
Postmortems should start with a summary paragraph like the above, and then go into story and full details below.
1. Fill out a change request (CR) form and print it.
2. Have it signed by your manager and the managers of every system it touched, including business owners.
3. Attend the twice-weekly meeting and explain your CR: what was happening, why, who authorized it, what to do if it failed, and what to do if it initially worked but failed later (e.g., on a weekend).
4. Hope your CR passes the vote.
5. Implement your roll-out plan.
This is a robust process. Where it breaks down--I felt--is when you need to fix a typo on the public-facing website that's managed by a CMS.
0 - https://www.eesi.org/files/070913_Jay_Caspary.pdf
Look, I'm sorry, we've been over this. It's the design of our back-end. First there's this thing called the Bingo service. See, Bingo knows everyone's name-o, so we get the user's ID out of there. And from Bingo, we can call Papaya and MBS (Magic Baby Service) to get that user ID and turn it into a user session token. We can validate those with LNMOP. And then once we have that we can finally pull the user's info down from Raccoon.
I revisit this video every now and then.
"You think you know what it takes to tell the user it's their birthday?!"
This guy engineerings.
xDS lets you give your frontend proxy a complete view of your whole system -- where the other proxies are (region/AZ/machine) and where the backends are, and how many of those. It can then make very good load balancing decisions -- preferring backends in the AZ that the frontend proxy is in, but intelligently spilling over if some other AZ is missing a frontend proxy or has fewer backends. And it emits metrics for each load balancing decision, so you can detect problems or unexpected balancing decisions before it results in an outage. https://www.envoyproxy.io/docs/envoy/latest/intro/arch_overv...
I also like the other features that Envoy has -- it can start distributed traces, it gives every request a unique ID so you can correlate application and frontend proxy logs, it has a ton of counters/metrics for everything it does, and it can pick apart HTTP to balance requests (rather than TCP connections) between backends. It can also retry failed requests, so that users don't see transient errors (especially during rollouts). And its retry logic is smart: if your requests are failing because a shared backend is down (i.e. your database blew up), it breaks the circuit for a period of time and lets your app potentially recover.
The result is a good experience for end users sending you traffic, and extreme visibility into every incoming request. Mysterious problems become easy to debug just by looking at a dashboard, or perhaps by clicking into your tracing UI in the worst case.
The disadvantage is that it doesn't really support any service discovery other than DNS out of the box. I had to write github.com/jrockway/ekglue to use Kubernetes service discovery to map services to Envoy's "clusters" (upstreams/backends), but I'm glad I did because it works beautifully. Envoy can take advantage of everything that Kubernetes knows about the service, which results in less config to write and a more robust application. (For example, it knows about backends that Kubernetes considers unready -- if all your backends are unready, Envoy will "panic" and try sending them traffic anyway. This can result in less downtime if your readiness check is broken or there's a lot of churn during a big rollout.)
Btw, not sure if you read till the end: they are actually in the process of migrating to Envoy.
So yes, it is :)
However, as far as I can read it, they have somewhat different views on the root cause?
"Soon, it became clear we had stale HAProxy configuration files, as a result of linting errors preventing re-rendering of the configuration."
"The program which synced the host list generated by consul template with the HAProxy server state had a bug. It always attempted to find a slot for new webapp instances before it freed slots taken up by old webapp instances that were no longer running. This program began to fail and exit early because it was unable to find any empty slots, meaning that the running HAProxy instances weren’t getting their state updated. As the day passed and the webapp autoscaling group scaled up and down, the list of backends in the HAProxy state became more and more stale."
Maybe a combination of the two?
They have a tool listening for applications starting and shutting down. It adjusts the configuration live while running: it removes shut-down instances (freeing a slot) and puts in newer instances (finding a free slot and reconfiguring).
From the explanation on that day, there were more instances than usual due to high load. It seems the tool was looking for a free slot at some point and found none and crashed.
I'd say it's an issue with capacity planning, because they didn't plan enough slots for their infra under high load, and an issue with the tool, because it shouldn't fail silently when out of slots.
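Here's a minimal sketch of that failure mode (illustrative Python, not Slack's actual code): the buggy ordering allocates before it frees, so once every slot is occupied -- many of them by dead hosts -- it bails out early and HAProxy's state goes stale.

```python
def sync_buggy(slots, live_hosts):
    # Bug: look for an empty slot for new hosts BEFORE freeing the
    # slots still held by hosts that no longer exist.
    for host in live_hosts:
        if host not in slots.values():
            free = next((s for s, h in slots.items() if h is None), None)
            if free is None:
                # Early exit: HAProxy keeps serving the stale host list.
                raise RuntimeError("no empty slot")
            slots[free] = host
    for s, h in slots.items():
        if h is not None and h not in live_hosts:
            slots[s] = None  # never reached once the early exit fires

def sync_fixed(slots, live_hosts):
    # Fix: free slots held by dead hosts first, then seat the new ones.
    # (Assumes capacity >= len(live_hosts).)
    for s, h in slots.items():
        if h is not None and h not in live_hosts:
            slots[s] = None
    for host in live_hosts:
        if host not in slots.values():
            slots[next(s for s, h in slots.items() if h is None)] = host

slots = {i: f"old-{i}" for i in range(4)}  # every slot held by a dead instance
# sync_buggy(dict(slots), {"new-0"})      # -> RuntimeError: no empty slot
sync_fixed(slots, {"new-0"})              # frees old-*, then seats new-0
```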
Eventually, seemingly sane assumptions become anachronistic laughing points. (Even if they're apocryphal...)
(Also, maybe you should look up the definition of "apocryphal"? I know he never said it, and strongly alluded to that, and didn't attribute it to Bill for that reason...)
It sounds like an issue with naming the failure pattern rather than understanding it. The root cause was equivalent to a memory leak in their custom auto scaling process; machine instances were not being freed (an “instance leak”). The fixed resource limit was self-induced by a hard-coded ratio of proxy servers to web servers.
Historically, the fixed ratio never reached a point where the “instance leak” caused failures, but on one specific “Terrible, Horrible, No-Good, Very Bad Day” it failed badly.
1. Stale configs led to an overabundance of web apps, and then
2. Old instances of the web app couldn’t be removed because of the consul-template bug.
so, yes, a combination (in sequence) of the two.
Hard for me to be sure because I’m by no means knowledgeable on this stuff.
- slots full
- to update slots with a new host you need an empty slot
- hosts went away but updating config was impossible -> errors because config referenced non-existing hosts
- monitoring was broken, so we didn't learn about it until it was too late
I truly don’t understand the cycle of churn. Once most edge cases and bugs of HAProxy have been found, the right decision is not migrating to completely unknown territory again. No project is a silver bullet, and changing stacks after you find the bugs makes for a terrible return on your bug-hunting investment.
But it also sounds like Envoy and HAProxy have fundamentally different approaches to service discovery:
If I were brought in as a consultant on this, my first question would be: why are you using a fleet of HAProxies instead of the ALB? I'm not saying that's a bad choice, but I'd want to know why that choice was made.
The second question I would ask is what kind of Chaos Engineering they are doing. Are they doing major traffic failover tests? Rapid upscaling tests? Random terminations?
Those are probably the first two things I'd want to solve.
That said though, it does do its job for the most part.
> The program which synced the host list generated by consul template with the HAProxy server state had a bug. It always attempted to find a slot for new webapp instances before it freed slots taken up by old webapp instances that were no longer running. This program began to fail and exit early because it was unable to find any empty slots, meaning that the running HAProxy instances weren’t getting their state updated
The broken monitoring hadn’t been noticed partly because this system ‘just worked’ for a long time, and didn’t require any change.
Do them regularly, and keep your playbook of mock failure scenarios up to date with good coverage of all your systems. It's especially critical for disaster recovery (a DR plan that's never tested isn't worth the paper it's written on).
Consider going one step further and randomly injecting artificial failures into production shards, so handling them becomes a regular affair. When you build out monitoring, that might be a good time to think about the kind of stimulus that would be effective at exercising those monitors/metrics/alerts (think unit testing analogy). You can automate an evil adversary bot, or have humans do Red team / Blue team challenges. Yes, you're limited how much you can break without too severely impacting production, but if you've designed enough redundancy you should be able to achieve reasonable coverage. The more you can engineer simulated failures into your regular workflow, the less of a big deal it'll be when real ones occur.
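For instance, an adversary bot could be as simple as this sketch (Python; the shard names and fault actions are placeholders for whatever tooling you'd really use):

```python
import random
import time

PRODUCTION_SHARDS = ["shard-1", "shard-2", "shard-3"]  # illustrative
BLAST_RADIUS = 1  # never break more shards than your redundancy can absorb

# Each action would wrap real tooling (kill a process, add latency,
# fill a disk, block a dependency) in your environment.
FAULTS = {
    "kill_worker": lambda shard: print(f"killing a worker on {shard}"),
    "add_latency": lambda shard: print(f"adding 500ms latency to {shard}"),
    "fill_disk":   lambda shard: print(f"filling /tmp on {shard}"),
}

def adversary_round():
    fault = random.choice(list(FAULTS))
    for shard in random.sample(PRODUCTION_SHARDS, BLAST_RADIUS):
        FAULTS[fault](shard)
    # After injecting, verify the monitors you expected actually fired;
    # a fault that no alert notices is itself a finding.

if __name__ == "__main__":
    while True:
        adversary_round()
        time.sleep(3600)  # one fault per hour, ideally during business hours
```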
Instead of watching the watchers, engage them with stimulus that keeps them sharp and prevents them from getting bored.
All those man-hours cost money. Is it worth it? Depends on how much downtime costs you.
That caused quite a few problems... How much is it worth to them after the fact? I'd say a fair bit.
If you are running a communications platform like Slack or Gmail, it's hella worth it.
Sure, the chaos monkey could kill haproxy-server-state-management, but that wouldn't uncover the bug in question — it'd just demonstrate that without it running, HAProxy's view of the world goes stale, which anyone would expect. Triggering the bug would require reducing the number of HAProxy slots below the number of webapps running for many hours. This is clearly something chaos engineering could do, but IMO it's highly unlikely anyone would think to do this. If they thought of this, they would also have thought about adding tests that caught such an issue long before the code went into production.
In my experience chaos engineering is often only as good as the amount of thought put into the things it does. Killing processes here and there can be useful but it often won't expose the kind of when-the-stars-align issues that take down infrastructures.
It looks like a classic lack of monitoring, as the article says. Alerting on webapps > slots, early exits, or differing views of the number of webapps up would have likely caught this.
No it won't. But it would uncover their missing alerts for a critical platform component. Their issue was exacerbated by the fact that state-management kept failing for nearly 12 hours and no one noticed.
I'm not totally against chaos testing. I just haven't seen it done well and think it's actually pretty hard to pull off (particularly the non-technical aspect of convincing people it's okay to let this thing go mad). I'd love to see how effective it was within Netflix.
...or just "availability testing" or simply "testing".
Yet, testing systemic failure modes at scale is way more tricky than shutting down some VMs or some network devices.
For example: saturating the uplink bandwidth on a whole datacenter.
It seems like it would take a superhuman level of foresight to catch this scenario for chaos engineering.
Long story. Incompetent (and dishonest) IT person. In fact, I was attacked for “harassing” the IT person.
Nothing was done about it, until the HR DB got borked, and there was no backup.
DR and backup are profoundly unpopular topics. They tend to be expensive, and difficult to test. They also presuppose a Very Bad Thing happening, which no one wants to think about.
I have scars.
I now have multiple layers of backup, and keep an eye on them. I can’t always test every aspect, and have to have faith in cloud providers, but I make sure that, even if we have a tornado, I would still be able to recover the important bits.
Backups always work; it's the restores that have problems.
We don't have any personal information in our production database, but even if we did, as long as the QA is thoroughly prevented from interacting with the outside world, it can't hurt to use production data right?
Many more people who should not have access to that data, in bulk or otherwise, suddenly have access, which does an end run around all your (hopefully) carefully crafted security.
Keep in mind that what you think isn't privacy sensitive may very well be radioactive when combined with other data.
Using it directly is full of dangers, not only of leaking information, but also of corrupting your backups. (Otherwise, why are you testing anything?) And deriving test data from production looks like a good thing to me, but make sure to restrict access to the test environment and mask your data.
If you really must use real data make sure that (1) you have explicit direction to do this from those higher up in the food chain so that if there ever is a breach of your test systems you don't end up holding the bag and (2) anonymize a copy of the data before it leaves the production system. That way you minimize the risk. But better: simply don't do it, generate your test data.
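If you do go the anonymize-a-copy route from (2), a minimal masking sketch might look like this (Python; the field names and salt handling are illustrative). Stable pseudonyms keep joins between tables working while raw PII never leaves production:

```python
import hashlib

SALT = b"keep-this-inside-production"  # illustrative; manage as a real secret

def pseudonymize(value):
    # Same input -> same fake value, so references between tables still line up.
    return "u_" + hashlib.sha256(SALT + value.encode()).hexdigest()[:12]

def mask_row(row):
    masked = dict(row)
    masked["email"] = pseudonymize(row["email"]) + "@example.invalid"
    masked["name"] = pseudonymize(row["name"])
    return masked

print(mask_row({"id": 1, "email": "alice@corp.example", "name": "Alice"}))
# {'id': 1, 'email': 'u_...@example.invalid', 'name': 'u_...'}
```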
In that process you would go, "Ok, the service is failed and I don't know why. How does the monitoring look?", and then you'd notice it was broken. This is best performed by the person on the team with the least experience, so they ask the most questions, which reveals more.
What's really funny is when the recovery you expected to work doesn't work, and then you have bigger problems... But at least you planned for it :)
For, let's say, a B2B SaaS
Do they notify the customers first? Like, "we'll sabotage our, well, your, servers a little bit this weekend, to find out if they fail and shut down completely and cannot start again"?
It's up to the business to define SLAs and SLOs that the customer will be satisfied with.
I'll do that (some day), thanks
In the past I did both: always emit a 0 datapoint every period and treat missing datapoints as breaching, to discover if an application wasn't consistently emitting metrics. In addition, a lower-severity Insufficient Data alert was used to discover/validate when a metric stream literally didn't exist (normally through misconfiguration of metric and alarm dimensions).
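The emitting side is tiny. A sketch using boto3/CloudWatch (since "Insufficient Data" is CloudWatch terminology; the namespace and metric name are made up) -- the matching alarm would set TreatMissingData to "breaching":

```python
import time
import boto3

cloudwatch = boto3.client("cloudwatch")

def emit_heartbeat(errors_this_period):
    # Always emit a datapoint, even when the value is 0, so a *missing*
    # datapoint unambiguously means "the emitter itself is broken".
    cloudwatch.put_metric_data(
        Namespace="MyApp",                     # hypothetical namespace
        MetricData=[{
            "MetricName": "ConfigSyncErrors",  # hypothetical metric
            "Value": float(errors_this_period),
            "Unit": "Count",
        }],
    )

while True:
    emit_heartbeat(0)  # 0 on healthy periods; never skip a period
    time.sleep(60)
```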
In this case, one check could be for max age of actually deployed load balancer config, the other could be for new webapp instances getting traffic.
It's unlikely that both monitoring checks fail to trigger, and if only one of them triggers, one should investigate why the other didn't trigger.
Of course, that requires some decent amount of awareness from the on-call engineer (and/or an explicit step for that in the playbook); far from perfect.
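You could even encode that cross-check so it doesn't rely purely on on-call awareness. A toy sketch (hypothetical check names):

```python
def evaluate_pair(name_a, firing_a, name_b, firing_b):
    # Both checks watch the same underlying failure via independent paths.
    if firing_a and firing_b:
        return f"page: {name_a} and {name_b} both firing"
    if firing_a != firing_b:
        quiet = name_b if firing_a else name_a
        return f"ticket: {quiet} stayed quiet while its sibling fired"
    return None  # all quiet

print(evaluate_pair("lb_config_age_high", True,
                    "new_instances_get_no_traffic", False))
# ticket: new_instances_get_no_traffic stayed quiet while its sibling fired
```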
You have to randomly simulate failure cases of all shapes and sizes to be prepared for failure cases of all shapes and sizes.
You don't know if your plan will survive contact until it's survived contact.
You would need multiple watchers who watch one another. Of course, if all watchers die at the same time, you're out of luck.
This is only part of the solution though, because in order to write and test effective alerting rules, you need to repeatedly and frequently test them with either real or simulated failures, and it can get expensive.
I would have used some heartbeat service, but that's not feasible in our environment.
You would set up an integration test to trigger the state that results in this alert.
I was surprised at how they dismissed HAProxy's integration with Consul via SRV DNS records. Can anyone confirm the problems they highlight?
It seems like their service that broke would not be needed if they went the DNS route.
Pre-k8s and before SRV support, we used consul-template in prod as well, but it always scared me; it seemed like too many moving pieces for what should've been a simple system.
Because DNS records come back in random order for each response, those truncated DNS responses caused the backend slots to constantly rotate between different pod instances. HAProxy was graceful about the rotations, but it showed up as suddenly very strange latency/perf numbers when a backend was scaled up to, say, 10 instances from the normal 3.
There were a few bugs in the first implementations; it should be good now. Slack doesn't mention their scale, but I can imagine some UDP/DNS edge cases if there are hundreds of instances behind one domain.
Self-hosting is not a panacea, and I would not expect it to be more reliable.
There's also limited reason to want your servers in your city (5ms of latency savings isn't it), versus the economies of scale of large datacentres with good network uplinks and centralized reliability teams.
An autoscaler that keeps chugging when deploys aren't green is outright dangerous.
Deployment broke. Yeah, that probably should have been caught. Even if it wasn't, monitoring should have caught the stale load balancer config. It didn't, for some reason unknown to us.
These things happen. Things break. The autoscaler then proceeded to kill customer traffic. That was the part that worked as designed, so another design would have avoided escalating the situation (if you forgive some armchair engineering here).
Sorry, I’m not very familiar with the terminology here; what is the “main webapp tier”?
If you choose to go down the slots road, then you need alerts and discovery for reaching slot limits -- which means monitoring and tracking slot usage, then setting up alerts on it.
Another limitation is that the ELB only works for AWS-to-AWS instances in the same location. Gotta use something else for geographic load balancing and for other datacenters.
In the browser? Or a mobile app?
Do they send 1 API req to server 1, then 1 to server 2, and so on? What about session cookies that may be tied to a specific server?
Now this approach has drawbacks (DNS responses are cached, and the DNS record picked initially by the client will typically be cached until the app/browser is restarted) but if they are acceptable to you then it's an easy, proven solution.
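The client side of that approach is roughly this sketch (Python; a browser does something similar internally, modulo caching): resolve every address for the name, then try them in turn, treating a dead server as "move on to the next one":

```python
import socket

def pick_working_server(host, port=443, timeout=2.0):
    # Gather all A/AAAA records returned for the name.
    addrs = {info[4][0] for info in
             socket.getaddrinfo(host, port, proto=socket.IPPROTO_TCP)}
    last_error = None
    for addr in addrs:
        try:
            conn = socket.create_connection((addr, port), timeout=timeout)
            conn.close()
            return addr  # first address that accepts connections wins
        except OSError as err:
            last_error = err  # dead/unreachable server: try the next one
    raise last_error

# e.g. pick_working_server("example.com") -> one reachable IP for the name
```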
And then the client picks one, and if that server is offline, picks another
Seems as simple as DNS based? And works with broken server(s)
Unless there's a way to get all IP addresses in JS? Via custom client code that queries the DNS system?
If the DNS server is down the website wouldn't load at all, but this is an acceptable trade-off considering DNS is a very simple system (not many things can go wrong) and servers can be redundant.
Thanks anyway for replying
I haven't used Amazon's ALB, but the legacy ELB can't speak ALPN. That means, if you use their load balancer to terminate TLS, you can't use HTTP/2. Their automatic certificate renewal silently broke for us as well, whereas using cert-manager to renew Let's Encrypt certificates continues to work perfectly wherever I use it. (At the very least, cert-manager produces a lot of logs and Envoy produces a lot of metrics, so when something does break, you know what to fix. With the ELB, we had to pray to the Amazon gods that someone would fix our stuff on Saturday morning when we noticed the breakage. They did! But I don't like the dependency.)
I have also used Amazon's Network Load Balancer with EKS. It interacts very weirdly. The IP address that the load balancer takes on changes with the pods that back the service. The way the changes happen is that the NLB updates a DNS record with a 5 minute TTL. So you have a worst case rollout latency of 5 minutes, and there is no mechanism in Kubernetes to keep the old pods alive until all cached DNS records have expired. The result is, by default, 5 minutes of downtime every time you update a deployment. Less than ideal! For that reason, I stuck with ELB pointing to Envoy that terminated TLS and did all complicated routing.
The ALB wouldn't have these problems. It's just a HTTP/2 server that you can configure using their proprietary and untestable language. It has some weak integration with the Kubernetes Ingress type, so in the simplest of simple cases you can avoid their configuration and use a generic thing. But Ingress misses a lot of things that you want to do with HTTP, so in my opinion it causes more problems than it solves. (The integration is weak too. You can serve your domain on Route 53, but if you add an Ingress rule for "foo.example.com", it's not going to create the DNS record for you. It's very minimum-viable-product. You will be writing custom code on top of it, or be doing a lot of manual work. All in all, it's going to scale poorly to a large organization unless you write a tool to manage it, in which case you might as well write a tool to configure Envoy or whatever.)
In general, I am exceedingly disappointed by Layer 3 load balancers. For someone that only serves HTTPS, it is completely pointless. You should be able to tell browsers, via DNS, where all of your backends are and what algorithm they should use to select one (if 503, try another one, if connect fails, try another one, etc.) But... browsers can't do that, so you have to pretend that you only have one IP address and make that IP address highly available. Google does very well with their Maglev-based VIPs. Amazon is much less impressive, with one IP address per AZ and a hope and a prayer that the browser does the right thing when one AZ blows up. Since AZs rarely blow up, you'll never really know what happens when it does. (Chrome handles it OK.)
For instance: https://github.com/kubernetes-sigs/aws-alb-ingress-controlle...
Also, and you probably already know about this, but while it's true that Ingress won't create the record automatically for you, external-dns ( https://github.com/kubernetes-sigs/external-dns ) will: with the correct annotations (pretty simple), external-dns will watch for changes to Ingress and publish the DNS records on R53 (and many other DNS providers) for you. It works really well for us, even when the subdomain is shared with other infrastructure not managed by external-dns.
What's with the "terrible, horrible, no-good, very bad" expression I see a lot? Is it a reference to something? From googling, it seems to be this, but... why? Why do people reference it?
Usually you reference some work like this because a) the phrase is unusually creative, or b) the work is unusually memorable. Neither is true here.
How much more successful do you need it to be? Odds seem very high that a kid growing up in the past 50 years was exposed to this story and phrase.
I agree it's a lame title.
Don't be a sourpuss because you were exposed to a new meme; that's how echo chambers form.
It also looks like they do blue-green deployments but don't confirm the replacements are live before considering the green the new blue.
- The ideal fit for any hosting PaaS is a company who has a large hosting and infra footprint but for whom the technology is _not_ the core competency of the business. Slack is very much better off running their own systems with their own people.
- As someone who deals with customers every day I can tell you that yes - we know our platform specifically and how the internet works generally better than almost 100% of our customers, but we do not know _your application_ at all.
- Many problems in The Cloud are the result of application developers not understanding that there are performance differences between localhost and The Cloud, specifically around IO, or that there are performance differences between the Cloud at 1x and The Cloud at 10x with everything. Systems can run smoothly and applications can still kill themselves because of the interaction between the two.
The idea of requests going to an application server on a TCP or HTTP connection with the application server being able to access a database or datastore is now common enough for all the PaaS providers to have abstractions for it, with the load balancing and database/data stores being managed. If customers aren’t happy with the auto scaling logic that can be overridden, but it seems like everything else is pretty rock solid.
Slack’s USP isn’t to reinvent load balancing, distributed configuration or database management, so if I was running it and had a clean slate a PaaS seems like a better bet.
You're missing the point. A PaaS, or serverless service doesn't need to know your application. That's the whole point. They're just API calls, and they need to succeed with consistently low latency.
That's (a) just one objective and (b) too vaguely stated.
What's an acceptable p95 latency for a FaaS? For a database write? For a message queue? The answer is a very big "it depends". And why are we only talking about latency?
There are tradeoffs literally everywhere you look in this space, and not knowing enough about the application's performance, reliability, and efficiency goals can sometimes be a real hindrance to being good at running that application.
Until something falls over and you haven't deployed any new code and all the vendor's systems are green. Part of the reason you went with a PaaS in the first place is that you didn't want to manage the infra yourself, you just want to ship application code.
This has worked great so far and so you've stopped thinking about the infrastructure at all. Disk and network I/O still exist though, you've just been incentivized to stop thinking about them and so all you know is that the vendor sucks, when in reality it's the application's fault.
Somebody's gotta roll up their sleeves and see what's going on in there though, so hopefully your PaaS vendor is as cool as we are :)
More broadly, almost any application these days is "just API calls". The issue is which API calls, how many, with what frequency and where. That essentially is the application, and a PaaS employee doesn't magically know that stuff instinctively.
I am not saying serverless is expensive, all I am advocating is extensive planning and preparing before adopting any particular serverless solution. Once you give the green light to adopt a specific cloud solution, you tie yourself to that cloud, and that can turn out to be a bad idea in the long run.
Lambda is a great tool when used right. It can take big workloads without costing too much, but in itself it does very little. It still needs to integrate with something to be triggered, and if you go with AWS API Gateway, that can cost you a pretty penny. A load balancer also incurs costs that are difficult to predict and is not as flexible as other load balancing tools, so sometimes you might still need to provision your own service discovery and load distribution layers, just like they had to do at Slack.
Serverless is nice: it solves a lot of issues and gives smaller teams a shot they otherwise wouldn't have if they had to manage everything by themselves. But cloud costs can be opaque, cloud implementations can be very complex, cloud solutions can be too rigid sometimes, and tying your product to one vendor can be detrimental in the long run.
If you need reliable low latency -- which is probably core to the slack experience, or any GUI in a competitive space, really -- lambda is not a good option. They just don't share your priorities. AWS support will happily waste a lot of your time chasing Just One More Trick to mitigate the problem, though.
Being locked in, with slightly higher costs than you wanted, doesn't matter if you've failed due to poor priorities.
Become successful, then worry about removing your lock-in if you actually need it (and you probably won't).
Context: I work at a startup that benefits enormously by avoiding AWS/GCP (for most cases) and renting cheap dedicated servers. It is context-dependent; our exact business doesn't benefit much from managed services and really needs big servers.
Your service will likely be more reliable if you use DynamoDB or AuroraDB. Your service will probably be more reliable if you build it in a way that assumes nodes will die at any point, will automatically come back in, and can scale up/down. It'll likely be more reliable if you use SQS rather than your own message bus (and let's be honest, it'll probably be cheaper too).
Yes, you should always evaluate the costs, but reliability and the amount of time you're going to spend on maintenance somehow always get left out of these evaluations.
An AuroraDB db.r5.xlarge with 10TB of storage, reserved instances 1Y term but no up-front, costs 1,301.40 USD per month.
Take a Hetzner AX161 with 4x3.84 TB SATA SSD, using RAIDZ for 11.52TB usable storage (and 4 times the RAM), at €297.00 per month... so 335.88 USD per month.
That's a difference of 965 $/mo = 11,580 $/yr. If you have 10 of these, they'll pay for a full-time sysadmin. Now, that's leaving out a lot of details (bandwidth costs and application hosting come to mind), and assumes truly massive databases. On the other hand, as that sysadmin, I promise our databases don't take anything like my full attention, and you really should have some sort of sysadmin/ops team anyway (please do not make devs run your AWS infrastructure; it will end in tears for everyone). Every time this argument comes up, people do mention reliability and time spent on maintenance, but... it's really not bad. Hardware doesn't actually fail that much, Postgres isn't that complicated to configure, OS patches aren't that hard to apply. Your mileage will vary, but sometimes it's just not worth using AWS. (And sometimes it really is; if we didn't need to run oversized databases, I'd push us to use AWS in a heartbeat.)
At my company we made the decision to stick primarily with managed dedicated servers over AWS in our very early days. Now we're a decent size a few years later (25 employees) and the cost savings we're realizing are tremendous. We did the math and found that if we had gone with AWS in the early days then we would now conservatively be paying an extra $165,000 on our hosting bill annually.
We still use AWS for some specialized services (e.g., Lex) but the bulk of our stack runs on gear that we now colocate for a fraction of the cost.
If your time is free, and you don't actually need anything resembling high availability for the data in the database, then that's a good price comparison. I'm not arguing that managed databases makes sense for everybody, but if you're doing a price comparison then at least factor in multi-site redundancy for the data?
That's true and fair, although in both directions; skimming the docs, it looks like Aurora prices include 2 replicas? But backups aren't free (to store), bandwidth isn't free, and IOPS aren't free. Also, my difficulty in figuring out a fair pricing comparison highlights another point: a dedicated server has a fixed price. Other than more servers for more instances/replicas, you're never going to pay more, and even then it's a simple "adding another replica will increase our costs to X*(N+1) per month", not a "scaling out will add X to our costs, but if we use more I/O than expected we'll add Y to our costs, and exporting data will cost Z in bandwidth".
Again, everything must be planned beforehand. Savings could be marginal, but they could also be significant. In big enterprises, where billing is north of a couple of million dollars, every percent you can save is justifiable.
This is based on faith: there might or might not be a specialized 24/7 devops team who runs these things better than you.
My rational mind has trouble accepting things based on faith, which is also why I don't trust RDS: I don't know of any way to run a distributed SQL database without data loss (neither does Jepsen), so why would I expect RDS to do this correctly?
Using those services does provide a warm and fuzzy feeling, though.
I'm not convinced Amazon's team is immune from the sort of complex failure mode described here. I'll bet there are people with equivalent stories about edge cases in service interactions (either their own set of Lambda services or the AWS ones behind them, or more likely both) leading to a similar unexpected failure cascade.
You're being way too kind. Not only is AWS not immune, their autoscalers are often absurdly primitive. Like, hourly cron job doubling / halving within narrow safety rails primitive, where it's not merely possible to find a load that trips it up, it's all but inevitable.
This varies by service, but they always project an image of their infrastructure being rather smart, and in the cases where I've been able to make an informed guess about what's actually going on, it's usually wildly inconsistent with the marketing. They don't warn you about the stinkers and even on services with good autoscaling and no true incompatibility between AWS's hidden choices and your needs, your scaling journey will involve periodic downtime as you trip over hidden built in limits and have to beg support to raise them. Sometimes you get curiously high resistance to this, leading to the impression that these aren't so much "safety rails" as hardcoded choices.
Oh, and just last week we managed to completely wedge a service. The combination of a low limit on in-flight processes, two hung processes, immutability on running processes, and delete functionality being predicated on proper termination led to a situation where an AWS service became completely unusable for days while we begged support to log in and clear the hung process. Naturally, this isn't going to count as downtime on any reliability charts, even though it's a known problem and definitely looked a lot like downtime on our end.
We're a small (<10) team with modest needs. AWS lets us do some crazy awesome things, but it really bugs me how reliably they over-promise and under-deliver.
Yep. Very much so. Mostly because I don't have enough personal Lambda-specific war stories to feel confident badmouthing it in the context of this discussion thread. But the bits of AWS I do use are certainly not all rainbows and roses...
I have one app/platform I run that basically sits at a few requests an hour for 11 months of the year, then ramps up to well over 100,000 requests a minute between 8am and 11pm for 14 days. Classic ELB (back in the day) needed quite a lot of preemptive poking and fake generated traffic to be able to ramp up capacity fast enough for the beginning of each day (ALB is somewhat better but still needs juggling). We never even got close to getting autoscaling working nicely on the web and app server plane to let it loose in prod with real credit-card billing at risk, so we just add significantly over-provisioned spot instances for our best estimates of yearly growth (and app behaviour changes) for the two weeks instead, and cautiously babysit things for the duration.
It's nice we can do that. It'd be nicer if I didn't have to keep explaining to suits and C*Os why they can't boast on the golf course that they have an autoscaling backend...
Meanwhile, I need AWS support constantly because their entire platform is a gigantic social experiment in minimum viable products. How crusty a product are people willing to tolerate? Evidently: very, very crusty.
What I do know about GCP is that they had a production bug in their tooling that was breaking everything for literally hundreds of customers, and they never even bothered replying to the bug report on their support forums. That experience, along with their general modus operandi of shutting down things that don't further their surveillance-capitalism business model, means I won't be trying them again.
I'm convinced they MVP every possible idea because it makes their platform more sticky. The more services you use, the harder it is to leave them for something better. The problem with that is you get 100 half-dead zombie services and it's really hard to know which services are actually supported and which aren't.
It's the Amazon equivalent of Google creating 10 different messenger applications only Amazon never kills an old service, they just let it rot forever.
I think this is what leads to a lot of MVP-type services. A large customer clamors for some individual niche feature, AWS implements a MVP version of it, and then the team moves on to the next feature request (which might be for a completely different service, leaving the MVP in a perpetual MVP state).
Yes, we paid for enterprise support, but it's exceptionally good. They even contributed code to third-party open source projects to solve one of our bugs.
I often hear about Google's support being terrible, but the enterprise support on Google Cloud is definitely an exception.
F.D.: Aside from giving a talk at Google Stockholm once, I am not affiliated in any way.
I don’t think RDS would generally fit that concept as I understand it. Aurora’s data store possibly but you choose to use that specifically.
Obviously, there are good aspects to outsourcing devops/admin work. It's a tradeoff, as most things are. If you are a struggling startup, it's difficult to justify the cost of owning a bunch of hardware and hiring expensive infra people to manage it. However, Facebook is probably better off owning its own infrastructure.
In contrast with in-house infrastructure, you can make your stack as simple or as complex as you'd like depending on your needs (a lot of projects can get away with a handful of physical machines all configured manually, no Terraform/Kubernetes/etc) and you control when you make drastic changes that risk breaking things so you can plan them during a time when downtime would be the least damaging to your business.
However, if you're looking at it from the point of view of an employee trying to justify to your boss why the service is down, I'm sure they'll be more understanding that it's not your fault in this case. It's a risk they decided to take.
The only thing that 24/7 crack devops team ensures in a situation like this is that you continue to generate billable workload, even if it's just millions of little serverless pods spinning in a circle from some configuration mishap.
and their full downside.
what, pray tell, downside?
they operate at a scale (since you mentioned it) much larger than yours. you don't benefit from scale beyond what you need, you "only" benefit from the SLA.
you don't get to tell them what to do, to set their priorities on features vs bugs vs performance vs meaningless metric of the day.
you don't get to interact consistently with the same person or same set of staff, to understand their foibles and to nudge effectively.
you don't get to decide what features are critical to you and cannot, ever, ever be cut no matter how otherwise impactful they are on the environment.
you don't get to set the timetable for "events".
it's not the absolute no-brainer you are making it out to be.
that said, i agree that the value is solidly there for those in the fat part of the bell curve.
The design of a wooden table will inform what tools will produce the best version of that table. It may not be the newest power tool; it may end up being chisels, hand planes, winding sticks, a kerf saw and a mallet. If somebody maintains your tools for you they'll stay reliably sharp, but that doesn't lead to a good table unless you pick the right tools and use them the right way.
It's the Atlassian of chat tools. Horrible performance and bad usability.
I am on multiple OSS Discord servers with thousands of users, and it works just fine most of the time.
I am on multiple Slack servers with just 10-20 users and it is unbearably slow.
Server-side scaling was only an issue with OSS communities and for hackathons, where >50k people were online.
There's no way that my company would adopt a platform so "fun" in the way Discord tries to be. Animated characters, a logo which looks like a gamepad, etc...
It sucks, because I use Slack for work and mostly Discord for personal use (mostly dev communities for different companies), and Discord is far and away a better experience. If Discord provided a "cleaned-up" offering for orgs like mine (finance), I think it would be an instant hit.
Slack has a more mainstream look and feel which is probably one of the reasons it's preferred by companies.
I guess it was great when it started out, but they're slowly boiling the frog, who is us.
We'll eventually add a more native way of connecting a stream between two Zulip servers; we just want to be sure we do that right; federation done sloppily is asking for a lot of spam/abuse problems down the line.
(I'm the Zulip lead developer)
Discord also doesn't let you set message retention that I'm aware of, which is an immediate nonstarter.
End up in one lawsuit where the other party demands a fishing expedition and you'll be real happy that you've got retention limited to 90 days by policy and in practice.
Most need stuff going back at least a little bit. If I was talking to you about something on Friday and wanted to reference the conversation Monday I'd be real mad if the convo was already deleted.
In slack you can configure a retention period for files, messages, or both. Anything older than that gets wiped.
Other than that:
- shared channels between workspaces
- private messages within a workspace, instead of globally
- decoupled accounts from workspaces, so I can use my personal email and work email associated slack workspaces at the same time
- much tighter integration with third-party tools, e.g. Zoom, WebEx, etc.
Discord is great for casual chats with friends or open source communities. Insufficient for business.
I just don't understand why it's important for text based communication, but not voice.
Also the ability to draw on screen while screen sharing. So simple yet so useful.
I personally didn't like threads, and this is the first time someone has mentioned them favorably.
Imagine how Hacker News, Reddit, or any other discussion board would be without threading. Now, if you look at the #general Slack room of a 100-person company, you'll quickly see that it would be a mess without threading. That's what happens in Discord.
The central value proposition is its reimagined UI, making threaded replies more visible and more interactive by separating conversation topics (threads) into different columns.
It’s currently just an MVP, but I expect to at least improve the UI and maybe rearchitect the system in the near future.
Any feedback is greatly appreciated!