"7:52 AM PST We are investigating Internet connectivity issues to the US-WEST-1 Region."
Edit: Found root cause, maybe?
"8:01 AM PST We have identified the root cause of the Internet connectivity to the US-WEST-1 Region and have taken steps to restore connectivity. We have seen some improvement to Internet connectivity in the last few minutes but continue to work towards full recovery."
"8:01 AM PST We have identified the root cause of the Internet connectivity to the US-WEST-2 Region and have taken steps to restore connectivity. We have seen some improvement to Internet connectivity in the last few minutes but continue to work towards full recovery."
someone tripped over the fiber run i bet. Or a cleaning person unplugged a router to plug in a vacuum (that actually happened, but to a minicomputer iirc)
nah man, it's never the digger that's the idiot. it's always the project manager that told the digger where to dig. just like it's never the dev's fault as the PM made them do it. /s
It's interesting that west-2 was quicker to create the incident (despite the issue starting a bit later there, at least by our experience), and while they both "identified" at the same time, west-2 also waited longer to call it resolved.
I assume there are different teams responsible for each, is the west-2 team just more on top of things?
2. They don't really have much "legacy" stuff to deal with, since they likely turn over racks quickly across their whole fleet and software deployments should be standardized, so any us-east-1 flakiness has to do with the fact that it's where Amazon often houses its control planes.
Yes, it could be network peering related. But there's definitely a lot of us-west-1 and us-west-2 users complaining and people saying that us-east-1 seems fine.
It's still there now, on the top of the page, just marked resolved:
us-west-1:
7:52 AM PST We are investigating Internet connectivity issues to the US-WEST-1 Region.
8:01 AM PST We have identified the root cause of the Internet connectivity to the US-WEST-1 Region and have taken steps to restore connectivity. We have seen some improvement to Internet connectivity in the last few minutes but continue to work towards full recovery.
8:10 AM PST We have resolved the issue affecting Internet connectivity to the US-WEST-1 Region. Connectivity within the region was not affected by this event. The issue has been resolved and the service is operating normally.
us-west-2:
7:43 AM PST We are investigating Internet connectivity issues to the US-WEST-2 Region.
8:01 AM PST We have identified the root cause of the Internet connectivity to the US-WEST-2 Region and have taken steps to restore connectivity. We have seen some improvement to Internet connectivity in the last few minutes but continue to work towards full recovery.
8:14 AM PST We have resolved the issue affecting Internet connectivity to the US-WEST-2 Region. Connectivity within the region was not affected by this event. The issue has been resolved and the service is operating normally.
That is a shame. Anyone coming in after the fact to investigate an outage or glitch with their systems will need to look harder to find a known AWS outage. We can’t assume everyone looks at HN.
I thought that sounded ridiculous so I did the math: 99.9999999% uptime allows for about 31.5 _seconds_ of downtime every 1,000 years. It would take roughly 114,000 years to accrue just an hour's worth of allowable downtime. Within a single quarter of a year, the budget is about 7.9 ms, a small fraction of a single blink of an eye [2] and comparable to a single ping round-trip to my ISP, let alone Amazon's servers.
So yeah, having done that I now understand that it was probably a joke but it really puts into perspective just how ridiculous things can get with a few 9's.
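The arithmetic is a couple of lines to check for any number of nines:

```python
# Downtime budget for N nines of availability, e.g. N=9 -> 99.9999999%.
def allowed_downtime_seconds(nines: int, period_seconds: float) -> float:
    unavailability = 10 ** -nines   # 9 nines -> 1e-9
    return period_seconds * unavailability

YEAR = 365.25 * 24 * 3600           # seconds in an average year

print(allowed_downtime_seconds(9, 1000 * YEAR))  # ~31.6 s per millennium
print(allowed_downtime_seconds(9, YEAR / 4))     # ~7.9 ms per quarter
```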
User reports — i.e. the number of people who google “is X down” and then click a Down Detector link.
It’s a clever way of getting reasonably accurate data very quickly and easily, though it does have its flaws — the data is pretty noisy and users often attribute outages to the wrong service (e.g. blaming their ISP or Microsoft or something when YouTube is down, or vice versa).
I would guess the user is asking what Down Detector's dependencies are... E.g. can their website function if us-east-2 goes down? Or a GCP equivalent? Or are they on a self-hosted server? What would cause the metrics to be "off"?
They really need to stop requiring SVPs or higher to show non-green status on the status page, as other HNers have revealed in last week's AWS post. It's effectively not a status page, and they could probably be sued if it can be demonstrated that X service was down but the status page showed green (since the SLA is based on status page). Should be automated and based on sample deployments running in every region and every service. And they should use non-AWS instances to do the sampling, so they can actually sample when, say, we experience the obligatory black friday us-east-1 outage every year.
I think SVP / GM approval is only needed for yellow / red status. From my time in AWS Support, the Support Oncall and Call Leader / GM delegate worked to approve green-i posts.
If my app won't run for reasons that are not my fault for longer than the SLA guarantees, the affected services should be at least yellow status and I should be accumulating free AWS credits.
With some lame ass tiny blue "connectivity issues" informational text. Surely broken routing to two entire DCs is full red for all services available therein?
Like what, the networking is broken but if you could send packets, the services would still work so they are green?
I was still able to reach our service running in us-west-1 when the connectivity issue was still on-going, so I don't know if it was a full interruption.
I thought status pages and health pages were designed to automate reporting and check status automatically. That was my impression when I came across those pages. Apparently it's not automated and they only update it manually. What is the point of having a status page if it isn't automated? I'm sure FAANG and other tech conglomerates don't want it automated because of SLAs.
I'm surprised FAANG companies host their stuff on competitors' cloud services without providing a fallback cloud service in case the primary goes down. Sure it costs money, but it would be more effective than putting all their eggs in one basket.
As stated earlier, AWS has financial incentive to not update the status page. Nobody is willing to call them on the conflict of interest in a meaningful, market-changing way.
What if everyone has a financial incentive to lie? (They do.) Where do we go then? Also, saying "everyone just leave" is a lot easier than everyone just leaving, but that's a tired and repeated point. There's a huge mess and tangle of incentives and drawbacks, and I don't know if we'd ever get enough support to weed out a service that gets us above the nth percentile of greatness. As one falls, the other will begin to abuse its power; I don't trust any megacorp to do otherwise. Do you?
Any public communication is handled by people, not machines. No one wants to make an automated status page because there's a shit ton of real noise that users don't need to hear about, and there are a lot of outages that automation won't accurately catch.
True, if it is down, then that means AWS is down (not necessarily, obviously). :D But honestly, if they want to monitor AWS, they gotta pick something else for this reason, something that is not down when AWS is.
But we're talking about a status page, which should be basically static. In its simplest form you need a rack in 2+ random colos and a few people to manage the page update framework. Then you make teams submit the tests that are used to validate SLA. Run the tests from a few DCs and rebuild the status page every minute or two.
Maybe add a CDN. This shit isn't rocket science and being able to accurately monitor your own systems from off infrastructure is the one time you should really be separate.
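A sketch of what that loop could look like, with hypothetical endpoints and file paths (nothing here is AWS's actual tooling):

```python
import urllib.request

# Hypothetical SLA probes; in practice each team would submit its own tests.
ENDPOINTS = {
    "api": "https://api.example.com/health",
    "web": "https://www.example.com/",
}

def probe(url: str, timeout: float = 5.0) -> bool:
    """Return True if the endpoint answers with HTTP 2xx within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except OSError:
        return False

def render_status(results: dict) -> str:
    """Rebuild the static status page body from the latest probe results."""
    rows = [f"<li>{name}: {'UP' if ok else 'DOWN'}</li>"
            for name, ok in sorted(results.items())]
    return "<ul>\n" + "\n".join(rows) + "\n</ul>"

def rebuild_once(path: str = "status.html") -> dict:
    results = {name: probe(url) for name, url in ENDPOINTS.items()}
    with open(path, "w") as f:
        f.write(render_status(results))
    return results

# In practice: run rebuild_once() every minute or two (cron/systemd timer)
# from a couple of independent colos, with a CDN in front of status.html.
```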
They have a bazillion alexa and kindle devices out there that they could monitor from, heh heh. At least let that phone-home behaviour do something useful, like notice AWS is down.
AWS wouldn't monitor itself from a competitor, of course
Why not? The big tech companies use each other all the time.
For example, set up a new firewall on macOS and you can see how many times Apple pulls data from Amazon or Azure or other competitors' APIs and services.
But the idea that Amazon or Microsoft or Google would host anything at apple is pretty out there.
Apple uses their competitor's services because they can't build their own cloud and host their own shit. The big boys don't use competitors for services they are capable of building themselves.
A similar reason drives businesses to host `status.product.bigcorp` on a different server. And if your product is a cloud then your suggestion makes sense.
Considering the sea of bright green circles, reds might stand out but blues get lost in a fast scroll. Perhaps fade or mute the green icon to improve visibility of non-green which is the interesting information?
Now that they're back up they're not reporting any problems, how is it supposed to work? It looks like it is just repeating the status reported on the Amazon status page.
I wonder if AWS will make more or less money from these outages?
Will large players flee because of excessive instability? Or will smaller players go from single-AZ to more expensive multi-AZ?
My guess is that no-one will leave and lots of single-AZ tenants who should be multi-AZ will use this as the impetus to do it.
Honestly, having events like this is probably good for the overall resilience of distributed systems. It's like an immune system, you don't usually fail in the same way repeatedly.
We (Netflix) begged them for years to create a Chaos Monkey that we could pay for. There were things we just couldn't do ourselves, like simulate a power pull or just drop all network packets on the bare metal. I guess not enough people asked.
CMaaS sounds amazing for resiliency engineering. There's so much I want to be doing to perturb our stack, but I don't know all the ways stuff can go wrong. Sure I can ddos it, kick services and servers offline, etc, but that's what, a few dozen failure modes? Expertise in chaos would be valuable by itself. Not to mention being able to shake parts of the system I normally can't touch.
Side note: terraform is pretty good for causing various kinds of chaos, deliberately or otherwise.
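The simplest in-process flavor of this kind of fault injection can be sketched in a few lines; this is a toy illustration, not how Chaos Monkey (which kills instances, not function calls) actually works:

```python
import functools
import random

def chaotic(failure_rate: float = 0.05, exc: type = ConnectionError):
    """Wrap a function so it randomly fails, simulating a flaky dependency.

    Toy illustration only: real chaos tooling perturbs instances, networks,
    and power, which (as the parent notes) you often can't do yourself.
    """
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            if random.random() < failure_rate:
                raise exc(f"chaos: injected failure in {fn.__name__}")
            return fn(*args, **kwargs)
        return wrapper
    return decorator

# Hypothetical service call with a 5% injected failure rate.
@chaotic(failure_rate=0.05)
def fetch_profile(user_id: int) -> dict:
    return {"id": user_id, "name": "example"}
```

Running regular traffic through wrappers like this forces retry and fallback paths to get exercised continuously instead of only during real incidents.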
If my company is any indication, they're going to make more money since everyone will simply check the multi-AZ or multi-region checkboxes they didn't before and throw more money at the problem instead of doing proper resiliency engineering themselves.
It doesn't matter how much resiliency engineering you do: having everything in a single AZ is a risk. If that's acceptable, fine; if not, you need to think multi-AZ from day one.
Auth0 ran in six AZs in two regions[1] and went down today[2], because they picked the wrong two regions. How many regions and AZs should someone pay for before they get reliability?
At a minimum they should have chosen regions not in the same time zone or general geographic area. us-west-1 and us-west-2 might well safeguard against a server failure, but that is not a disaster plan. If your customers are global, choosing multiple continents is probably prudent.
No one just "moves off" AWS. Once your apps are spaghetti coded with lambdas, buckets and all sorts of stuff, it's basically impossible to get off. More than likely, as you noticed, it will increase spending since multi-AZ/multi-region will become the norm.
No -- if they needed to, they already would have migrated to a multi-region setup. If they don't need it, they won't have. The reason is simple -- it's expensive, as you say. I'm not a fanboi or evangelist of AWS either -- I do have pet theories that they gave their products shit names in order to make more money by making AWS skills less transferable to Google Cloud etc. S3 should be Amazon FTP, RDS should be Amazon SQL, etc.
Not at all the case. It was a regional outage that got Netflix to more than double our AWS spend going multi-region, so that outage netted them millions of extra dollars per year just from Netflix.
You’re underestimating the ability of eng leadership to not take these issues seriously. Only when there's sufficient pressure from the very top, or even from customers, does it take priority.
> There is no possibility that outages are good for AWS.
Do you know how many non-technical CEOs/boards/bosses have told their tech people that they need to go multi-region/cloud because that's what the one-paragraph blog and/or tweet told them to do in response to last weeks event?
This outage is extremely frustrating to me. My company hosts all our apps in gov cloud. Gov Cloud West 1 is also down, but the AWS Gov Cloud status page indicates that everything is healthy and green. I thought AWS's incident response to the East outage last week was that they'd update the status page to better reflect reality.
Down Detector doesn't really detect anything other than people saying "Is [service X] down?" on Twitter. If you believe them, Xbox Live is permanently offline, because the typical Xbox Live user will declare that anything from tripping over their ethernet cable to a tornado levelling their house counts as Xbox Live being down.
It’s still useful if you remove units from the graph and treat it as a sparkline. If there are reliably ~100 Xbox Live complaints on Twitter per hour, then suddenly there are 3000, that’s an outage.
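That heuristic is easy to sketch; the baseline counts and the sigma threshold below are made-up illustrations, not anything Down Detector publishes:

```python
from statistics import mean, stdev

def is_outage(history: list, current: int, sigma: float = 4.0) -> bool:
    """Flag an outage when the current report count is far above the baseline.

    `history` is recent hourly complaint counts; `sigma` is an assumed
    sensitivity threshold chosen for illustration.
    """
    if len(history) < 2:
        return False
    baseline, spread = mean(history), stdev(history)
    # max(..., 1.0) keeps a flat history from making the detector hair-trigger.
    return current > baseline + sigma * max(spread, 1.0)

# ~100 complaints/hour is the normal noise floor; 3000 is clearly a spike.
normal_hours = [95, 110, 102, 98, 105, 99, 101, 108]
assert not is_outage(normal_hours, 115)
assert is_outage(normal_hours, 3000)
```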
Is it bad that I can almost see that being a quick and dirty MVP to get out the door while you built your own cloud solution?
Raises serious migration and cost issues, but... would be interesting.
I think for some targeted things there might well be "value added" services you could offer to transparently wrap AWS. A "write-through" S3 wrapper was something I was actually looking at. Some clients, back when I was contracting, were very reluctant to trust anything but AWS for durability, but at the same time AWS bandwidth costs were so extortionate that renting our own servers from somewhere like Hetzner broke even at quite a small number of terabytes transferred each month: proxy writes both to a local disk and to S3, serve from the local disk, and fall back to pulling a fresh copy from S3 if it's missing.
The nice part about something like that is that, properly wrapped, you can change your durable storage as needed, and can even selectively pick "cheaper but less trusted" options for less critical data. It also lets you leverage AWS features to ride closer to the wire. To take another example than storage, I've used this to cut the cost of managed hosting by being able to spill over onto EC2 instances, allowing you to run at a much higher utilisation rate than you safely can on managed / colo / on-prem servers alone. Ironically, the ability to spill over onto EC2 makes EC2 far less competitive in terms of cost to actually run stuff on most of the time.
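The write-through pattern described above can be sketched with dict stand-ins for the local disk and S3 (a real version would use the filesystem and boto3):

```python
class WriteThroughStore:
    """Write to cheap local storage and a durable backend; read locally,
    falling back to the durable copy on a miss.

    `local` and `durable` are any dict-like stores; in the scenario above
    they would be a rented server's disk and S3 respectively.
    """
    def __init__(self, local, durable):
        self.local = local
        self.durable = durable

    def put(self, key: str, value: bytes) -> None:
        self.durable[key] = value   # durability first
        self.local[key] = value     # then the cheap serving copy

    def get(self, key: str) -> bytes:
        try:
            return self.local[key]
        except KeyError:
            value = self.durable[key]   # miss: pull a fresh copy
            self.local[key] = value
            return value

store = WriteThroughStore(local={}, durable={})
store.put("a.txt", b"hello")
store.local.pop("a.txt")               # simulate losing the local copy
assert store.get("a.txt") == b"hello"  # transparently refilled from durable
```

Because the durable backend is behind an interface, swapping S3 for a cheaper provider (or for per-key tiering) stays a local change.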
That's not how DownDetector works. It just relies on reports from users. The real failure case is users not understanding why they can't access whatever end service. Maybe they blame that service, maybe they blame their ISP, maybe they blame something else.
AWS seems to be working for me, but I’ve worked with clients in the US and spectrum internet tended to drop connections to us sporadically, which looks like an outage to our clients but is something we obviously can’t control.
If it's a network issue, it's on their side. I've verified from centurylink, comcast, cogent, he.net, at&t, and verizon - all of them are having issues. This isn't like: Cox is having an outage and just can't get to AWS.
(This is two similarly spec'd boxes on us-east-2 and us-west-2). Looking at GeoIP of connecting clients, the only pattern I can see is the region itself.
I'm wondering the same thing. We have stuff hosted in us-west-2 and multiple people across the US are reporting that our systems are down, however our system is working fine for me here, which is near Toronto.
The east-1 downtime was the interconnection between AWS hosted services, including the control plane, so most resources not dependent on AWS APIs stayed up (eg. non-autoscaled EC2 instances).
I wonder if you really dug into most company's tech stacks, how many of their support tools (e.g., PagerDuty) are reliant on overlapping cloud providers.
Oh man, it is insane. During the aws incident last week we couldn't build software because bitbucket pipelines were all down, due to them running lambdas in us-east-1 only haha.
We've taken a massive turn away from a "decentralized" internet.
Yeah, a number of people got hit by that. Louis Rossmann found out that every form of contact to his business was reliant on AWS us-east-1. https://www.youtube.com/watch?v=DE05jXUZ-FY
I'm so glad that I'm not still the CTO of a startup. I would be getting dozens of e-mails from people without engineering backgrounds asking "Are we multi-cloud", "why didn't you make us multi-cloud"?
The response is that this actually works well enough that no one has been pushed to make the investment required (meaning building the core infrastructure to make multi-cloud easy).
There was a brief period of time back in the early 90's where I felt I understood how Linux worked -- the kernel, startup scripts, drivers, processors, boot tools, etc... I could actually work on all levels of the system to some degree. Those days are long gone. I am far removed from many details of the systems I use today. I used to do a lot of assembly programming on multiple systems. Today I am not sure how most of the systems works in much detail.
To an extent, this is one of the goals, to free up engineers to work on higher level things. Whether it meets that goal in some cases is debatable, and it’s certainly not ideal for us engineers who like to get to the bottom of things.
“working on higher level things” currently implies that depending on many layers of opaque and unreliable lower level hardware and software abstractions is a good idea. I think it is a mistake.
The best conclusion I can come to is "sometimes it works, sometimes it doesn't". Depends on the context. I've seen cases where it works great and other times where it's a huge hassle.
Funny, I feel the exact opposite way. The low level stuff is where all the magic happens, where performance improvements can scale by orders of magnitude rather than linearly with a CTO’s budget. I’d much rather figure out how to condense some over-engineered distributed solution down to one machine with resources to spare.
Every time GitHub went down, multiple people posted on HN saying "ever since they were bought by Microsoft, ...". As annoying as those Rust evangelists on every single memory corruption bug.
I could have written the OP message a year ago -- I used to feel the same way.
Plz don't disparage Rust evangelism!
Rust is awesome. Yes, it is complex, frequently annoying, easy to learn, difficult to master. I'm speaking from a 30-year dev career.
A few months ago I intended to do a quick investigation into Rust to validate my "I really don't need to learn this", specifically for an embedded project. Within a few hours I found I had become a zealot. Rust has so many "omg, I should tell everybody about this" behaviors that I can't even pick my favorite aspect yet.
It's equivalent to a lost soul finding Christianity and accepting the lord's blessing and forgiveness! The weight that is lifted when your sins are forgiven == no more guilt, it's all forgiven! An immediate reduction of cognitive dissonance. In this example with Rust, it's pointer tracking and memory management, but it's basically the same thing. Rust is for the pious developer.
Those people who are still using C++ for fresh starts are the same folks who love to do things the hard & wrong way, or at least those who don't know any better, infidels, unwashed heathen.
While I'm not sure whether you're serious ;), to be clear: what annoys me is they don't really understand why we are having "someone pwned your phone via a series of memory corruption bugs" daily.
Until those Rust evangelists manage to rewrite the world in Rust (and I promise you there will still be a lot of security bugs), we still have to fix our shit in a low-cost way, and their evangelism does not help at all and is pure annoyance.
Taking badges out of the cloud reduces points of failure by several orders of magnitude.
Cloud-based badges make sense if you have locations with small staffs and no HR people or managers. Like if you're controlling access to a microwave tower on the top of a mountain.
But badges-in-the-cloud for an office building full of people who are being supervised by supposedly trusted managers, and all of whom have been vetted for security and by HR, is just being cheap.
Like the 1980's AT&T commercials used to say: "You get what you pay for."
> Taking badges out of the cloud reduces points of failure by several orders of magnitude.
I'm not convinced that's true, or at least certainly not an order of magnitude. Wouldn't a badge system hosted on-prem also need a user management system (database), a hosted management interface, have a dependency on the LAN, and need most of the same hardware? Such a system would also need to be running on a local server(s), which introduces points of failure around power continuity/surges, physical security, ongoing maintenance, etc.
All of those things would also be needed by the cloud provider, too. Just because it's on-prem doesn't mean it doesn't need servers, power conditioning, physical security, etc. "Cloud" isn't magic fairies. It's just renting someone else's points of failure.
In addition, you're forgetting the thousands of points of failure between the building and the cloud provider. Everything from routers being DDOSed by script kiddies to ransomware gangs attacking infrastructure to Phil McCracken slicing a fiber line with his new post hole digger.
The remote solution requires all of those same things, plus in addition it requires internet connectivity to be up and reliable, the cloud provider be available and the third party company be up and still in business.
Adding complexity and moving parts never reduces points of failure. It can reduce daily operating worries as long as everything works, but it can't reduce points of failure. It also means than someday when it breaks, the root causes will be more opaque.
Within the building’s on premise hosted infrastructure, are they going to buy multiple racks and multiple servers spread far enough apart so that there aren’t many single points of failure that will bring the badge machine down if they fail?
Many logical people have decided to abstract away their soul-crushing anxieties and legal gray area during outages to incredibly stable and well-staffed cloud infrastructure providers.
If you and your team are better at taking care of hardware than an entire building full of highly paid engineering specialists, then that's cool for you, but also, no you're not.
That's not to say you're not capable of running on-prem hardware that is stable.
I'm just saying that the high-handed swiping away of everyone else who's made an incredibly safe and logical decision to host their stuff in the cloud makes me question your general vibe.
> If you and your team are better at taking care of hardware than an entire building full of highly paid engineering specialists
The trade offs aren't quite that simple. Those specialists are necessary because they're building and maintaining infrastructure that's extremely complex since it has a crazy scale and has to be all things to all people. When you're running in-house, your infrastructure is simpler because it's custom tailored to your specific requirements and scale.
There are tradeoffs that make cloud vs local make sense in different contexts and there's no one right answer.
If you plan to replicate all of AWS I'd agree with you. But if all you need is a handful of servers, you could end up with better uptime doing it in-house just because you don't have all the moving parts that make AWS tick, reducing the chance for something to go wrong.
My bare-metal servers stayed up during both of the recent outages, not because I'm some kind of genius that's better than the AWS engineers but just because it's a dead simple stack that has zero moving parts and my project doesn't require anything more complex.
There is absolutely no reason for a local device (like a door lock or dishwasher as per OP) to depend on any external connectivity. Not to the company on-prem hardware, not to AWS.
Yep, it's broken again. I was trying to install some Thunderbird extensions, and stuff started breaking halfway through. Never thought of an AWS outage borking my mail client I guess...
I hear they do get people who want to be able to get experience at AWS's scale; there are only a few places for that.
The thing that really gets me is the reports from the last major outage a few days ago about how pervasive lying inside the company is. This really doesn't work well for engineering and we're possibly seeing the results of that. We should certainly expect to see that becoming visible the more time goes on without a major cultural shift. Which given that the guy who ran AWS now runs all of Amazon.com....
Multi-cloud is such an odd idea to me. You're either building abstractions on top of things like cloud-provider specific implementations of CDNs, K8S, S3, Postgres, etc...or using the cloud just for VMs. The latter would be cheaper with just old-school hosting from Equinix, Rackspace, etc. The former feels like a losing battle.
It’s prompted discussions of building multi regional services in my org but not multi cloud. They would have to really really really screw up for that to happen… maybe be down for like a week or something.
Reminder that the internet was literally invented to survive this kind of centralized failure (nuclear attack, originally). But I guess people are herdish animals and prefer to die as a group.
More like ultimately all these companies buy into a certain form of vendor lock-in, and they have no competence or willingness to migrate or even consider the competition. It starts with "oh, I'm just renting a remote virtual server" and in no time it's "oh, all my stack is tied to AWS proprietary products", because convenience. That's what Amazon wants.
Takes minutes to update a CloudFront distribution (they say around 5 minutes in their blog post from last year when speed was improved [1]). I think they might want to be able to change it to "everything's back to normal" in an instant, based on the SLA argument I've seen thrown around last time an AWS region was down.
I'm seeing outages on us-west-2 too. Customer facing traffic being served through Route53 -> ALB -> EC2 is down and CLI tools are failing to connect to AWS too.
I'm adding "reliable" into that mix. Too bad they're too expensive and hard to set up for side projects, but HN is probably one of the most stable sites I frequently visit, and I don't even think about it.
I disagree that they're expensive. Expensive to own maybe, but you can rent them on a monthly basis from something like Hetzner or OVH for a fraction of the cost of AWS (especially when you include bandwidth which is free and unmetered in this case) and they handle hardware maintenance for you.
Hard to set up is relative. It all depends on what you're doing and how much reliability you need. For a side project or a dev server you can just start with Debian, stick to packaged software (most language runtimes and services such as Postgres or Redis are available) as much as possible, and call it a day. You can even enable auto-updates on such a stable distro.
The knowledge you'll gain by dealing with bare-metal is also going to be useful in the cloud even in container environments.
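As a sketch, that whole setup is a handful of commands on a fresh Debian box (package names as in current Debian stable; adjust to taste):

```shell
# Install services straight from Debian's repos; they ship with sane
# defaults and systemd units, so they're running after install.
apt-get install --yes postgresql redis-server nginx

# Enable unattended security updates so the box patches itself.
apt-get install --yes unattended-upgrades
dpkg-reconfigure -plow unattended-upgrades
```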
HN definitely gets overloaded at times, including during big outages when everyone stampedes here. I got a bunch of "sorry, we can't serve your request" a little while back.
No; a copy of the HN database is synced regularly to Firebase (https://github.com/HackerNews/API), but IIRC the site itself runs on a single process on a single machine with a standby ready.
Tangentially related: On Friday Backblaze and B2 were down for 10+ hours to update their systems for the log4j2 vulnerability. Seemed noteworthy for the HN crowd and I posted a link to their announcement when the outage began. However, the post was quickly flagged and disappeared. Genuinely curious, why is announcing some outages ok and others not?
What would be the ratio of HNers who are Backblaze customers vs those who are AWS customers? I bet the Backblaze number is small enough that Backblaze employees on HN can downvote you enough for it to matter.
Multi-region is difficult and expensive, and a lot of projects aren't that important. Most of our infrastructure just isn't that vital; we'd rather take the occasional outage than spend the time and money implementing the sort of active-active multi-region infrastructure that a "correct" implementation would use. We took the recent 8 hour us-east-1 outage on the nose and have not reconsidered this plan. It was a calculated risk that we still believe we're on the right side of. Multi-AZ but single-region is a reasonable balance of cost, difficulty, and reliability for us.
I have some services which can cope with a 98.5% downtime, as long as they are available the specific 1.5% of the time we need them to run, as such "the cloud" is useless for that service
Right when you really want your thing to be up and can't amortize hours of continuous downtime, the cloud has no solution for this. That's something that often gets left out of the sales pitches tho =)
Depends on how critical they are to your stack. IME if you use more than a few products and any one of them can take you down, yeah, it's less than three nines. Just something to ponder, but if S3 didn't meet 99.9 for the month you get a whopping 10% back. Other cloud vendors aren't much better at this (actually worse). Not to mention that you need to leave some room for your own fuckups.
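For scale, the credit math looks like this; the tiers below mirror commonly published object-storage SLAs but are assumptions here, so check your actual agreement:

```python
def monthly_uptime_pct(downtime_minutes: float, days_in_month: int = 30) -> float:
    """Uptime percentage for the month, given total downtime in minutes."""
    total = days_in_month * 24 * 60
    return 100.0 * (total - downtime_minutes) / total

def service_credit_pct(uptime_pct: float) -> int:
    """Assumed credit tiers: <99.9% -> 10%, <99.0% -> 25%, <95.0% -> 100%."""
    if uptime_pct < 95.0:
        return 100
    if uptime_pct < 99.0:
        return 25
    if uptime_pct < 99.9:
        return 10
    return 0

# An 8-hour outage in a 30-day month:
uptime = monthly_uptime_pct(8 * 60)
print(round(uptime, 2), service_credit_pct(uptime))  # -> 98.89 25
```

Even a full workday of downtime only crosses the 25% tier, and the credit applies to that one service's bill, not to the business you lost while it was down.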
IDK, don't you end up with a bunch of extra costs? Like you're going to literally pay more money because now you have cross region replication charges, and then you're going to pay a latency cost, and then you may end up needing to overprovision your compute, etc.
All to go from, idk, 99.9% uptime to 99.95% (throwing out these numbers)? The thing is when AWS goes down so much of the internet goes down that companies don't really get called out individually.
You're saying that as if it's a walk in the park to set up and not cost prohibitive, in terms of opportunity cost and budget, especially for smaller companies.
Right. Downtime (or perception of downtime) is bad for business, so AWS is surely working to improve reliability to avoid more black eyes on their uptime. But at the same time, an AWS customer might be considering multi-region functionality in AWS to protect themselves ... from AWS making a mistake.
As a customer, it's unclear what the right approach is. Invest more with your vendor who caused the problem in the first place, or trust that they'll improve uptime?
An honest question. Why do you guys use AWS instead of dedicated servers? It's terribly expensive in comparison, nowadays equally complex, scalability is not magic and you need proper configuration either way, plus now the outages become more and more common. Frankly, I see no reason.
Once you have committed to a certain way of doing things, the transition costs can be very high.
Let's consider RockCo and CloudCo. They both provide a B2B SaaS that is mostly used interactively during the working day, and mostly via API calls for the rest of the working week. Demand is very much lower on weekends. Both RockCo and CloudCo were founded with a team of six people: a CEO who does sales, a CTO who can do lots of technology things, three general software developers, and one person who manages cloud services (for CloudCo) or wrangles systems and hosting (for RockCo).
In the first year, CloudCo spends less on computing than RockCo does, because CloudCo can buy spot instances of VMs in a few minutes and then stop paying for them when the job is done. RockCo needs a month to significantly change capacity, but once they've bought it, it is relatively cheap to maintain.
In the second year, they are both growing. CloudCo buys more average capacity, but is still seeing lots of dynamic changes. RockCo keeps growing capacity.
In the third year, they're still growing. CloudCo is noticing that their bills are really high, but all of their infrastructure is oriented to dynamic allocation. They start finding places where it makes sense to keep more VMs around all the time, which cuts the costs a little. RockCo can't absorb a dynamic swing, but their bills are now significantly lower every month than CloudCo's bills, and the machines that they bought two years ago are still quite competitive. A four year replacement cycle is deemed reasonable, with capacity still growing. And bandwidth for RockCo is much cheaper than the same bandwidth for CloudCo.
Who's going to win?
Well, you can't tell. If they both got unexpectedly sudden growth surges, RockCo might not have been able to keep up. If they both got unexpected lulls, CloudCo might have been able to reduce spending temporarily. RockCo spent more up front but much less over the long term. CloudCo could have avoided hiring their cloud administrator for several months at the beginning. RockCo's systems and network engineer is not cheap. And so on, and so forth.
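The comparison can be turned into a toy cost model; every figure below is invented for illustration, not real pricing:

```python
# Toy three-year cost model for the two hypothetical companies above.
# All figures are assumptions for illustration only.

def cloudco_cost(year: int) -> float:
    """Pay-as-you-go: low up-front, bills grow with usage."""
    base = 60_000                       # assumed first-year cloud bill
    return base * (1.8 ** (year - 1))   # assumed growth as usage scales

def rockco_cost(year: int) -> float:
    """Buy hardware up front, then cheaper steady-state operation."""
    capex = 150_000 if year == 1 else 30_000  # big initial buy, then expansion
    opex = 40_000                              # colo, bandwidth, engineer time
    return capex + opex

# Year 1: cloud is cheaper; by year 3 the curves have crossed hard.
for year in (1, 2, 3):
    print(year, cloudco_cost(year), rockco_cost(year))
```

Under these made-up numbers the answer flips partway through year two, which is exactly why "who wins" depends on growth surprises rather than on either model being universally right.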
https://downdetector.com/status/aws-amazon-web-services/