"7:52 AM PST We are investigating Internet connectivity issues to the US-WEST-1 Region."
Edit: Found root cause, maybe?
"8:01 AM PST We have identified the root cause of the Internet connectivity to the US-WEST-1 Region and have taken steps to restore connectivity. We have seen some improvement to Internet connectivity in the last few minutes but continue to work towards full recovery."
"8:01 AM PST We have identified the root cause of the Internet connectivity to the US-WEST-2 Region and have taken steps to restore connectivity. We have seen some improvement to Internet connectivity in the last few minutes but continue to work towards full recovery."
Someone tripped over the fiber run, I bet. Or a cleaning person unplugged a router to plug in a vacuum (that actually happened, but to a minicomputer, IIRC).
nah man, it's never the digger that's the idiot. it's always the project manager that told the digger where to dig. just like it's never the dev's fault as the PM made them do it. /s
It's interesting that west-2 was quicker to create the incident (despite the issue starting a bit later there, at least by our experience), and while they both "identified" at the same time, west-2 also waited longer to call it resolved.
I assume there are different teams responsible for each, is the west-2 team just more on top of things?
2. They don't really have much "legacy" stuff to deal with, since they likely turn over racks quickly across their whole fleet and software deployments should be standardized, so any us-east-1 flakiness has to do with the fact that it's where Amazon often houses their control planes.
Yes, it could be network peering related. But there's definitely a lot of us-west-1 and us-west-2 users complaining and people saying that us-east-1 seems fine.
It's still there now, on the top of the page, just marked resolved:
us-west-1:
7:52 AM PST We are investigating Internet connectivity issues to the US-WEST-1 Region.
8:01 AM PST We have identified the root cause of the Internet connectivity to the US-WEST-1 Region and have taken steps to restore connectivity. We have seen some improvement to Internet connectivity in the last few minutes but continue to work towards full recovery.
8:10 AM PST We have resolved the issue affecting Internet connectivity to the US-WEST-1 Region. Connectivity within the region was not affected by this event. The issue has been resolved and the service is operating normally.
us-west-2:
7:43 AM PST We are investigating Internet connectivity issues to the US-WEST-2 Region.
8:01 AM PST We have identified the root cause of the Internet connectivity to the US-WEST-2 Region and have taken steps to restore connectivity. We have seen some improvement to Internet connectivity in the last few minutes but continue to work towards full recovery.
8:14 AM PST We have resolved the issue affecting Internet connectivity to the US-WEST-2 Region. Connectivity within the region was not affected by this event. The issue has been resolved and the service is operating normally.
That is a shame. Anyone coming in after the fact to investigate an outage or glitch with their systems will need to look harder to find a known AWS outage. We can’t assume everyone looks at HN.
I thought that sounded ridiculous so I did the math and 99.9999999% uptime allows for 1.314 _seconds_ of downtime every 1000 years. It would take approximately 2.7 million years to accumulate just an hour's worth of allowable downtime; that's about how long it takes light to reach us from the second nearest spiral galaxy and the farthest object visible to the eye in perfect conditions [1]. Within a single quarter of a year, that's 328.5 μs (microseconds), or about 1200 times faster than the blink of an eye [2], or about 3 times faster than a typical electric capacitor camera flash [3], and also, interestingly enough, less than 1% of my current ping to my ISP, let alone Amazon's servers.
So yeah, having done that I now understand that it was probably a joke but it really puts into perspective just how ridiculous things can get with a few 9's.
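For anyone who wants to redo this kind of arithmetic, here is a minimal sketch. It assumes a 365-day year and reads the figure as nine nines (99.9999999%); the function names are made up for illustration. Under those assumptions the budget comes out at roughly 31.5 seconds per 1000 years, which differs from the figures quoted above, so treat both sets of numbers as rough. The point about how absurdly small the budget is stands either way.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600  # assumption: 365-day year, no leap days


def allowed_downtime_seconds(nines: int, years: float) -> float:
    """Downtime budget in seconds for `nines` nines of availability over `years` years."""
    unavailability = 10.0 ** (-nines)  # e.g. 9 nines -> a fraction of 1e-9
    return unavailability * years * SECONDS_PER_YEAR


def years_to_accumulate(nines: int, budget_seconds: float) -> float:
    """How many years it takes to accumulate `budget_seconds` of allowed downtime."""
    return budget_seconds / allowed_downtime_seconds(nines, 1.0)


if __name__ == "__main__":
    print(allowed_downtime_seconds(9, 1000))  # seconds per 1000 years
    print(years_to_accumulate(9, 3600))       # years to bank one hour
    print(allowed_downtime_seconds(3, 1))     # three nines, per year, for contrast
```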
User reports — i.e. the number of people who google “is X down” and then click a Down Detector link.
It’s a clever way of getting reasonably accurate data very quickly and easily, though it does have its flaws: the data is pretty noisy and users often attribute outages to the wrong service (e.g. blaming their ISP or Microsoft or something when YouTube is down, or vice versa).
I would guess the user is asking what Down Detector's dependencies are... E.g. can their website function if us-east-2 goes down? Or a GCP equivalent? Or are they on a self-hosted server? What would cause the metrics to be "off"?
They really need to stop requiring SVPs or higher to show non-green status on the status page, as other HNers have revealed in last week's AWS post. It's effectively not a status page, and they could probably be sued if it can be demonstrated that X service was down but the status page showed green (since the SLA is based on status page). Should be automated and based on sample deployments running in every region and every service. And they should use non-AWS instances to do the sampling, so they can actually sample when, say, we experience the obligatory black friday us-east-1 outage every year.
I think SVP / GM approval is only needed for yellow / red status. From my time in AWS Support, the Support Oncall and Call Leader / GM delegate worked to approve green-i posts.
If my app won't run for reasons that are not my fault for longer than the SLA guarantees, the affected services should be at least yellow status and I should be accumulating free AWS credits.
With some lame ass tiny blue "connectivity issues" informational text. Surely broken routing to two entire DCs is full red for all services available therein?
Like what, the networking is broken but if you could send packets, the services would still work so they are green?
I was still able to reach our service running in us-west-1 when the connectivity issue was still on-going, so I don't know if it was a full interruption.
I thought status pages or health pages were designed to check and report status automatically. That was my impression when I came across those status pages. Apparently they are not automated and are only updated manually. What is the point of having a status page if it cannot be automated? I'm sure FAANG and tech conglomerates don't want it to be automated because of SLAs.
I'm surprised that FAANG companies host their stuff on their competitors' cloud services without a fallback cloud service in case the primary is down. Sure, it costs money, but it would be more effective than putting all their eggs in one basket.
As stated earlier, AWS has financial incentive to not update the status page. Nobody is willing to call them on the conflict of interest in a meaningful, market-changing way.
What if everyone has a financial incentive to lie? (They do.) Where do we go then? Also, saying "everyone just leave" is a lot easier than everyone "just leaving", but that's tired and repeated. There's a huge mess and tangle of incentives and drawbacks and I don't know if we'd ever get enough support to weed out a service that gets us above the n'th percentile of greatness. As one falls the other will begin to abuse its power; I don't trust any megacorp to do otherwise. Do you?
Any public communication is handled by people, not machines. No one wants to make an automated status page because there's a shit ton of real noise that users don't need to hear about, and there are a lot of outages that automation won't accurately catch.
True, if it is down, then that means AWS is down (not necessarily, obviously). :D But honestly, if they want to monitor AWS, they gotta pick something else for this reason, something that is not down when AWS is.
But we're talking about a status page which should be basically static. In its simplest form you need a rack in 2+ random colos and a few people to manage the page update framework. Then you make teams submit the tests that are used to validate SLA. Run the tests from a few DCs and rebuild the status page every minute or two.
Maybe add a CDN. This shit isn't rocket science and being able to accurately monitor your own systems from off infrastructure is the one time you should really be separate.
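A rough sketch of the "run the tests from a few DCs and rebuild the page" idea, just to show how little logic the aggregation step needs. Everything here is hypothetical: the `ProbeResult` shape, the vantage names, and the green/yellow/red rule (all probes pass, some pass, none pass) are assumptions for illustration, not how AWS or anyone else actually does it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProbeResult:
    vantage: str  # hypothetical off-infrastructure probe location
    region: str
    service: str
    ok: bool      # did the team-submitted SLA test pass from this vantage?


def summarize(results):
    """Map (region, service) to "green", "yellow" or "red" from raw probe results."""
    totals = {}
    for r in results:
        passed, count = totals.get((r.region, r.service), (0, 0))
        totals[(r.region, r.service)] = (passed + int(r.ok), count + 1)

    status = {}
    for key, (passed, count) in totals.items():
        if passed == count:
            status[key] = "green"   # every vantage point succeeded
        elif passed == 0:
            status[key] = "red"     # nobody outside can reach it
        else:
            status[key] = "yellow"  # partial reachability
    return status


if __name__ == "__main__":
    sample = [
        ProbeResult("colo-a", "us-west-2", "ec2", False),
        ProbeResult("colo-b", "us-west-2", "ec2", True),
        ProbeResult("colo-a", "us-east-1", "ec2", True),
        ProbeResult("colo-b", "us-east-1", "ec2", True),
    ]
    for key, color in sorted(summarize(sample).items()):
        print(key, color)
```

The static page would then be regenerated from that dictionary every minute or two; noise filtering (for example, requiring two consecutive failing rounds) is left out here.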
They have a bazillion alexa and kindle devices out there that they could monitor from, heh heh. At least let that phone-home behaviour do something useful, like notice AWS is down.
AWS wouldn't monitor itself from a competitor, of course
Why not? The big tech companies use each other all the time.
For example, set up a new firewall on macOS and you can see how many times Apple pulls data from Amazon or Azure or other competitors' APIs and services.
But the idea that Amazon or Microsoft or Google would host anything at Apple is pretty out there.
Apple uses their competitor's services because they can't build their own cloud and host their own shit. The big boys don't use competitors for services they are capable of building themselves.
A similar reason drives businesses to host `status.product.bigcorp` on a different server. And if your product is a cloud then your suggestion makes sense.
Considering the sea of bright green circles, reds might stand out but blues get lost in a fast scroll. Perhaps fade or mute the green icon to improve visibility of non-green which is the interesting information?
Now that they're back up they're not reporting any problems, how is it supposed to work? It looks like it is just repeating the status reported on the Amazon status page.
I wonder if AWS will make more or less money from these outages?
Will large players flee because of excessive instability? Or will smaller players go from single-AZ to more expensive multi-AZ?
My guess is that no-one will leave and lots of single-AZ tenants who should be multi-AZ will use this as the impetus to do it.
Honestly, having events like this is probably good for the overall resilience of distributed systems. It's like an immune system, you don't usually fail in the same way repeatedly.
We (Netflix) begged them for years to create a Chaos Monkey that we could pay for. There were things we just couldn't do ourselves, like simulate a power pull or just drop all network packets on the bare metal. I guess not enough people asked.
CMaaS sounds amazing for resiliency engineering. There's so much I want to be doing to perturb our stack, but I don't know all the ways stuff can go wrong. Sure I can ddos it, kick services and servers offline, etc, but that's what, a few dozen failure modes? Expertise in chaos would be valuable by itself. Not to mention being able to shake parts of the system I normally can't touch.
Side note: terraform is pretty good for causing various kinds of chaos, deliberately or otherwise.
If my company is any indication, they're going to make more money since everyone will simply check the multi-AZ or multi-region checkboxes they didn't before and throw more money at the problem instead of doing proper resiliency engineering themselves.
It doesn’t matter how much resiliency engineering you do. Having everything in a single AZ is a risk. If that is acceptable, then it’s fine; if not, you need to think about multi-AZ from day 1.
Auth0 ran in six AZs in two regions[1] and went down today[2], because they picked the wrong two regions. How many regions and AZs should someone pay for before they get reliability?
At a minimum they should have chosen regions not in the same time zone or general geographic area. US-West 1 and US-West 2 might well be safeguarding against a server failure but is not a disaster plan. If your customers are global, choosing multiple continents is probably prudent.
No one just "moves off" AWS. Once your apps are spaghetti coded with lambdas, buckets and all sorts of stuff, it's basically impossible to get off. More than likely, as you noticed, it will increase spending since multi-AZ/multi-region will become the norm.
No -- if they needed to they already would have migrated to a multi-region. If they don't need it -- they won't have. The reason is simple -- it's expensive as you say. I'm not a fanboi or evangelist of AWS either -- I do have pet theories they named their products with shit names in order to make more money by making AWS skills less transferable to Google Cloud etc. S3 should be Amazon FTP, RDS should be Amazon SQL etc.
Not at all the case. It was a regional outage that got Netflix to more than double our AWS spend going multi-region, so that outage netted them millions of extra dollars per year just from Netflix.
You’re underestimating the ability of eng leadership to not take these issues seriously. Only when there’s sufficient pressure from the very top, or even from customers, does it become a priority.
> There is no possibility that outages are good for AWS.
Do you know how many non-technical CEOs/boards/bosses have told their tech people that they need to go multi-region/cloud because that's what the one-paragraph blog and/or tweet told them to do in response to last week's event?
This outage is extremely frustrating to me. My company hosts all our apps in gov cloud. Gov Cloud West 1 is also down, but the AWS Gov Cloud status page indicates that everything is healthy and green. I thought AWS's incident response to the East outage last week was that they'd update the status page to better reflect reality.
Down Detector doesn't really detect anything other than people saying "Is [service X] down?" on Twitter, which does mean that Xbox Live appears to be permanently offline if you believe them because the typical user for Xbox Live will declare anything from tripping over their ethernet cable to a tornado levelling their house preventing a connection to mean Xbox Live is down.
It’s still useful if you remove units from the graph and treat it as a sparkline. If there are reliably ~100 Xbox Live complaints on Twitter per hour, then suddenly there are 3000, that’s an outage.
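The "treat it as a sparkline" idea boils down to comparing the latest count against a baseline. A minimal sketch, assuming hourly complaint counts are already available as a list; the function name and the 10x threshold are arbitrary choices for illustration:

```python
from statistics import median


def is_spike(hourly_counts, latest, factor=10.0, min_baseline=1.0):
    """Return True if `latest` is at least `factor` times the typical hourly count.

    The median is used as the baseline so that one earlier outage in the
    history does not drag the "normal" level upward.
    """
    baseline = max(median(hourly_counts), min_baseline)
    return latest >= factor * baseline


if __name__ == "__main__":
    history = [90, 110, 100, 95, 105]  # the usual ~100 complaints per hour
    print(is_spike(history, 3000))     # sudden jump
    print(is_spike(history, 150))      # ordinary grumbling
```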
Is it bad that I can almost see that being a quick and dirty MVP to get out the door while you built your own cloud solution?
Raises serious migration and cost issues, but... would be interesting.
I think for some targeted things there might well be "value added" services you could offer to transparently wrap AWS. E.g. a "write-through" S3 wrapper was something I was actually looking at because some clients when I was contracting were very reluctant to trust anything but AWS for durability but at the same time AWS bandwidth costs were so extortionate that renting our own servers from somewhere like Hetzner and then proxying writes both to a local disk and to S3 and serve up from local disk with a fallback to pull a fresh copy from S3 if missing broke even at a quite small number of terabytes transferred each month.
The nice part about something like that is that properly wrapped you can change your durable storage as needed, and can easily even selectively pick "cheaper but less trusted" options for less critical data. It also allows you to leverage AWS features to ride closer to the wire. E.g. to take another example than storage, I've used this to cut the cost of managed hosting by being able to spill over onto EC2 instances in the past, allowing you to run at a much higher utilisation rate than what you can safely on managed / colo / on-prem servers alone - as a result, ironically the ability to spill over onto EC2 makes EC2 far less competitive in terms of cost to actually run stuff on most of the time.
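The write-through wrapper described above could look roughly like this. It is a sketch under stated assumptions: `WriteThroughStore` and `InMemoryDurable` are hypothetical names, and the durable backend is a stand-in with `put`/`get` methods rather than a real S3 client, so the example runs without any cloud account.

```python
import hashlib
import tempfile
from pathlib import Path


class InMemoryDurable:
    """Stand-in for the durable store (S3 in the comment above)."""

    def __init__(self):
        self.objects = {}

    def put(self, key, data):
        self.objects[key] = data

    def get(self, key):
        return self.objects[key]


class WriteThroughStore:
    """Writes go to both the durable store and local disk; reads prefer local disk."""

    def __init__(self, cache_dir, durable):
        self.cache_dir = Path(cache_dir)
        self.cache_dir.mkdir(parents=True, exist_ok=True)
        self.durable = durable

    def _path(self, key):
        # Hash the key so arbitrary object names map to safe flat filenames.
        return self.cache_dir / hashlib.sha256(key.encode("utf-8")).hexdigest()

    def put(self, key, data):
        self.durable.put(key, data)        # durable copy first
        self._path(key).write_bytes(data)  # then the cheap-to-serve local copy

    def get(self, key):
        path = self._path(key)
        if path.exists():
            return path.read_bytes()       # served locally, no egress cost
        data = self.durable.get(key)       # fallback: pull a fresh copy
        path.write_bytes(data)
        return data


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        store = WriteThroughStore(tmp, InMemoryDurable())
        store.put("reports/2021-12.csv", b"a,b\n1,2\n")
        print(store.get("reports/2021-12.csv"))
```

Because the backend is just an object with `put` and `get`, swapping in a different durable store for less critical data only changes what is passed to the constructor.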
That's not how DownDetector works. It just relies on reports from users. The real failure case is users not understanding why they can't access whatever end service. Maybe they blame that service, maybe they blame their ISP, maybe they blame something else.
AWS seems to be working for me, but I’ve worked with clients in the US, and Spectrum internet tended to drop connections to us sporadically, which looks like an outage to our clients but is something we obviously can’t control.
If it's a network issue, it's on their side. I've verified from centurylink, comcast, cogent, he.net, at&t, and verizon - all of them are having issues. This isn't like: Cox is having an outage and just can't get to AWS.
(This is two similarly spec'd boxes on us-east-2 and us-west-2). Looking at GeoIP of connecting clients, the only pattern I can see is the region itself.
I'm wondering the same thing. We have stuff hosted in us-west-2 and multiple people across the US are reporting that our systems are down, however our system is working fine for me here, which is near Toronto.
The east-1 downtime was the interconnection between AWS hosted services, including the control plane, so most resources not dependent on AWS APIs stayed up (eg. non-autoscaled EC2 instances).
I wonder, if you really dug into most companies' tech stacks, how many of their support tools (e.g., PagerDuty) are reliant on overlapping cloud providers.
Oh man, it is insane. During the aws incident last week we couldn't build software because bitbucket pipelines were all down, due to them running lambdas in us-east-1 only haha.
We've taken a massive turn away from a "decentralized" internet.
Yeah, a number of people got hit by that. Louis Rossmann found out that every form of contact to his business was reliant on AWS east 1. https://www.youtube.com/watch?v=DE05jXUZ-FY
I'm so glad that I'm not still the CTO of a startup. I would be getting dozens of e-mails from people without engineering backgrounds asking "Are we multi-cloud", "why didn't you make us multi-cloud"?
The response is that this actually works well enough, so the investment required has not pushed anyone to do it (with that meaning building the core infrastructure to make that easy).
There was a brief period of time back in the early 90's where I felt I understood how Linux worked -- the kernel, startup scripts, drivers, processors, boot tools, etc... I could actually work on all levels of the system to some degree. Those days are long gone. I am far removed from many details of the systems I use today. I used to do a lot of assembly programming on multiple systems. Today I am not sure how most of the systems work in much detail.
To an extent, this is one of the goals, to free up engineers to work on higher level things. Whether it meets that goal in some cases is debatable, and it’s certainly not ideal for us engineers who like to get to the bottom of things.
“working on higher level things” currently implies that depending on many layers of opaque and unreliable lower level hardware and software abstractions is a good idea. I think it is a mistake.
The best conclusion I can come to is "sometimes it works, sometimes it doesn't". Depends on the context. I've seen cases where it works great and other times where it's a huge hassle.
Funny, I feel the exact opposite way. The low level stuff is where all the magic happens, where performance improvements can scale by orders of magnitude rather than linearly with a CTO’s budget. I’d much rather figure out how to condense some over-engineered distributed solution down to one machine with resources to spare.
Every time GitHub goes down, multiple people post on HN saying "ever since they were bought by Microsoft, ...". As annoying as those Rust evangelists on every single memory corruption bug.
I could have written the OP message a year ago -- I used to feel the same way.
Plz don't disparage Rust evangelism!
Rust is awesome. Yes, it is complex, frequently annoying, and easy to learn but difficult to master. I'm speaking from a 30-year dev career.
A few months ago I intended to do a quick investigation into Rust to validate my "I really don't need to learn this" stance, specifically for an embedded project. Within a few hours I found I had become a zealot. Rust has so many "omg, I should tell everybody about this" behaviors that I can't even pick my favorite aspect yet.
It's equivalent to a lost soul finding Christianity and accepting the Lord's blessing and forgiveness! The weight lifted by having your sins forgiven means no more guilt; it's all forgiven! An immediate reduction of cognitive dissonance. In this example with Rust it's pointer tracking and memory management, but it's basically the same thing. Rust is for the pious developer.
Those people who are still using C++ for fresh starts are the same folks who love to do things the hard & wrong way, or at least those who don't know any better, infidels, unwashed heathen.
While I'm not sure whether you're serious ;), to be clear: what annoys me is they don't really understand why we are having "someone pwned your phone via a series of memory corruption bugs" daily.
Until those Rust evangelists manage to rewrite the world in Rust (and I promise you there will still be a lot of security bugs), we still have to fix our shit in a low-cost way, and their evangelism does not help at all and is pure annoyance.
Taking badges out of the cloud reduces points of failure by several orders of magnitude.
Cloud-based badges make sense if you have locations with small staffs and no HR people or managers. Like if you're controlling access to a microwave tower on the top of a mountain.
But badges-in-the-cloud for an office building full of people who are being supervised by supposedly trusted managers, and all of whom have been vetted for security and by HR, is just being cheap.
Like the 1980's AT&T commercials used to say: "You get what you pay for."
> Taking badges out of the cloud reduces points of failure by several orders of magnitude.
I'm not convinced that's true, or at least certainly not an order of magnitude. Wouldn't a badge system hosted on-prem also need a user management system (database), a hosted management interface, have a dependency on the LAN, and need most of the same hardware? Such a system would also need to be running on a local server(s), which introduces points of failure around power continuity/surges, physical security, ongoing maintenance, etc.
All of those things would also be needed by the cloud provider, too. Just because it's on-prem doesn't mean it doesn't need servers, power conditioning, physical security, etc. "Cloud" isn't magic fairies. It's just renting someone else's points of failure.
In addition, you're forgetting the thousands of points of failure between the building and the cloud provider. Everything from routers being DDOSed by script kiddies to ransomware gangs attacking infrastructure to Phil McCracken slicing a fiber line with his new post hole digger.
The remote solution requires all of those same things, plus in addition it requires internet connectivity to be up and reliable, the cloud provider be available and the third party company be up and still in business.
Adding complexity and moving parts never reduces points of failure. It can reduce daily operating worries as long as everything works, but it can't reduce points of failure. It also means that someday when it breaks, the root causes will be more opaque.
Within the building’s on premise hosted infrastructure, are they going to buy multiple racks and multiple servers spread far enough apart so that there aren’t many single points of failure that will bring the badge machine down if they fail?
Many logical people have decided to abstract away their soul-crushing anxieties and legal gray area during outages to incredibly stable and well-staffed cloud infrastructure providers.
If you and your team are better at taking care of hardware than an entire building full of highly paid engineering specialists, then that's cool for you, but also, no you're not.
That's not to say you're not capable of running on-prem hardware that is stable.
I'm just saying that the high-handed swiping away of everyone else who's made an incredibly safe and logical decision to host their stuff in the cloud makes me question your general vibe.
> If you and your team are better at taking care of hardware than an entire building full of highly paid engineering specialists
The trade offs aren't quite that simple. Those specialists are necessary because they're building and maintaining infrastructure that's extremely complex since it has a crazy scale and has to be all things to all people. When you're running in-house, your infrastructure is simpler because it's custom tailored to your specific requirements and scale.
There are tradeoffs that make cloud vs local make sense in different contexts and there's no one right answer.
If you plan to replicate all of AWS I'd agree with you. But if all you need is a handful of servers, you could end up with better uptime doing it in-house just because you don't have all the moving parts that make AWS tick, reducing the chance for something to go wrong.
My bare-metal servers stayed up during both of the recent outages, not because I'm some kind of genius that's better than the AWS engineers but just because it's a dead simple stack that has zero moving parts and my project doesn't require anything more complex.
There is absolutely no reason for a local device (like a door lock or dishwasher as per OP) to depend on any external connectivity. Not to the company on-prem hardware, not to AWS.
Yep, it's broken again. I was trying to install some Thunderbird extensions, and stuff started breaking halfway through. Never thought of an AWS outage borking my mail client I guess...
I hear they do get people who want to be able to get experience at AWS's scale, there's only a few places for that.
The thing that really gets me is the reports from the last major outage a few days ago about how pervasive lying inside the company is. This really doesn't work well for engineering and we're possibly seeing the results of that. We should certainly expect to see that becoming visible the more time goes on without a major cultural shift. Which given that the guy who ran AWS now runs all of Amazon.com....
Multi-cloud is such an odd idea to me. You're either building abstractions on top of things like cloud-provider specific implementations of CDNs, K8S, S3, Postgres, etc...or using the cloud just for VMs. The latter would be cheaper with just old-school hosting from Equinix, Rackspace, etc. The former feels like a losing battle.
It’s prompted discussions of building multi regional services in my org but not multi cloud. They would have to really really really screw up for that to happen… maybe be down for like a week or something.
Reminder that the internet was literally invented to avoid this kind of nuclear attack. But I guess people are herdish animals and prefer to die as a group.
More like ultimately all these companies buy into a certain form of vendor lock-in and they have no competence or willingness to migrate or even consider the competition. It starts with "oh I'm just renting a remote virtual server" and in no time it's "Oh, all my stack is tied to AWS proprietary products" because convenience. That's what Amazon wants.
Takes minutes to update a CloudFront distribution (they say around 5 minutes in their blog post from last year when speed was improved [1]). I think they might want to be able to change it to "everything's back to normal" in an instant, based on the SLA argument I've seen thrown around last time an AWS region was down.
I'm seeing outages on us-west-2 too. Customer facing traffic being served through Route53 -> ALB -> EC2 is down and CLI tools are failing to connect to AWS too.
I'm adding "reliable" into that mix. Too bad they're too expensive and hard to set up for side projects, but HN is probably one of the most stable sites I frequently visit, and I don't even think about it.
I disagree that they're expensive. Expensive to own maybe, but you can rent them on a monthly basis from something like Hetzner or OVH for a fraction of the cost of AWS (especially when you include bandwidth which is free and unmetered in this case) and they handle hardware maintenance for you.
Hard to set up is relative. It all depends on what you're doing and how much reliability you need. For a side project or a dev server you can just start with Debian, stick to packaged software (most language runtimes and services such as Postgres or Redis are available) as much as possible and call it a day. You can even enable auto-updates on such a stable distro.
The knowledge you'll gain by dealing with bare-metal is also going to be useful in the cloud even in container environments.
HN definitely gets overloaded at times, including during big outages when everyone stampedes here. I got a bunch of "sorry, we can't serve your request" a little while back.
No; a copy of the HN database is synced regularly to Firebase (https://github.com/HackerNews/API), but IIRC the site itself runs on a single process on a single machine with a standby ready.
Tangentially related: On Friday Backblaze and B2 were down for 10+ hours to update their systems for the log4j2 vulnerability. Seemed noteworthy for the HN crowd and I posted a link to their announcement when the outage began. However, the post was quickly flagged and disappeared. Genuinely curious, why is announcing some outages ok and others not?
What would be the ratio of HNers who are Backblaze customers vs those who are AWS customers? I bet the Backblaze number is small enough that Backblaze employees on HN can downvote you enough for it to matter.
Multi-region is difficult and expensive, and a lot of projects aren't that important. Most of our infrastructure just isn't that vital; we'd rather take the occasional outage than spend the time and money implementing the sort of active-active multi-region infrastructure that a "correct" implementation would use. We took the recent 8 hour us-east-1 outage on the nose and have not reconsidered this plan. It was a calculated risk that we still believe we're on the right side of. Multi-AZ but single-region is a reasonable balance of cost, difficulty, and reliability for us.
I have some services which can cope with a 98.5% downtime, as long as they are available the specific 1.5% of the time we need them to run, as such "the cloud" is useless for that service
Right. When you really want your thing to be up and can’t amortize hours of continuous downtime, cloud has no solution for this. That’s something that often gets left out of the sales pitches tho =)
Depends on how critical they are to your stack. IME, if you use more than a few products and any one of them can take you down, yeah, it’s less than 3 nines. Just something to ponder: if S3 doesn't meet 99.9% for the month you get a whopping 10% back. Other cloud vendors aren’t much better at this (actually worse). Not to mention that you need to leave some room for your own fuckups.
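To put a number on the "more than a few products" point: if every product is a hard dependency and failures are independent (both simplifying assumptions, and the 99.9% per-service figure is just the SLA number mentioned above), availabilities multiply. A small sketch with illustrative function names:

```python
import math

MINUTES_PER_MONTH = 30 * 24 * 60  # assumption: 30-day month


def composite_availability(availabilities):
    """Availability of a stack where every listed service must be up (independent failures)."""
    return math.prod(availabilities)


def nines(availability):
    """Express availability as a (fractional) count of nines, e.g. 0.999 -> 3.0."""
    return -math.log10(1.0 - availability)


def monthly_downtime_minutes(availability):
    """Expected unavailable minutes in a 30-day month."""
    return (1.0 - availability) * MINUTES_PER_MONTH


if __name__ == "__main__":
    for n in (1, 2, 4, 8):
        a = composite_availability([0.999] * n)
        print(n, round(a, 6), round(nines(a), 2), round(monthly_downtime_minutes(a), 1))
```

With four such dependencies the stack already sits around 2.4 nines, before counting any outages you cause yourself.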
IDK, don't you end up with a bunch of extra costs? Like you're going to literally pay more money because now you have cross region replication charges, and then you're going to pay a latency cost, and then you may end up needing to overprovision your compute, etc.
All to go from, idk, 99.9% uptime to 99.95% (throwing out these numbers)? The thing is when AWS goes down so much of the internet goes down that companies don't really get called out individually.
You're saying that as if it's a walk in the park to set up and not cost prohibitive, in terms of opportunity cost and budget, especially for smaller companies.
Right. Downtime (or perception of downtime) is bad for business, so AWS is surely working to improve reliability to avoid more black eyes on their uptime. But at the same time, an AWS customer might be considering multi-region functionality in AWS to protect themselves ... from AWS making a mistake.
As a customer, it's unclear what the right approach is. Invest more with your vendor who caused the problem in the first place, or trust that they'll improve uptime?
An honest question. Why do you guys use AWS instead of dedicated servers? It's terribly expensive in comparison, nowadays equally complex, scalability is not magic and you need proper configuration either way, plus now the outages become more and more common. Frankly, I see no reason.
Once you have committed to a certain way of doing things, the transition costs can be very high.
Let's consider RockCo and CloudCo. They both provide a B2B SAAS that is mostly used interactively during the working day, and mostly used via API calls for the rest of the working week. Demand is very much lower on weekends. Both RockCo and CloudCo were founded with a team of six people: a CEO who does sales, a CTO who can do lots of technology things, three general software developers, and one person who manages cloud services (for CloudCo) or wrangles systems and hosting (for RockCo).
In the first year, CloudCo spends less on computing than RockCo does, because CloudCo can buy spot instances of VMs in a few minutes and then stop paying for them when the job is done. RockCo needs a month to significantly change capacity, but once they've bought it, it is relatively cheap to maintain.
In the second year, they are both growing. CloudCo buys more average capacity, but is still seeing lots of dynamic changes. RockCo keeps growing capacity.
In the third year, they're still growing. CloudCo is noticing that their bills are really high, but all of their infrastructure is oriented to dynamic allocation. They start finding places where it makes sense to keep more VMs around all the time, which cuts the costs a little. RockCo can't absorb a dynamic swing, but their bills are now significantly lower every month than CloudCo's bills, and the machines that they bought two years ago are still quite competitive. A four year replacement cycle is deemed reasonable, with capacity still growing. And bandwidth for RockCo is much cheaper than the same bandwidth for CloudCo.
Who's going to win?
Well, you can't tell. If they both got unexpectedly sudden growth surges, RockCo might not have been able to keep up. If they both got unexpected lulls, CloudCo might have been able to reduce spending temporarily. RockCo spent more up front but much less over the long term. CloudCo could have avoided hiring their cloud administrator for several months at the beginning. RockCo's systems and network engineer is not cheap. And so on, and so forth.
I don't know for sure, but this is generally common because caches get cold.
A lot of websites use a cache in front of databases (or template rendering engines, or many other systems). That cache might evict entries based on time - after 5 minutes, the entry is considered invalid.
But that means that if you have no traffic for 10 minutes, the cache completely empties. Then when traffic returns, it all skips the cache and actually triggers a real hit to the backend - which is now overwhelmed with traffic. The cache protects the backend in normal behavior, but now it's not doing its job, so the backend has many more requests than usual.
In the worst case, those requests are enqueued in a big serial sequence... but the ones at the back of the queue may time out. The client may do something like say "it's taken me 5 seconds and I still don't have a response - I'll abort and retry!" and now you have even _more_ traffic to deal with.
So cold caches and retries can conspire to keep a service down for a long time even after the root cause is fixed.
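A minimal sketch of the cold-cache effect described above, using a fake clock so it runs instantly. The class names, the 100 keys, and the backend stand-in are hypothetical. Only the 5-minute expiry and the 10 idle minutes come from the comment.

```python
class FakeClock:
    """Manually advanced clock (seconds)."""

    def __init__(self):
        self.now = 0.0

    def __call__(self):
        return self.now


class Backend:
    """Stand-in for the database or renderer; counts real hits."""

    def __init__(self):
        self.hits = 0

    def load(self, key):
        self.hits += 1
        return f"page for {key}"


class TTLCache:
    """Entries are treated as invalid once they are ttl seconds old."""

    def __init__(self, ttl, clock):
        self.ttl = ttl
        self.clock = clock
        self.store = {}  # key -> (value, stored_at)

    def get(self, key, load):
        now = self.clock()
        entry = self.store.get(key)
        if entry is not None and now - entry[1] < self.ttl:
            return entry[0]  # fresh: the backend is not touched
        value = load(key)  # missing or expired: real backend hit
        self.store[key] = (value, now)
        return value


clock = FakeClock()
backend = Backend()
cache = TTLCache(ttl=300, clock=clock)  # 5-minute expiry
keys = [f"/item/{i}" for i in range(100)]


def serve_burst():
    """Request every key once; return how many requests reached the backend."""
    before = backend.hits
    for key in keys:
        cache.get(key, backend.load)
    return backend.hits - before


warmup = serve_burst()      # empty cache: everything misses
clock.now += 60
steady = serve_burst()      # one minute later: served from cache
clock.now += 600
after_idle = serve_burst()  # ten idle minutes later: every entry has expired
print(warmup, steady, after_idle)  # 100 0 100
```

The third burst hits the backend as hard as a cold start, and that is before any client retries are added on top.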
I'm accustomed to cache-eviction policies based on LRU, age, etc. But in my systems, eviction happens only when (a) the content is known to be invalid, or (b) there's competition for cache space.
IIUC the parent comment, it's describing a policy that evicts entries even when (a) and (b) are false. Is that common in the web-hosting / CDN world? Or is age considered a proxy for stale?
Right, age is used as a proxy for stale, because we often don't have anything better.
A lot of web systems work this way - DNS records for example use a "TTL" which means "time to live." If the TTL is 60, then you throw it out of the cache after 60 seconds even if you have room in the cache, and you have no reason to believe it's invalid. This lets independent entities (like a DNS authority) make a change and get it rolled out everywhere.
I think the reason this is common is that proving cache invalidity is so hard, especially with the typical "dumb" cache appliances that are widely used. They just do stuff like cache the response bytes for a particular URL; they might not even understand HTTP beyond interpreting the request's headers, and certainly don't really understand the response.
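To make the two eviction reasons in this exchange concrete, here is a rough sketch of a cache that drops entries both under space pressure (LRU) and by age alone (TTL). The `Cache` class and its methods are illustrative and not taken from any particular cache appliance or DNS resolver. The address is from a documentation-only range.

```python
from collections import OrderedDict


class Cache:
    """Evicts on space pressure (least recently used) and on age (TTL)."""

    def __init__(self, capacity, ttl):
        self.capacity = capacity
        self.ttl = ttl
        self.entries = OrderedDict()  # key -> (value, stored_at), least recent first

    def put(self, key, value, now):
        self.entries[key] = (value, now)
        self.entries.move_to_end(key)
        while len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # competition for space: drop the LRU entry

    def get(self, key, now):
        item = self.entries.get(key)
        if item is None:
            return None
        value, stored_at = item
        if now - stored_at >= self.ttl:
            del self.entries[key]  # age alone: dropped even though there is room
            return None
        self.entries.move_to_end(key)  # mark as recently used
        return value


cache = Cache(capacity=1000, ttl=60)
cache.put("example.com", "192.0.2.1", now=0)
print(cache.get("example.com", now=59))  # 192.0.2.1
print(cache.get("example.com", now=60))  # None
```

The second lookup misses with 999 free slots and no evidence the record changed, which is the "age as a proxy for stale" behaviour being discussed.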
Crunchyroll seems to barely work at the best of times, and when it does, it's still a mess.
All sorts of issues still unresolved for years, including the ridiculously annoying "Finishes playing season English sub, autoplays first season of German dub, which then gets stuck". Still no profiles (nerfing their super-premium offering). Auto-resume points are unreliable, the Android app is hot garbage at dealing with network disruption...
I can only imagine their back-end is mostly Visual Basic running on a single AWS-powered VM.
Even as a software engineer, I think I could build from primitive materials a couple of battery operated transceivers to replace the signal flags or horsemen for critical communications. A little basic physics and materials science goes a long way.
Seems to be down in a major way. Lots of various AWS services are down. However, so many things depend on AWS that it could just be EC2 is down and it is causing a ripple effect.
Systems manager in eu-central-1 is giving us some issues now, but I am not sure about their internal architecture for it, so maybe it needs some US resources?
AWS Global Accelerator not working correctly anymore as well, connections dropped worldwide. Seems like it is managed from us-west-2 and not redundant.
This comment taught me about the existence of Global Accelerator and, somewhat ironically given the context, we decided to deploy it today. Pretty neat! I'll have to keep in mind that I learned about it because of a worldwide outage :) Thanks!
HOST THE GODDAMN STATUS PAGE ON AZURE FOR FUCKS SAKE.
There is zero excuse for this shit. Be professional. Acknowledge reality. It is logically impossible to run your own status page. Trying to do so just wastes everyone else on the internet's time when you have an outage.
They should host their status page on IPFS instead. If you're never going to change the contents of your status page, you might as well put it into immutable storage!
Top of the board showed “Internet Connectivity Issues (Oregon)”
And that was that. The board worked exactly as it should - it immediately explained my missing traffic and kept me up-to-date with the status of the outage on their side.
They should automatically update as well. Currently it is a static "all green" page and might be manually changed if a manager gives the go-ahead. Insane.
Given the legal liabilities Amazon has with their SLAs, it may be working exactly as Amazon thinks it should. Whether anybody would agree with that assessment should be obvious.
I kind of think everyone else here understands this very particular problem of a status page running on the same equipment that it's supposed to be monitoring if that equipment goes down, and for whatever reason, you don't.
I understand that. What I’m questioning is whether that is the problem here. Is it? Do you know? I heard it might be an internet provider issue, in which case the status page is not the problem here.
We are barbarians occupying a city built by an advanced civilization, marveling at the hot baths but knowing nothing about how their builders kept them running. One day, the baths will drain and anyone who remembers how to fill them up will have died.
Many years ago I stood at the window of my comfortable apartment, watching wind and cold rain rage outside.
I thought about my cave men ancestors who during such a storm if they needed water would have to go out and get it, getting themselves soaked.
If I wanted water, the tap in the kitchen would give it to me, in a nice controlled fashion. If I did feel like having water rain down upon me, my shower would do that, again in a controlled fashion, and I could select the water temperature.
If they wanted the cave to be warmer, they had to burn something and deal with the smoke. And they might have to work hard to obtain whatever it is they burn.
If I wanted my apartment warmer, I just had to turn the knob on the thermostat.
They were at the mercy of their environment. My environment is mine to command. I was feeling pretty superior to my cave man ancestors.
Then I realized that I don't know how to build the systems that I was relying on for my supposed superiority, or even how some of them work.
I'm really just a cave man that found a nicer cave.
> Then I realized that I don't know how to build the systems that I was relying on for my supposed superiority, or even how some of them work.
I used to have this joke(?) with my friends: remember Mark Twain's "A Connecticut Yankee in King Arthur's Court"? The titular Yankee basically upends the (faux) medieval society he gets transported to, "inventing" all sorts of technological miracles.
Well, I'm a software developer but don't come from an engineering background (I mean actual engineering, not programming). I don't even understand how electricity or the telephone work (I mean, old fashioned telephones, let alone current mobile networks). If I were transported 2 or 3 centuries into the past, I wouldn't be able to explain modern technology to other people, let alone actually build it.
I sort of understand how steam machines work, and I could "invent" the printing press. I guess. But anything related to circuitry, electricity, chemistry, engineering of any sort, I wouldn't be able to even begin explaining them to King Arthur.
My introduction to the knights of the round table would go something like this:
"We are questing for the Holy Grail, oh noble stranger from a far away land! How can you help?"
"Depends, which version of Python are you running?"
A light, enjoyable read along these lines is Leo Frankowski's "high tech knight" series, starting with The Cross-time Engineer. The main character -- a real engineer -- gets transported back to medieval Poland, and he knows that he's got ten years either to bug out, or help Poland defend itself from the coming Mongol invasion.
[I only liked the first four books, but that's enough to cover the original story arc]
This shirt annoys me. I get that it is a joke, but the explanations are just so woefully over-simplified, and don't get at the main problem -- materials and manufacturing technology in the past was poor enough that even if you knew the basic physics you'd have no chance of getting, like, material to build a wing out of.
What, not even pinewood and gelatin for ribs and stringers, and some linen cloth plus pine resin and alcohol for doping? Seriously, that's like 1000BC tech level.
Wing is no problem as long as one can calculate how to make it stiff enough and of a right shape.
Inventing the printing press was more difficult than it seems at first. In addition to the idea of using movable type, significant development of the correct alloys for the types was necessary. The alloy needs to be easy to cast and at the same time durable enough to be reused for a large number of print runs.
In addition the proper ink needs to be developed...
Fun fact: printing rates increased from about 120 sheets/hour to over 1 million over the course of the 19th century. Presses went from wooden screw presses that differed little from Gutenberg's to cast-iron, rotary, steam- and later electric-powered, and web (continuous paper feed) presses, and from matrix plates (with individual type set in blocks) to offset Linotype (in which the entire print block was cast as a single sheet through multiple stages from the original matrix characters).
Thought just occurs: the falling characters of the iconic Matrix screen somewhat resemble the individual type elements flowing and falling through a Linotype machine. I don't know if that is a deliberate or incidental reference, but it's an interesting one.
Right, let me amend my statement: I understand how the printing press with movable type works and I would be able to explain it to King Arthur, but I probably wouldn't be able to actually craft the types, inks, etc, and so the annoyed King would have me beheaded.
Even if you knew what to do, convincing the naturally suspicious people back then to trust a strange outsider would be tricky. Then you have to get the right materials.
If I were a bit more clever, or maybe if I was 50 years older and had played with this kind of stuff growing up, I'd probably try to make a spark-gap transmitter. That seems to be in a sweet spot of not requiring too many super clever bits, and having obvious applications.
Also on a similar theme: https://en.wikisource.org/wiki/I,_Pencil (it's intended to be about free market economies, but you can also read it as something about knowing how even simple modern marvels work at all).
At a very general level once you move past subsistence farming you become reliant on society to provide your needs. And in turn provide some value that can only come from spending your time on things other than farming. And that is I suppose how civilization advances. It's kind of funny to work backwards though, because even subsistence farmers are reliant on society for protection -- they are farmers not soldiers after all. I think about this a lot, how important trust is to going anywhere in modern life. And how little choice there is anyways. I also think about how most people don't think about it at all, or very much, and wonder if knowing how fragile we are makes me happier and more productive, or less so.
There's an interesting misconception that humans developed agricultural societies because they achieved better outcomes as individuals. Research shows that hunter-gatherers were healthier and better nourished than humans in early agricultural settlements.
What's probably closer to truth is that many humans were forced to join farming communities. Stronger individuals or tribes probably enslaved others, and then forced them to build and produce.
The patterns of inequity and the march toward hyper-specialization we still see today make sense in that context.
As a tangent, if anyone is interested in that "cavemanness" deep in our DNA, check out the idea of primitive camping. That was my first experience camping, and I expected an idealized tv-ad experience. The trip was not framed as "primitive camping" to me.
I was dealing with intense burnout, stress, ADHD symptoms, immune problems, trouble sleeping... And I was thrown into the desert in the summer with a tent and some beer. It fucking sucked sooo bad. It fucking sucked sooo bad that I forgot every stupid problem I had, because I spent the entire time in survival mode. Setting up camp. Hauling equipment up and down dunes. Staying hydrated in the 100°F+ heat. Making food. Making sure my wife and friends were ok. Strategizing how to defend our camp from bugs and psychos.
I really have not had such an existentially-dense experience as that one. And no, I didn't take any mushrooms, as the rest of the group did. I wanted to be lookout. Maybe I come from a long line of hyperaware sentries.
Forced labor was absolutely the norm for the pre-modern state, and provided the bulk of the workforce [1].
AFAIK humanity is yet to produce a society where the majority of farm laborers are fully free to leave the land they work on (whether via having their papers confiscated, their wages held until the season ends, by having transport provided to a remote farm but the trip back withheld etc). We've seen improvements in the degree of freedom, particularly over the past century and especially the past 50 years, but it's still very low compared to urban dwellers.
Absolutely. My wife and I lived for a year in an off-the-grid cabin in some mountains in Mexico.
We had solar panels and a generator we used only when absolutely necessary. We were never without power, but we lived with the constant anxiety of optimizing our energy consumption. Some stuff we could only do during the day and at night we only used devices with batteries.
For a couple of weeks we didn't have running water in the cabin because we were rebuilding our water deposit tower. We used buckets for everything.
That was almost a decade ago and I still feel grateful at having unlimited energy or running water on demand.
I also feel guilty at times when doing power hungry stuff like playing video games, knowing electricity production is by far the biggest driver of climate change.
> Absolutely. My wife and I lived for a year in an off-the-grid cabin in some mountains in Mexico.
I think everyone ought to do a week in an RV with no connections to utilities. Not to take away from your story, but a similar scenario comes up when we "dry camp" (no water or electrical connections): resources are not unlimited. We have solar panels, big-ass inverter and big-ass battery to go with it. But if we want lights at night, best not run that 1100W microwave for too long, because the panels won't keep up and the battery isn't that big. We have a built-in generator, but unlike most RV owners, we are loath to use it. It's almost like a game, and if that generator fires up then we've lost.
You want to let the water run while you brush your teeth? Go right ahead, our water tank is plenty big...oh, wait, but the holding tanks aren't. Shut that tap off before there's dirty water coming up through the shower. Speaking of showers, use the outside shower, as the holding tanks won't hold enough for your 30 minute, piping-hot shower.
Point of it all is that one quickly learns that it all has to come from somewhere, and it has to go somewhere after you've dirtied it. I'd like to think that it has made the both of us more conscious of our usage.
There's nothing like being at sea, 100+ miles from civilization, reliant on the limited capacity systems on your vessel. You manage your food, you manage your water consumption, fuel, electrical usage, you're closely attuned to the weather, the sea state, the charts. There are no other visible people or people-made objects out to the horizon in all directions. If something breaks, you'd better know how it works and be able to fix it, or go without. It feels very freeing, but also provides a "back to basics" accountability.
Standing under a hot water shower with unlimited water in a spacious home shower afterward feels luxurious.
Or even better, go backpacking in the wilderness. Slightly different set of constraints: you can usually find water (at least where I hike), but carrying all your equipment and food on your back gives you a new perspective on what's "essential".
I lived in Miami during hurricane Wilma and spent like a week without electricity. You realize how quickly things go south without electricity flowing.
The most impressive thing to me is toilets. Just click a button and your waste disappears. Don't know where it goes or how it gets there and pay almost nothing for the privilege.
Toilets are amazing and I feel privileged every time I use one. Girlfriend thinks I'm nuts.
Well, you didn't just find a cave, it was made for you by other people. Interdependence is a hallmark of social species such as Homo Sapiens. Even your caveman ancestors were probably reliant on one another in many ways.
>It seems that someone asked the great anthropologist, Margaret Mead, “What is the first sign you look for to tell of an ancient civilization?” The interviewer had in mind a tool or article of clothing. Ms. Mead surprised him by answering, “a healed femur (thigh bone)”. When someone breaks a femur, they can’t survive to hunt, fish or escape enemies unless they have help from someone else. Thus, a healed femur indicates that someone else helped that person, rather than abandoning them and saving only themselves.
Not to mention that many of the skills needed by the original cavemen to survive are gone in today's society. In other words, if we were to compete with the original cavemen in their environment, we would most likely fare rather poorly, at least in the short term.
Not trying to glorify off-the-grid living or anything, but I think it's interesting to think that in some (very specific) ways, the cavemen were actually superior to us.
> I'm really just a cave man that found a nicer cave.
You aren't really - most cavemen didn't even understand that fire is possible, and wouldn't be able to consistently operate a lighter if they found one (it'd probably be put on an altar and worshipped instead, as it should). You might not be able to build your entire cave, but your education alone is a _huge_ advantage!
Surely we are way past the point where someone knows how the whole thing works, all the way down.
I doubt even a very skilled engineer would know how his own machine works all the way down. What I think happens mostly is the skilled dev can use his experience to know where to investigate and where to look for solutions.
The question is organisational. Might it be that certain orgs have gotten so convoluted that they cannot do this investigation on an org level? Essentially, letting the right people look in the right places, unhindered by politics, legitimate security concerns, and practicality?
You'd think there'd be a limit to scale at some point. A bit of redundancy makes sense. There's probably a lot of people with multicloud setups patting themselves on the back at the moment.
I do contracting dev work and my specialty is being able to drill down into any part of the engineering assets, ops, sec, dev. People think someone like me is slow and expensive until they have a problem that no one else wants to touch.
Not at all diminishing what you do, but surely you have a limit past which you say "that's outside of my expertise, or what's reasonable for me to gain expertise given the scope of this issue"?
For instance, I manage a team that does "full stack" development, where full stack means I regularly interact with mechanical and manufacturing, operations, electrical engineers, battery and radio people, embedded developers, mobile, and most aspects of backend engineering. We had an issue where one of our chip suppliers changed their FW, didn't tell us, and we literally were taking apart units to get to the bottom of why units off the line weren't working properly. We go pretty deep. Still, at some point we throw our hands in the air and say "Hardware is hard, it's in the name."
This was meant to be in the context of hosting software services on AWS. Certainly there is a limit. If an MBP gets a crack in the case, I'm not going to figure out how to machine a piece of aluminum into a new case, I'll replace the laptop.
You know how AWS virtualization works and how to diagnose a problem with the AWS networking stack? Yes, and yes assuming that everything AWS is responsible for is operating within spec. Obviously, I don't have access to their switches, and cannot see anything at layer 1 or 2.
> I doubt even a very skilled engineer would know how his own machine works all the way down
Knowing how it works, and being able to build a new one, are also two very different problems. For example, there are plenty of Computer Science folks who learned how to design chips (layout the circuits, write the microcode, etc) - but you need a whole extra background in EE and Physics to be able to fab said chip...
Many programmers complete some kind of nand2tetris[1] style course where they go from basic hardware primitives (the NAND gate) all the way up through a small von Neumann architecture computer that can be programmed with a simple homebrew machine code. Even if they don't, a good CS undergraduate program should cover a lot of it, and since most EEs can program at least a little they probably get a pretty good top-to-bottom understanding as well.
The problem is that this is really only possible with a toy model of a computer and very simple programs. Modern chips with their branch prediction and caching and threading and advanced vectorized operations and so on are vastly more complex. The 6502[2] was perhaps the last chip that one person could fully grok. Maybe a chip designer at Intel or AMD could understand the whole circuit in detail but no one else has the time - it would literally be a full time job. The same thing is true for operating systems - even if you're Raymond Chen, you can know a lot about Windows, but you can't know everything.
We learn just enough about the other parts of the system to convince ourselves that we understand the principles. We build the basic mental model we need to interact with other systems but all we can really do is focus on our own specialized areas and hope that everyone else is doing their job. This works well enough until something like Spectre[3] or Meltdown[4] crops up and that's when we realize that we've been building castles in the sand.
I mean you can (and I have) etch your own circuit boards, but that's obviously not at the scale you need for anything other than primitive processing (70s 8-bit at MHz scale), and even then you're just looking at the next layer down (how to make your own chemical wash and get copper onto a board) as a barrier if we're truly talking about 'from scratch'.
We really depend on three things - knowledge (stored collectively and in various media eg books), materials (tools and manufactured precursor goods, available via active supply chains or existing stores), and most importantly having our basic needs met trivially so that all our time is not sucked up addressing them.
A scenario where someone has a 'wasteland' to pick over for their basic needs, knowledge and materials looks quite different to a return to primitive living where what nature provides is all there is to work with. 'If you want to bake an apple pie, first you must invent the universe' or however it goes...
Then of course there's the question of why someone would have any interest in obtaining computing power were either of those scenarios to occur. Much like the 'how do we warn future civilisations about our nuclear waste' problem perhaps it is acceptable to not bother, they'll figure it out again on their own eventually given enough time.
This stuff is fun to think about at 6am when hay fever is preventing my sleep :)
> There's probably a lot of people with multicloud setups patting themselves on the back at the moment.
And there's probably an equal number troubleshooting why it didn't failover the way it should, while their upper management starts questioning what they're paying for.
> Surely we are way past the point where someone knows how the whole thing works, all the way down.
I've met a few people who can rightfully lay claim, but yeah, an incredibly rare set of skills.
That said, there is a recent revival in building systems from the ground up. While you can't manufacture your own transistors, it is quite possible to understand everything from simple logic gates to ALUs to older style CPUs and memory buses.
I built a toy CPU in software once as an exercise. I started with "class Transistor" (wrapping an AND op) and "class Wire" (wrapping a boolean), and wired them together incrementally to make gates, flipflops, registers, etc.
I eventually got a fully-functioning 32-bit cpu with instruction pipelining, two levels of cache, DMA input/output, an asynchronous bus, a custom assembly language with an assembler written in python, and got the Game of Life running on it.
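For a flavour of what "wiring them together incrementally" can look like, here is a small sketch that builds gates and an adder out of a single NAND primitive. This is not the commenter's code. It uses plain functions instead of `Transistor` and `Wire` classes, and every name in it is made up for illustration.

```python
def nand(a, b):
    """The only primitive; every other gate below is built from it."""
    return not (a and b)


def not_(a):
    return nand(a, a)


def and_(a, b):
    return not_(nand(a, b))


def or_(a, b):
    return nand(not_(a), not_(b))


def xor(a, b):
    n = nand(a, b)
    return nand(nand(a, n), nand(b, n))


def full_adder(a, b, carry_in):
    """Add three bits; return (sum_bit, carry_out)."""
    partial = xor(a, b)
    sum_bit = xor(partial, carry_in)
    carry_out = or_(and_(a, b), and_(partial, carry_in))
    return sum_bit, carry_out


def add(x, y, width=8):
    """Ripple-carry adder: chain one full adder per bit, lowest bit first."""
    carry = False
    result = 0
    for i in range(width):
        a = bool((x >> i) & 1)
        b = bool((y >> i) & 1)
        bit, carry = full_adder(a, b, carry)
        result |= int(bit) << i
    return result, carry


print(add(100, 55))   # (155, False)
print(add(200, 100))  # (44, True): 300 overflows 8 bits
```

Flip-flops, registers, and the rest of a CPU follow the same pattern of composing a slightly bigger part from parts that are already understood.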
> I could not even try to discover which berries are edible without killing myself.
Cluster berries, from raspberries to pineapples, are never poisonous. Avoid berries that resemble blueberries or currants unless you're able to identify the plant: we grew up with blueberries and know the leaves, but we avoid anything currant-like because we'd have no idea if they're actually, say, chokeberries. Avoid anything that looks like baneberries.
Here in Maine, we forage for raspberries, blackberries, wild strawberries, and (mostly low bush) blueberries, but don't risk others.
You won't find enough berries, never mind edible berries, if everyone in your area suddenly shifted to foraging. Game animals would be exhausted quickly or they would migrate further out from human settlements. Even if you had the skills, hunting or foraging isn't all that useful anywhere around a city, especially if everyone else is doing it.
True, fruit and nut-based food forests, like those in the Pacific Northwest [1], seem to provide a significant, sustainable food source. Berries make a nice dessert + vitamins a few times a year.
Historically here in Maine, the core diet seems to have been seafood, freshwater fish, maize, Capreolinae, game birds, eggs, honey, roots, and greens. While only a tiny fraction of fish/seafood remain, deer are over-populated and make a fine sustainable food source, the limitation mostly being the contemporary appetite for venison.
The edible himalayan blackberry infestation that plagues all of the coastal PNW is widely available. It's almost impossible to kill and it fruits for long periods of time.
Blueberries are easy, they have a star shaped "opening" on the bottom. Native Americans called them starberries. No other berry is blue and has that. It's the only berry I trust myself to eat while I'm hiking.
I would recommend you buy a local book on foraging. Keep in case of emergency. But give it a read (at least the first few chapters) so you can get a basic understanding of how to forage without killing yourself. I also recommend keeping viable seeds and a camping shovel around as an insurance policy.
These items aren’t in my earthquake bag (I have enough energy bars to last until the National Guard shows up). Instead these are for a Carrington Event type of solar storm, civil war or some sort of other long-term disaster.
On the seeds front, you really have to be practicing growing food from seed for several years before depending on them for basic caloric needs - after a few years of providing a fraction of our household calories on the property I can see the pitfalls, effort and planting diversity needed were we to need to scale it to that level. The previous me would have had some seeds and a dream, and have died real quick. Even now I give myself 50/50 that water, weather, pests, poor soil, or something unexpected would lead to starvation.
Yes we provide a few hundred annual calories from seed. Not nearly enough to survive. But hopefully enough to learn from while foraging or enough to link up with actual experts who might just be lacking in seeds or labor.
But you could probably devise a scheme by which you feed all of your students an assortment of berries and figure out which ones are safe based on which students get sick or die.
You are close. You first rub the berry on your skin (or leaf, or whatever). Wait 24 hours to see if a rash develops. Then you taste it, wait another 24 hours. Then you eat one, and see if you get sick after another 24 hours. Now you can eat several, and build up from there.
Yes, that is a lot of time to go hungry and testing just one item. And then you still don't know what actually gives you nutrition vs just not killing you (for example leaves that you can break down such as leaf lettuce, vs eating grass).
Assuming this is in the context of some apocalyptic event requiring you to do this, relying on animal husbandry seems obviously wrong. Plus you only get ~10% of the energy from the lower trophic level.
The prevalence of meat comes from a society of abundance.
Evidence of early hominids and other less advanced proto-human or human groups shows a pretty significant amount of calories came from meat. Some suggest 60-80% of calories came from proteins, largely meat, at various times in history.
Both of these estimates are way higher than what we know people eat today, where meat and dairy are 18% of worldwide calorie consumption (27% in the US).
So I think the abundance we see today is actually due to the availability of non-animal dietary sources.
That is the Universal Edibility Test and gets repeated ad nauseam in all the survival circles. You would miss out on some fine choice foods if you did that.
Stinging Nettles (Urtica dioica) is one of them. Pokeweed (Phytolacca americana) too.
Source - used to teach these skills before it was cool to be a "survivalist" on TV and social media.
Yes, I don't know how someone figured out that if you cook pokeweed and change the water multiple times, you can finally eat it without it killing you.
As someone born in NYC, growing up with bi-monthly Boy Scout meetings and yearly "wilderness camps" (pitching tents in open fields, pit latrines, war games/survival, etc.) really helped fill in that gap :)
We are the advanced civilization that has built those hot baths. Being an advanced civilization, it's safe to assume that no single person knows all the knowledge necessary to build another hot bath, because it has long surpassed how much one person can learn in a lifetime.
But somehow there are multiple organizations that "know" how to build another hot bath, and newer and bigger baths are continuously being built all across the Empire.
And occasionally one of them stops working and thousands of citizens are angry, because they feel, being honest citizens of the Empire, they are entitled to enjoy these hot baths. Sometimes their very livelihood depends on the baths running.
For the past couple of weeks, I've been a beginner-intermediate mechanic trying to breathe life into an aging car.
Sometime in the next few months, I have to troubleshoot and fix the broken 2 yr. old refrigerator. Someone came and fixed it once; now it's out of warranty, and fixing it would cost about 50% of its purchase price. Meanwhile I'm glad I didn't throw away the 10 yr. old refrigerator and just moved it to the garage. We just have to keep going to the garage.
I also have to play the accountant for my consulting business pretty soon. This is a task I had outsourced for years and have now started doing myself.
As stuff gets more specialized, I've started noticing that I'm able to do moderately complicated things better than professionals paid at the 50th - 70th percentile. If I want to get a really good job done, my rule of thumb is to be ready to shell out money in the 90th percentile range and look for references.
In case of AWS, I guess the Greasemonkey scripts are getting too complicated ;)?
The remarkable thing is that today no one knows how to “fill up the baths”, or to do more than a small part of the job. Teams exist with extremely narrow expertise. But if anything, there are more options today for DIY infrastructure - way easier to be more advanced than “run the Apache on the server box.”
I don't buy this. I've written some pretty complicated codebases at previous companies that no one knew how to operate except for me. After I left those companies they didn't fold or lose all their customers. They adapted and everything is fine. For whatever reason humans find simplicity through complex processes.
I liked that. It might be even weirder, though, for us in the new age. We'll have robots (AI) running everything, and why things are happening will degrade into unknown unknowns. Engineering and critical thinking may become a lost art.
Or worse, we will be sitting in the hot baths and the hot spring where the water comes from will get hotter and hotter and we won’t notice until suddenly a rapid change in temperature boils the water and burns us to death.
Tangentially related: If you enjoy this sort of idea in fiction form, I can't recommend Josiah Bancroft's The Books of Babel series (beginning with Senlin Ascends) enough.
On-prem, maybe, but if you include co-located equipment and managed hosting I don't even think it's more rare in absolute terms. Just smaller as a percentage of overall hosting.
There's (still?) a lot of on-prem and managed hosting. It's probably the majority of hosted services. Otherwise VMware wouldn't be doing as well as it is.
https://downdetector.com/status/aws-amazon-web-services/