Semi-related: if you ever feel the need to report times to a global audience, not only make sure to always report the timezone (even if it is the same as the user's), but also use UTC offsets rather than timezone names.
Life is too short to remember what each timezone name means and to convert to it; UTC offsets are much easier on the mental calculator.
It's also not too complicated to add a few lines of JavaScript that show the date/time in the user's local time zone (via Date.getTimezoneOffset) as well.
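Something along these lines (a minimal sketch using the Intl API rather than getTimezoneOffset directly; the function name and sample timestamp are illustrative):

```ts
// Render a UTC timestamp both as UTC and in whatever zone the browser reports.
function renderTimes(isoUtc: string): string {
  const d = new Date(isoUtc);
  const utc = d.toLocaleString("en-GB", { timeZone: "UTC", timeZoneName: "short" });
  // Omitting timeZone makes toLocaleString use the browser's zone;
  // resolvedOptions().timeZone lets us name it explicitly next to the value.
  const localZone = Intl.DateTimeFormat().resolvedOptions().timeZone;
  const local = d.toLocaleString("en-GB", { timeZoneName: "short" });
  return `${utc} / ${local} (${localZone})`;
}

console.log(renderTimes("2023-06-13T19:08:00Z"));
```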
GCP's various products have gotten a lot better at this lately, but just a few months ago I could click around between various dashboards and explorers, some showing the time in UTC, some in your browser's tz, and some in your profile's tz (if I recall correctly). Some of them were showing the tz, and for some you had to guess. Sometimes you had multiple tzs on the same page. Sometimes the date picker for a control was in one tz and the widget it was controlling in another (leading to quite a lot of confusion).
The worst offence IMO was not showing the tz at all. Especially given the overall lack of consistency.
We do this in all of our web apps. It's pretty simple and dramatically improves UX when you have customers that are doing a lot of scheduling.
Showing both at the same time is peak design for me personally. UTC compares for relative sequencing, local time for "was that before or after I ate lunch".
Still show the timezone it's displayed in; users aren't always fully aware of which time zone their browser thinks they are in (sounds stupid at first, but imagine you just traveled somewhere and are temporarily used to that time zone while your laptop is still set to your home time zone, or maybe it isn't and you just thought it was, or maybe you're simply confused because you've switched time zones 4 times in the last 24 hours, etc.).
Also report it using IATA time zones (America/Los_Angeles), at least in addition to (I'd argue instead of) those abbreviations, which are completely unstandardized and not unique.
If the world were fair, we’d be calling these Eggert time zones, as Paul Eggert (longtime tzdata maintainer until the copyright trolls came) invented them; but it isn’t.
(You probably still meant IANA the Internet org not IATA the aviation one.)
Everyone knows "Mountain Time". It is when you go to the mountains on vacation, and don't spend much time adhering to a strict schedule, instead taking leisurely strolls around the fields and promising vague things like "I'll try to be back for dinner".
3 years ago, when I started work for my current employer, I noticed in Slack that everyone was reckoning time in "Standard Time" year-round. Now imagine my chagrin because I live in Arizona, and "Mountain Standard Time" does not change for DST. Therefore, all my coworkers were citing nonsensical, nonexistent time zones and it was messing up my ability to convert back and forth.
Come to find out that this was some sort of entrenched, company-wide standard that was deliberately imposed. I made a lot of noise about this and appealed to some rather highly-placed directors, because I felt like it was wildly inaccurate and deceiving people; if you schedule a meeting in EDT but you say it's in EST, and we have employees all around the world, who's going to know? You're inviting off-by-one errors. Especially with me who lives permanently in MST.
3 years on, I've been unable to change this fundamentally; while a few people acknowledge DST, 90% of the company still adheres to this crazy false standard.
I just had someone asking me if I'm available at 5pm EST.
Also, your clock can get confused driving North from PHX to Zion National Park.
In summer you start in Mountain Standard Time, drive into the Navajo Nation which does observe Mountain Daylight Time, continuing through the Hopi Reservation, which is Mountain Standard Time. Then you end up back in the Navajo Nation with Mountain Daylight Time. You keep on driving towards Page, which is in Mountain Standard Time. However, when you cross the AZ/UT state border you're back in Mountain Daylight Time.
How does this generate off-by-one errors? I am also part of a company with employees in pretty much every timezone, but when they create a meeting the invitation is programmed with the correct timezone, so my calendar always shows what time the meeting is going to be for me. I never even have to think about which timezone the organizer is in...
The off-by-one error occurs when you announce an event in Standard time but really mean Daylight time, or vice versa. While those local to the time zone will often automatically correct this mistake either consciously or unconsciously, those in other time zones (especially where Daylight time isn't used or is on a different schedule) will tend to rely on time conversion tools which take a literal interpretation of the scheduled time, resulting in the person being an hour early or an hour late.
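A tiny worked example of that literal interpretation (dates are just illustrative):

```ts
// "5 PM EST" announced in June: a conversion tool reads EST literally as UTC-5,
// but the announcer almost certainly meant EDT (UTC-4).
const literalEst = new Date("2023-06-13T17:00:00-05:00"); // what the tool computes
const intendedEdt = new Date("2023-06-13T17:00:00-04:00"); // what was actually meant

const diffHours = (literalEst.getTime() - intendedEdt.getTime()) / 3_600_000;
console.log(diffHours); // 1 -- you show up an hour off
```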
The fact that you have to announce timezones is already an error. If I need to schedule a meeting I don't need to select a timezone; it's taken from the timezone I'm in. There's never room for error by "picking" the wrong thing, since there's nothing to pick. And if my system is configured with the wrong timezone, then every single meeting will be off by N and my calendar will show the wrong time as "now". It would be impossible to miss such an error.
I think your company needs better tools to handle meetings.
It's the same at my company. Teams and Zoom both automatically schedule meetings in every attendee's own time zone. Maybe that person's company still does phone meetings or something.
We don't use any automatic scheduling with Zoom or Google Calendar. Management doesn't send invites to those meetings, they just publish the link on Slack and we have to figure out how to get it into our calendars.
Trust me, at least once I missed a meeting because I was late by an hour due to time zone confusion.
I mostly struggle with Irish Standard Time (used for DST in Ireland) and Indian Standard Time which have the same acronym. :(
Thankfully, I learnt a long time ago to use ISO 8601 and UTC for dates and times. I still revert to PST/PDT if my audience is primarily left coast based.
And I can't say it's ever actually caused a problem, but something about Indian Standard Time being a half-hour offset from UTC has always bothered me so much... But now we're fully off-topic.
And if you've been American since birth, and live in Arizona, one might still not know, since PDT and Mountain Time alternate covering Arizona seasonally. ("Ask me how I know.")
It can also vary within Arizona... one of the most confusing times in my life was driving from California through the Navajo Reservation in AZ on my way to an appointment. Was my cell phone giving me the local time on the reservation? Was it connecting to a cell tower just outside the reservation, giving me DST-less Arizona time? Or a tower slightly further away in Utah (DST)? Or was it giving me the time on the Hopi Reservation, which is an enclave totally surrounded by the Navajo Reservation and which uses AZ time?
Even in Australia, AEST has a DST flavour and a non-DST one. Queensland does not observe DST while the other states do. You can drive around a roundabout at the border and switch timezones for fun. Or go down there to celebrate the new year twice.
Or go from rabbits being OK to some 5 figure fine if you're caught with one :)
> Life is too short to remember what each timezone name means and converting to it, UTC offsets are much easier on the mental calculator.
Many people also get the timezone names completely wrong. I've had multiple scheduling email exchanges where someone says X pm EST not realizing that at the time it's currently EDT and that EST ≠ EDT.
And yet, for some reason, the two-letter abbreviations (e.g., ET) that are technically correct year-round, never seem to have caught on in the wild.
I've given up on the abbreviations and just say "Eastern" now to avoid confusion.
Fair enough, but please only use that with strictly USA audiences. (And remember that public information likely will not be targeted to strictly USA audiences)
Names don't carry any information intrinsically, they are only a reference to the actual information, and the offset information is pretty short, so why not just provide the information directly?
"X pm GMT-3" only requires the reader to know their own timezone offset, unlike "X pm Brasilia time" (which is inaccurately known as São Paulo time outside Brazil) or "X pm BRT", which requires the reader to both know what that timezone means, and their own (or, more likely, requires them to look the conversion up).
(And if the difference between GMT and UTC is significant, I hope it didn't take my comment to convince you about using offsets :> )
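To make the "mental calculator" point concrete, something like this is all the reader has to do (function name and example values are illustrative):

```ts
// Convert "X pm GMT-3" to the reader's local clock using only their own offset.
function toLocalClock(hour24: number, announcedOffsetHours: number): number {
  // getTimezoneOffset() is minutes *behind* UTC, hence the sign flip.
  const myOffsetHours = -new Date().getTimezoneOffset() / 60;
  return (hour24 - announcedOffsetHours + myOffsetHours + 24) % 24;
}

// "6 pm GMT-3": on a machine currently at UTC+2 this prints 23.
console.log(toLocalClock(18, -3));
```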
UTC is human readable even if it is not calculated correctly. Yes, I'm saying that if you can read epoch seconds, you're not human. 1970-01-01 00:00:00 is always a giveaway that something is afoot.
"anchored on" then? I might be wrong but we're both talking about showing time as distance from the same starting point are we not? One's just more human readable so that's why I say why not just use that? Seconds since can be miscalculated too, especially if current time isn't known/reliable
Nothing worse than people who say "9 AM my time". I suppose it's OK if it's Pacific vs Mountain, but even there Arizona doesn't observe Daylight, and parts of eastern Oregon are Mountain, not Pacific.
Never mind dealing with India, Australia, etc etc.
OK to use local time in your statement, just say what that time is.
The inconsistency with timezones across different services in the AWS console has always baffled and annoyed me. Some places have a time without a timezone, and I can never tell right away if it's UTC, local time, or region time.
> The inconsistency [of everything, everywhere] in the AWS console
ftfy
AWS is powerful and very popular, but for the console, "it functions" must be the only condition the UI has to satisfy. Should every page use a unique table and sorting widget and UI language? Yes, please!
I'm assuming this helps them move fast, not having to coordinate with anybody or wait for a UI designer to tell them how it should look. But it's striking when compared to GCP.
Technically PDT is always 7 hours behind UTC. PST is always 8 hours behind. We just change which one we use twice a year. Pacific time makes sense when you realize Fremont is the center of the universe.
Yup, this is why I always say “US ET” (I'm on the east coast). I don't trust myself or anyone else to get it right, and if the other party is converting anyway, their conversion tool (google?) should be able to handle that. (Of course, the date is necessary but implicit, but that's usually fine too.)
Indeed. There are Americans who will tell me PST, when they meant PDT but forgot to mention that. Now I have to track the American DST calendar as well as European DST calendar to do the conversion.
There are also people who tell me GMT (because they think that term means "the time in London") when they meant BST (because in summer, London doesn't operate on GMT).
The outage is in Virginia so PDT isn't even local time. On their status page they are asking users to access the console via a region specific endpoint like https://us-west-2.console.aws.amazon.com. Wonder if the PDT timestamp is because they have to serve the status page from US West right now.
The fact that the biggest complaint is which timezone was used in the announcement is a sign of progress... AWS announced it pretty quickly, gave nice updates, and seems to have fixed the problem quickly enough. I'm interested to see the postmortem...
When I was with AWS I advocated for ISO 8601 "Z" whenever I could or needed to influence, say, internal systems.
If all systems talked this way we'd save tens of thousands of man-hours; just do the conversion for us mortals where needed. The tech side of incidents is definitely "system", and I'd argue more often than not consumers of AWS are also on the tech side with systems running in UTC, so health dashboards should also be a UTC-first system. Doubt this could get prioritized though.
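Roughly the idea, sketched (the display zones here are just examples):

```ts
// UTC-first: systems exchange ISO 8601 "Z" timestamps; only the edges localize.
const eventTime = new Date().toISOString(); // e.g. "2023-06-13T19:08:00.123Z"
// ...stored, logged, and passed between services exactly as-is...

function displayFor(isoZ: string, timeZone: string): string {
  return new Date(isoZ).toLocaleString("en-US", { timeZone, timeZoneName: "short" });
}

console.log(displayFor(eventTime, "America/Los_Angeles")); // for us mortals
console.log(displayFor(eventTime, "Asia/Kolkata"));
```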
Imagine you have two browsers open on the same page, one showing UTC, another showing your local time.
There is no indicator showing which time zone is used. You have to mentally correlate "this browser window is logged in as..." with the time shown on screen.
After spending some time in the Canary Islands I realized how nice it was to be in UTC all the time and now I have my laptop clock set to UTC. Still contemplating whether I should set Google Calendar and my smartwatch to UTC as well. 8-)
It doesn't matter if your infra is in another region, because there will almost always be transitive dependencies on us-east-1. IAM, for instance, is deployed in us-east-1.
I have never had a production issue in other regions due to a us-east-1 outage. The worst that ever happened was I had to wait to update a Cloudfront distribution because the control plane (based in us-east-1) was down, but the existing configuration continued working fine throughout.
I don't know what the architecture of IAM looks like, but somehow it's never suffered a global outage.
Certificate Manager is also down (I know because I tried to update an SSL cert for CloudFront, which only allows us-east-1 SSL certs; maybe someone will eventually fix that to allow any region to hold the SSL cert for CloudFront).
My Whole Foods grocery pickup order was affected by this outage. They couldn’t check me in. Groceries were packed in the fridge but they told me to come back later. What a waste of time.
Are you just guessing it's related or is there a reason to be sure? Either way, it does seem silly to have no manual fallback 'check-in' (or policy-level ability to bypass the need for it) mechanism.
I've known 'just take your shopping and go' and 'let me make a note of your details', but never 'put your bags down and bugger off but please do come again'.
And why would ordered and paid-for shopping be so inaccessible? It's not going to be locked away in a vending-machine type thing, right? You just need to show up with ID/log-in and collect it. Or so it should be.
I wonder if this is a coincidence or if us-east-1 is simply down enough that I'm just experiencing selection bias; but I posted a poll on twitter earlier today: https://twitter.com/dijit/status/1668678588713824257
Contents:
> Has anyone ever actually had customers accept an outage because AWS was down; or is this just cloud evangelicalism copium?
I guess a demanding customer would have said 'you should have implemented disaster recovery so you could failover to us-east-2' but that's easier said than done. The more regional AWS services you adopt, the bigger the impact is. How does one recover from a regional outage if their pipeline is in that region?
What I did once I was in the position of _having_ to provide that level of support, was to run the pipeline in a third region, different from the "prod" ones. That way, worst case you can't do deployments during the outage...
Another alternative studied was to use a third-party CI/CD service, outside of our network. It was discarded bc you never know where that would actually run
> It was discarded bc you never know where that would actually run
Yep, I considered that switching to GitHub Actions would _theoretically_ eliminate the need for disaster recovery for CI/CD (since the handling of disasters is out of your hands) but in practice their SLA is far worse than just running CodePipeline in a single region.
Yeah, that's why we went with a third region instead. But, at the end of the day, if _only_ changes are affected for a couple of hours, that wouldn't impact the service that much
I’ve worked for several systemically important megacorps where certain things had to not only run cross region but also cross provider. It’s absurdly difficult, and only should be done if you need five or more 9’s of availability. Almost nothing actually does.
it's important to inform customers about the resiliency of their systems and let them pick how far they are going to invest for it.
then you get to eat popcorn when stuff explodes.
* single server event. $
* multi server event. $$
* single az event. $$$
* multi az event. $$$$
* global provider event. $$$$$
* cross provider event. $$$$$$
* alien invasion. $$$$$$$$$$$$$$
Back when we had servers in an onsite DC we lost a RAID card and the system I was developing went down. We had the fancy support, so a tech was out with the card replaced in a couple of hours, then we had to restore from tape backup. All in all, a non-critical system was down for most of a business day. My boss's boss stormed in, upset he couldn't pull a report, and asked how we'd prevent this in the future. I responded that at a minimum we'd have to double the cost for a hot standby, and he said 'never mind' and walked out.
Multi-planetary-AZ DB cluster deployments. Putting the emphasis on "Eventual" in eventual consistency.
Go for a walk before retrying reading from this replica!
And you thought the current time zone confusion was bad. Now you have two sets of time zones, and a varying delay of about 5 to 21 minutes between them. Oh, the joy!
My employer lets customers choose which of our supported regions to run in and exempts cloud provider outages from our SLA (we’re on the hook for staying up for single AZ outages, but not multi AZ or region outages). We provide tools to help customers replicate their data so they can be multi-region or even multi provider if they want to.
AZs don't really help when it's AWS' own services across the entire region that break. Anecdotally, we have had customers accept outages that were out of our control without penalty.
Wild, that wouldn't have flown with datacenter providers having issues for my previous companies.
AWS really does have an easier time than old school datacenter providers. I guess the complexity is higher but it's shocking that they can charge so much yet we hold them to a lower standard.
DCs are pretty static and offer way fewer services than AWS or any other public cloud.
I worked for one for some time and whenever we had issues, some people would call and ask if we were going bankrupt. It gave me a feeling they also have way smaller customers that might not understand the underlying stack.
If all you use in AWS is static EC2 instances you would have to go back a looooong time to find an outage which affected their availability. Even in us-east-1.
Outage rates are also wildly different. When you're using dozens of managed services and have a few prod-impacting outages with any reasonable (cross-AZ) design, customers are less sensitive than when they are dependent on dozens of products that have independent failure modes with potentially cascading impact.
AZs also don't help with natural disasters at all. I believe AWS is the only one doing geographically distributed AZs; for the others it just means separate connections and placement somewhere else in the building.
edit: turns out AWS is the one with geo distribution, not Azure
AWS AZs are also distributed geographically within a region, with separate power and network lines. From the docs: "Availability Zones are distinct locations within an AWS Region that are engineered to be isolated from failures in other Availability Zones."
Ah, you are probably right. I was thinking of the incident a few weeks back where the fire suppression took out multiple AZs, but that was actually GCP.
> Has anyone ever actually had customers accept an outage because AWS was down...
Whether customers "accept" it or not just comes down to what's in your SLA, if you have one in the first place, and if they are on a contract tier that it applies to. [Many services provide no SLA for hobby / low tiers, beta features, etc.]
Firebase Auth, for instance, offers no SLA at all [1].
I would be curious to see statistics across a range of SLAs for what % include a force majeure or similar clause which excludes responsibility for upstream outages. I would expect this to be more common with more technical products / more technical customers.
Ok fine. Running your own datacenter in 2023 is incredibly risky. There's the upfront server cost and the ongoing maintenance cost. There's patches and staffing and disaster planning and all the other things that go into it. Plus there's the cyberinsurance and protections and security components too.
Do you really think other (smaller) orgs can do a better job at hosting a datacenter than Amazon / Google / Microsoft / Cloudflare? They have some of the brightest minds in the industry working there, and they can price things far below anything you can build yourself.
Yes, I get it. All the computer processing power in a handful of actor's hands is probably not the most fantastic thing. However with the price of some cloud vendors compared to the DIY approach, it's hard for organizations to ignore.
If you really want to combat this, make the cost of running your own data center less. Reduce risk. Reduce the amount of money it costs for hiring good people or MSP's. Reduce the cost of acquiring and installing hardware.
Organizations pay attention to dollars so if you want the trend to shift, come up with a less costly alternative to the current cloud offerings.
> Running your own datacenter in 2023 is incredibly risky.
There are middle grounds.
But let's be honest: 99% of companies have never done the napkin math, because nobody ever got fired for choosing IBM^W AWS.
We joked about this in my company: we had a variable-load thing that we used autoscaling in the cloud for, but it had a baseline load that purchasing a real machine might have made a lot of sense for. The napkin math probably checked out. We never suggested it more than jokingly, though, because even when we suggested it jokingly, we got shut down: "You don't understand the cost of that." No, actually, we jokingly did enough math that we do understand, better than the people criticizing us did. We never did it.
Whenever the "own it" argument comes up, eveybody is real quick to hop on the "but maintenance cost" train. But as I perceive it, those who believe in the cloud budget exactly $0 for maintenance of managed cloud resources. As someone who's only done cloud, that number is unadulterated bullshit: the number of hours I've had to spend chasing cloud vendors to do the job that we're paying for is just silently flying under the budget radar. In the minds of the finance books, I'm 100% SWE, but in reality, I'm 75% SWE, and 25% support ticket monkey.
At least with a real machine, it'd be interesting, and I'd have some agency to actually solve the problem. As a support ticket monkey, I'm utterly powerless. I'm tired of having to beg.
That's not to say I'd move everything off cloud; I actually think the vast majority of what we do is well-suited for cloud, mostly because upper management can't make up their mind about product direction enough to be able to say "yes, we can purchase this and we'll use it." But those nuggets of stability do happen from time to time.
> disaster planning
"Disaster planning" is something every org wants, because they're trying to tick the box with the regulator. But the requirements that get passed down border on absurd: "what if a meteor hit AWS and they were never able to recover from it?" … we're literally never going to plan for that, because the $ needed for that level of eng. work is not going to happen. A sane scenario would be "can we handle an AZ outage?" (or, let's start there, and maybe, maybe if we can get that down pat, then we can graduate to regional outages.)
> cyberinsurance
… you don't get out of this via being in the cloud, if you need it. (I wish we did, because ours pushes some utter inane requirements.) I can mismanage a machine in a DC just as easily as I can mismanage a VM in the cloud.
> Organizations pay attention to dollars
No they don't. This oft-repeated mantra is nonsense. Finance dept. get an invoice that has a total; even were they to have access to the finer billing information, they're not technical, and cannot understand it. I've yet to be at a company that's dedicated sufficient resources towards infra eng such that we could do the legwork necessary to present a sane organizational view of what cloud infra dollars go to what high-level objectives or teams. The resource tagging isn't there, and even if it were, some things cannot be tagged, and you still have to aggregate bills from a dozen different vendors, and then figure out what weights to apply to shared resources across OUs. I'm on employer #4? and have yet to see anyone scratch the surface of that.
Which is why you see articles about cloud $ waste all the time.
What happens far more often in my life is someone from management descending with "why are we spending $X on Y?", where $X is usually an order of magnitude wrong, or Y is … something we're not even doing anymore? And then you have to go round the mulberry bush of "how did you arrive at that figure?" "okay so here's what those numbers mean" "here you're adding $/mo and $/yr and you can't do that"
> Do you really think other (smaller) orgs can do a better job at hosting a datacenter than Amazon / Google / Microsoft / Cloudflare?
Than Microsoft? Absolutely yes. The others, probably not.
> Yes, I get it. All the computer processing power in a handful of actor's hands is probably not the most fantastic thing.
The long-term end state of not investing money into R&D is that it is centralized into those who do, and you become beholden to them. You get what you pay for, here. It's not good, and I think there's discussion to be had around that, but my real problem is the cognitive dissonance that follows. If you want to centralize on one of the cloud duopoly, then you also need to acknowledge that your own eng cannot be held responsible for the cloud's reliability: they have no control over it.
Excellent response, thank you. To expand on the cybersecurity aspect - think of services like Cloudflare WAF and DDoS protection. These services are very easy for orgs to implement and do a really good job at covering 95% of threats quickly.
Could you imagine a 1,000 person org with a 20 person IT department rolling their own DDoS solution?
But yes you are also right, cyber insurance is still required, and even AWS touts an expectation of a “shared responsibility” model.
I’m still skeptical that cloud hosted offerings are a bad thing. For a long time there were only Ford, Chrysler and Chevy in America, then foreign imports became popular, then a few years ago Tesla became a contender.
I still think new entrants can come into the cloud space, particularly in Europe, but they need to do their due diligence and understand their competitors offerings very well.
I don’t think any individual provider would be significantly more reliable. But it would make the landscape more diverse if people did not put all their eggs in the big three providers.
which is legitimate - if only you're down then you're losing business to your competitors and failing those who rely upon you; if everyone's down it's a wash. And frankly it's not like you're going to have significantly better uptime by going against the crowd.
Yes. This morning I found my Roomba (i9 something IIRC) sitting idle in the middle of the kitchen. When I launched the iOS app, it apologised that it couldn't connect due to AWS being down (I regret not taking a screenshot).
Usually us-east-1 is deployed to after several smaller regions. Usually it'll fall in the middle of the week depending on the pipeline.
Just because a feature is there on launch day doesn't mean it was deployed to first. Features are often hidden behind flags that are switched for launch.
I'm well aware of that, but the point is that when the feature is ungated to the public, it's in us-east-1 and gets all that load, and more load than the rest because of the fact that a lot of big customers are based in us-east-1, including much of Amazon itself.
Those are not single-region services. Changes must be executed there, but the data is replicated globally. If you don’t need to make changes in the context of those services, they will keep working in the other regions even during an incident in the primary region.
us-east-1 is the largest region, so it is where changes meet scale.
It is also a massively complex beast in itself spanning dozens of datacenters with massive amounts of fiber between them. Much more fragile than having everything in a single building and as you scale up the number of components you increase the rate of failure.
Touché. Still I'd rate the overall reliability of AWS higher than Azure; and even if that weren't the case, security issues make Azure look like a very poor choice.
I don't think this article has any value. Are you only counting region-wide outages? US East is probably 10x the size of any other region, with more AZs than any other region.
No definitely not. Usually pipelines deploy over 1-2 week periods, and they don't deploy on Fridays/holidays/high-traffic periods like December.
Deployments start off very conservative, maybe 1-2 small regions on the first day of deployments. As you gain confidence, the pipeline deploys to more regions/bigger regions.
A pipeline that deploys to 22 regions over one week might go from 2 small regions on monday, 4 small/medium regions on tuesday, 8 medium/large regions on wednesday, 8 regions on thursday.
us-east-1 is usually going to be deployed to on the wednesday/thursday in this example, but that isn't always the case because sometimes deployments are accelerated for feature launches (especially around re:invent), or retried because of a failure.
There are best practice guides within Amazon that very closely detail how you should deploy, although it is up to the teams to follow them, which they usually do an okay job of.
I have a suspicion that AWS uses some regions as canaries. Because we control both ends of things, I have personally noted that certain AWS functions clearly break in Australia first.
When I worked there, there were few hard and fast rules. Every team had its own release processes, so there was a lot of variance. It has been a couple of years, so this may have changed.
Typically, a team would group their regions into batches and deploy their change to one batch at a time. Usually they follow a geometric progression, so the first batch has one region, the second batch has two regions, the third batch has four regions, and so on. This batching was performed for the sake of time; nobody wants to wait a month for a single change to finish rolling out.
One reason not to deploy to us-east-1 in the first batch is so you don't blow up your biggest region. The fewer customers you break, the better.
One reason not to deploy to us-east-1 in the last batch is that there are a lot of batches. If a problem is uncovered after deploying the last batch, then someone has to initiate rollbacks for every single region.
Some teams tried to compromise and put us-east-1 in one of the earlier batches.
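My rough reading of that batching, sketched (region list and wave sizes are just illustrative, not Amazon's actual tooling):

```ts
// Split regions into geometrically growing deployment waves: 1, 2, 4, ...
// The input here is ordered so us-east-1 lands in a middle wave, not first or last.
function planWaves(regions: string[]): string[][] {
  const waves: string[][] = [];
  let size = 1;
  for (let i = 0; i < regions.length; i += size, size *= 2) {
    waves.push(regions.slice(i, i + size));
  }
  return waves;
}

const regions = ["eu-west-3", "ap-south-1", "eu-north-1", "us-west-2",
                 "us-east-1", "eu-west-1", "ap-northeast-1", "us-east-2"];
console.log(planWaves(regions)); // waves of 1, 2, 4, and 1 regions here
```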
Say what one wants about Amazon or how they treat delivery drivers. My son told me of some experiences he's had--including having his car totaled on an icy river bridge at four in the morning--and Amazon was on top of it, empathetic, and did more than what was expected.
I had a similar reaction. Oh no WTF, how did I break that?! Then my buddy texted me about us-east-1 being down. Then I thought, "Oh thank god, this shitshow is someone else's fault."
You generally want to use a region close to your users, so right off the bat, us-east-1 and us-east-2 are the obvious choices for most East Coast companies. If I were starting a new project, I'd probably go us-east-2, but if your company has been on the cloud long enough, us-east-2 might not have existed when your foundational infrastructure was created. And for most companies, going multi-region is an expensive, difficult proposition that might not be worth it.
Plus, as others have noted, there are critical AWS services in their control plane that only run in us-east-1 behind the scenes. So you're kind of out of luck.
There are some services (CloudFront, for example) which require this region. It's not that much harder to have multiple regions in your deployment, but putting everything in one is simpler for smaller, startupy orgs.
It’s not required during an incident. The data for these services is globally replicated. It’s only if you need to make changes that you might be impacted if you’re already successfully operating out of another region.
No, I really don't like any of the serverless or amplify type of frameworks that AWS produces.
What I needed was a distributed store for managing a large number of small configuration "files" and other state "links" across the infrastructure I built. In particular, I needed the ability for write conflicts to be detected and managed immediately and for consistent reads to be available in some cases.
Looking at the size of data and number of transactions, DynamoDB on a per-request model was going to be significantly cheaper and easier than standing up a bunch of DB instances or other "fixed" infrastructure.
The ultimate model was to actually implement something like a POSIX filesystem on top of the "nosql" layer. In practice, I've been really happy with that decision, and it makes working on the code that interacts with this system very familiar and easy to understand. It's even got symlinks, acls, and automatically expiring advisory locks.
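For anyone curious, the DynamoDB primitives that make that workable are conditional writes and consistent reads; a rough sketch (table layout, names, and versioning scheme here are mine, not the poster's actual design):

```ts
import { DynamoDBClient } from "@aws-sdk/client-dynamodb";
import { DynamoDBDocumentClient, GetCommand, PutCommand } from "@aws-sdk/lib-dynamodb";

const ddb = DynamoDBDocumentClient.from(new DynamoDBClient({}));
const TABLE = "config-fs"; // illustrative table name

// Detect write conflicts immediately: the put only succeeds if the item is new
// or still at the version we read (otherwise ConditionalCheckFailedException).
async function writeEntry(path: string, body: string, expectedVersion: number) {
  await ddb.send(new PutCommand({
    TableName: TABLE,
    Item: { path, body, version: expectedVersion + 1 },
    ConditionExpression: "attribute_not_exists(#p) OR #v = :v",
    ExpressionAttributeNames: { "#p": "path", "#v": "version" },
    ExpressionAttributeValues: { ":v": expectedVersion },
  }));
}

// Consistent reads where the "filesystem" semantics need them.
async function readEntry(path: string) {
  const out = await ddb.send(new GetCommand({
    TableName: TABLE,
    Key: { path },
    ConsistentRead: true,
  }));
  return out.Item;
}
```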
I used it because early on in the project I wanted to use features for IoT that were only available on us-east-1 initially, as well as lambda@edge which was on us-east-1 only at the time.
If you’ve already set up your ACM cert, you won’t be impacted during an incident in us-east-1. It’s only if you need to make changes that you could be impacted.
As a side note, I wonder if businesses won't even accept cash if they can't go through their POS system. If not, it's a shame that these modern internet connected POSs lock out stuff like that.
Depending on country and exact POS setup, they might not be able to take cash if POS is down.
For example, in Poland, your typical restaurant or shop needs to generate a tax receipt (as well as properly calculate the tax), and uses either a separate receipt POS device or a POS with an appropriate receipt printer (the devices are certified and, for example, print two copies simultaneously, one for the client and one for the seller, or use digitally-signed storage for the seller's copy).
If the POS isn't designed properly to operate in case of network failure... welp, can't take cash either, at least not legally.
I was at a swim meet last week, and one of the food trucks was using Apple Tap to Pay because Toast didn't have a solution that worked for them, on site. After they finish up at an area, they then enter a single transaction for all of the day's business into Toast.
It's fun watching each service fail sequentially while the aws service dashboard just updates them to "Informational" status, whatever that means.
Even the management console is down, and their suggested region-specific workaround does not work, at least for us-east-1. I can see some processes via the API, but I don't have code prepared for monitoring every service from my local machine.
Yep, it has issues so frequently. I wonder how many companies/teams start using AWS and blindly choose us-east-1 without realizing what they're getting into.
<rant>
It's also quite annoying sometimes that some things _need_ to be in us-east-1, and if e.g. you are using Terraform and specify a different default region, AWS will happily let you create useless resources in regions that aren't us-east-1 that then mysteriously break stuff because they aren't in this one blessed region. AWS Certificate Manager (ACM) certificates are like this, I believe.
ACM certificates themselves can be had in any region (and you can use them for stuff like ELBs), but since the Cloudfront control plane is in us-east-1, if you want Cloudfront (and IIRC, also if you want custom domain names for an S3 bucket, but don't quote me on that) you'll have to create an additional certificate in us-east-1.
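A sketch of the usual workaround, for anyone hitting this: pin whatever creates the certificate to us-east-1 (in Terraform that's an aliased provider with region = "us-east-1" passed to the aws_acm_certificate resource; below is the SDK equivalent, with illustrative names):

```ts
import { ACMClient, RequestCertificateCommand } from "@aws-sdk/client-acm";

// CloudFront only accepts certificates that live in us-east-1, so the ACM
// client is pinned there no matter where the rest of the stack runs.
const acmUsEast1 = new ACMClient({ region: "us-east-1" });

async function requestCloudfrontCert(domain: string) {
  const out = await acmUsEast1.send(new RequestCertificateCommand({
    DomainName: domain,
    ValidationMethod: "DNS",
  }));
  return out.CertificateArn; // attach this ARN to the CloudFront distribution
}
```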
I worked with a devops person who moved everything we had set up in other regions _to_ US-East-1 because that is where you are supposed to run stuff. According to him, the other regions were just for DR stuff.
What's interesting is that I can still access my EKS cluster, but none of the deployments are "ready" that have LBs attached to them. Pods can create fine though!
I kicked off a Redshift cluster in every region, they've all run and completed, except for `us-east-1`, which is stuck creating the cluster. Been about an hour now.
Seems like it took IMDB with it. Surprised that Amazon is not able to keep their own property up when one of their zones goes down. Not a great example.
I'm not sure it's the case here, but the issue with these cloud providers is that they use their own services to maintain their infrastructure. That's why, when something like Lambda gets degraded (which it would not shock me to learn they use everywhere), you start to see random crap like the console and IAM go down as well.
Weird, I didn't notice the actual outage at all except high ping to a non-AWS server/IP on the west coast. Normally the latency is ~85ms, today it has been >170ms. SSH is basically unusable at that latency and even bandwidth is very low (not sure how that could be).
You'd think the largest cloud would be resilient to a zone going down, but I guess not.
I find it stupid that clients get to know about regions at all. They should only notice latency hits and batch job queuing latency if something bad happens underneath, but no services should go down.
> At that time, we began processing the backlog of asynchronous Lambda invocations that accumulated during the event, including invocations from other AWS services. As of 3:37 PM, the backlog was fully processed
pretty cool that stuff was stored in a backlog and eventually processed!
I am guessing they mean an SQS queue … so basically SQS is doing what it is supposed to (a manual version is sketched below):
- try to process the event
- send it to the DLQ if it fails
- there's a redrive button in the DLQ to … well, redrive the events after the Lambda is fixed
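Roughly what a redrive amounts to, done by hand (queue URLs are made up; the console button does this for you):

```ts
import {
  SQSClient, ReceiveMessageCommand, SendMessageCommand, DeleteMessageCommand,
} from "@aws-sdk/client-sqs";

const sqs = new SQSClient({});
const DLQ_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/orders-dlq"; // illustrative
const MAIN_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/orders";    // illustrative

// Move messages from the DLQ back to the source queue once the consumer is fixed.
async function redrive(): Promise<void> {
  while (true) {
    const { Messages } = await sqs.send(new ReceiveMessageCommand({
      QueueUrl: DLQ_URL, MaxNumberOfMessages: 10, WaitTimeSeconds: 2,
    }));
    if (!Messages?.length) return; // DLQ drained
    for (const m of Messages) {
      await sqs.send(new SendMessageCommand({ QueueUrl: MAIN_URL, MessageBody: m.Body! }));
      await sqs.send(new DeleteMessageCommand({ QueueUrl: DLQ_URL, ReceiptHandle: m.ReceiptHandle! }));
    }
  }
}
```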
the last big us-east-1 outage was ... DNS - and it's usually DNS or software-defined core networking causing these cascading failures
Loss of DNS causes inter-service API calls to fail, then IAM and all other services fail. Anything not built to handle those situations with backoff causes a 'stampeding herd' of failure/retry and exacerbates the outage.
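The backoff in question, sketched (parameters are arbitrary; the point is the exponential cap plus jitter so clients don't retry in lockstep):

```ts
// Retry with exponential backoff and full jitter to avoid a retry stampede.
async function withBackoff<T>(op: () => Promise<T>, maxAttempts = 6): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await op();
    } catch (err) {
      if (attempt + 1 >= maxAttempts) throw err;
      const capMs = Math.min(30_000, 200 * 2 ** attempt); // exponential, capped
      const sleepMs = Math.random() * capMs;              // full jitter
      await new Promise((resolve) => setTimeout(resolve, sleepMs));
    }
  }
}
```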
Some parts are expensive (EC2, NAT and VPC endpoint proliferation) but others are simple and inexpensive (regional API Gateways, latency based routing, DDB global tables, lambdas, state machines).
I'm so glad my demo today was specifically about local inference on... Windows. I guess I finally found an upside to doing ML outside Linux; we don't have Windows VMs on AWS!! :)
Had our login and other features go down less than an hour after I altered our prod schema and thought I did something wrong. What a relief it was to see this.
The second law of thermodynamics guarantees that there will always / eventually be downtime… and that’s okay. Design for downtime. Shameless self-plug for https://heiioncall.com/ for free website / HTTP endpoint monitoring and cron job monitoring if you want to know about your own app’s downtime.
EDIT: thanks to those of you who have signed up in the past few minutes! Let us know if you have any feedback.
"eventually" is doing a lot of work in that sentence. Our sun will last another 5 billion years, and the heat death is something like 100 trillion years away.
Though a lot of practical thermal-related causes of electronics failure seem to operate on timescales of years to decades, like electromigration https://en.m.wikipedia.org/wiki/Electromigration or even just cooling fan bearing failure. And I don’t think it would be a huge stretch to point to electromigration as a case of diffusion, a natural entropy increasing process, re-randomizing the arrangement of atoms within a transistor (and therefore making it fail eventually).
AWS can do so many things; reporting critical outage updates in UTC is not one of them.