Hacker News
AWS us-east-1 down
658 points by rurp on June 13, 2023 | 313 comments
The status page says everything is fine though.



For those wondering: Currently PDT is 7 hours behind UTC.

AWS can do so many things, reporting critical outage updates in UTC is not one of those things.


Semi-related: if you ever feel the need to report times to a global audience, not only make sure to always report the timezone (even if it is the same as the user's), but also use UTC offsets rather than timezone names.

Life is too short to remember what each timezone name means and to convert to it; UTC offsets are much easier on the mental calculator.


It's also not too complicated to add a few lines of javascript that show the date/time in the user's local time zone (via Date.getTimezoneOffset).
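A minimal sketch of that, assuming a modern browser where Intl.DateTimeFormat is available (the function name and sample output are illustrative):

    // Render a UTC timestamp in the viewer's local time, including the
    // zone name and offset so it is never ambiguous.
    function renderLocalTime(utcIso: string): string {
      const date = new Date(utcIso);
      const formatted = new Intl.DateTimeFormat(undefined, {
        dateStyle: "medium",
        timeStyle: "long", // "long" includes a zone name, e.g. "PDT" or "GMT-7"
      }).format(date);
      // Date.getTimezoneOffset() returns minutes *behind* UTC, so negate it
      const offsetMin = -date.getTimezoneOffset();
      const sign = offsetMin >= 0 ? "+" : "-";
      const abs = Math.abs(offsetMin);
      const hh = String(Math.floor(abs / 60)).padStart(2, "0");
      const mm = String(abs % 60).padStart(2, "0");
      return `${formatted} (UTC${sign}${hh}:${mm})`;
    }

    // renderLocalTime("2023-06-13T19:08:00Z")
    // -> e.g. "Jun 13, 2023, 12:08:00 PM PDT (UTC-07:00)" for a Pacific viewer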


As long as you still show which tz it is! :)

GCP's various products have gotten a lot better at this lately, but just a few months ago I could click around between various dashboards and explorers, some showing the time in UTC, some in your browser's tz, and some in your profile's tz (if I recall correctly). Some of them were showing the tz, and for some you had to guess. Sometimes you had multiple tzs on the same page. Sometimes the date picker for a control was in one tz and the widget it was controlling in another (leading to quite a lot of confusion).

The worst offence IMO was not showing the tz at all. Especially given the overall lack of consistency.


We do this in all of our web apps. It's pretty simple and dramatically improves UX when you have customers that are doing a lot of scheduling.

Showing both at the same time is peak design for me personally. UTC compares for relative sequencing, local time for "was that before or after I ate lunch".


Still show the timezone it's displayed in; users aren't always fully aware of which time zone their browser believes they are in (sounds stupid at first, but imagine you just traveled somewhere and are temporarily used to that time zone, but e.g. your laptop is still set to your home time zone, or maybe it's not and you just thought it was, or maybe you are just a bit confused because you switched time zones 4 times in the last 24 hours, etc.)


Honestly this is the worst, when there is no timezone marker and some times are in browser time and others aren't.


Or when logged times in the past change depending on today’s daylight savings setting.


given how long it took AWS to add support for Ed25519 ssh keys (literally just fix the validation regex), I wouldn't hold your breath



Also report it using IATA time zones (America/Los_Angeles), at least in addition to (I'd argue instead of) those abbreviations, which are completely unstandardized and not unique.


If the world were fair, we’d be calling these Eggert time zones, as Paul Eggert (longtime tzdata maintainer until the copyright trolls came) invented them; but it isn’t.

(You probably still meant IANA the Internet org not IATA the aviation one.)


Is there context on this? I'd be interested in the history of what happened and why Eggert is no longer involved.



It's also usually extremely US-centric. Nobody outside of North America has any idea what "PDT" or "Mountain Time" means.


Everyone knows "Mountain Time". It is when you go to the mountains on vacation, and don't spend much time adhering to a strict schedule, instead taking leisurely strolls around the fields and promising vague things like "I'll try to be back for dinner".


Closely related, yet distinct from, Island Time


3 years ago, when I started work for my current employer, I noticed in Slack that everyone was reckoning time in "Standard Time" year-round. Now imagine my chagrin because I live in Arizona, and "Mountain Standard Time" does not change for DST. Therefore, all my coworkers were citing nonsensical, nonexistent time zones and it was messing up my ability to convert back and forth.

Come to find out that this was some sort of entrenched, company-wide standard that was deliberately imposed. I made a lot of noise about this and appealed to some rather highly-placed directors, because I felt like it was wildly inaccurate and deceiving people; if you schedule a meeting in EDT but you say it's in EST, and we have employees all around the world, who's going to know? You're inviting off-by-one errors. Especially with me who lives permanently in MST.

3 years on, I've been unable to change this fundamentally; while a few people acknowledge DST, 90% of the company still adheres to this crazy false standard.


This is why I always write ET instead of EDT/EST.

I encourage everyone at my company to do the same. Easy way to eliminate errors while typing one less keystroke!


However, if DST is in effect and you live in AZ, you must write "MST" in order to be understood.


This is the way.


I just had someone asking me if I'm available at 5pm EST.

Also, your clock can get confused driving North from PHX to Zion National Park.

In summer you start in Mountain Standard Time, drive into the Navajo Nation, which does observe Mountain Daylight Time, continuing through the Hopi Reservation, which is Mountain Standard Time. Then you end up back in the Navajo Nation with Mountain Daylight Time. You keep on driving towards Page, which is in Mountain Standard Time. However, when you cross the AZ/UT state border you're back in Mountain Daylight Time.

My clock threw a segmentation fault.


One of the saddest pieces of code I ever wrote was to treat "MST" as always meaning America/Denver. I'm sorry.


I passive-aggressively ask everyone who mentions a specific time in EST or BST during daylight time if they meant EDT or BDT.

I literally had cases where I was woken up in the middle of the night for a production issue because some people are too sloppy about this kind of thing.


How does this generate off-by-one errors? I am also part of a company with employees in pretty much every timezone, but when they create a meeting the meeting invitation is programmed with the correct timezone so in my Calendar it always shows what time the meeting is going to be for me. I never even have to think what timezone the organizer is...


The off-by-one error occurs when you announce an event in Standard time but really mean Daylight time, or vice versa. While those local to the time zone will often automatically correct this mistake either consciously or unconsciously, those in other time zones (especially where Daylight time isn't used or is on a different schedule) will tend to rely on time conversion tools which will take a literal interpretation of the scheduled time and result in the person being an hour early or an hour late.


The fact that you have to announce timezones is already an error. If I need to schedule a meeting I don't need to select timezones, they're already selected from the timezone I'm part of. There's never room for error by "picking" the wrong thing, since there's nothing to pick. And if my system is programmed with the wrong timezone, then every single meeting will be off-by-N and my calendar will show the wrong time as "now". It would be impossible to miss such an error.

I think your company needs better tools to handle meetings.


It's the same at my company. Teams and Zoom both automatically schedule meetings in every attendee's own time zone. Maybe that person's company still does phone meetings or something.


We don't use any automatic scheduling with Zoom or Google Calendar. Management doesn't send invites to those meetings, they just publish the link on Slack and we have to figure out how to get it into our calendars.

Trust me, at least once I missed a meeting because I was late by an hour due to time zone confusion.


I mostly struggle with Irish Standard Time (used for DST in Ireland) and Indian Standard Time which have the same acronym. :(

Thankfully, I learnt a long time ago to use ISO 8601 and UTC for dates and times. I still revert to PST/PDT if my audience is primarily left coast based.


And I can't say it's ever actually caused a problem, but something about Indian Standard Time being a half-hour offset from UTC has always bothered me so much... But now we're fully off-topic.


Oh, hold on to something, while I tell you about Chatham Islands.


> I mostly struggle with Irish Standard Time (used for DST in Ireland) and Indian Standard Time which have the same acronym. :(

Heh. After the first few instances of confusion, we switched to saying Bangalore time and Dublin time.


Left coast? That's a term I've never heard of.


Not to mention they conflict. CST can be "Central Standard Time", "China Standard Time" or "Cuba Standard Time" and so forth...


And if you've been American since birth, and live in Arizona, one might still not know, since PDT and Mountain Time alternate covering Arizona seasonally. ("Ask me how i know.")


It can also vary within Arizona... one of the most confusing times in my life was driving from California through the Navajo Reservation in AZ on my way to an appointment. Was my cell phone giving me the local time on the reservation? Was it connecting to a cell tower just outside the reservation, giving me DST-less Arizona time? Or a tower slightly further away in Utah (DST?) Or was it giving me the time on the Hopi reservation, which is an enclave totally surrounded by the Navajo Reservation which uses AZ time?


Unnecessary reflux obtained during US-Australian collaboration from insufficiently specific references to "east coast time".


Even in Australia, AEST has a DST flavour and a non-DST one. Queensland does not observe DST while the other states do. You can drive around a roundabout at the border and switch timezones for fun. Or go down there to celebrate the new year twice.

Or go from rabbits being OK to some 5 figure fine if you're caught with one :)


my limited understanding of local pejoratives suggests that it remains traditional to sledge QLD for being one hour and twenty years behind


I believe it's traditional to reply that NSW was 8 points behind in the only yardstick those north of the border care about.

https://en.wikipedia.org/wiki/State_of_Origin_series


Probably if you're using AWS, you do, but it would be much more convenient if they just used UTC by default with an option to localize.


And even if the time is UTC, please indicate this.


> Life is too short to remember what each timezone name means and to convert to it; UTC offsets are much easier on the mental calculator.

Many people also get the timezone names completely wrong. I've had multiple scheduling email exchanges where someone says X pm EST not realizing that at the time it's currently EDT and that EST ≠ EDT.

And yet, for some reason, the two-letter abbreviations (e.g., ET) that are technically correct year-round, never seem to have caught on in the wild.

I've given up on the abbreviations and just say "Eastern" now to avoid confusion.


Fair enough, but please only use that with strictly USA audiences. (And remember that public information likely will not be targeted to strictly USA audiences)

Names don't carry any information intrinsically, they are only a reference to the actual information, and the offset information is pretty short, so why not just provide the information directly?

"X pm GMT-3" only requires the reader to know their own timezone offset, unlike "X pm Brasilia time" (which is inaccurately known as São Paulo time outside Brazil) or "X pm BRT", which requires the reader to both know what that timezone means, and their own (or, more likely, requires them to look the conversion up).

(And if the difference between GMT and UTC is significant, I hope it didn't take my comment to convince you about using offsets :> )


And if it's on a forum debating an event that's about to happen soon, I find the following extremely convenient:

- the keynote will start when this post is 5 hours old

- the rocket launch is scheduled to when this comment is 30 hours old


Basically just use the output of `date -u`.


It's locale-specific, which is not great.



Use `date -u -Iseconds`, please ;-)


    date: illegal option -- I
    usage: date [-jnRu] [-d dst] [-r seconds] [-t west] [-v[+|-]val[ymwdHMS]] ... 
                [-f fmt date | [[[mm]dd]HH]MM[[cc]yy][.ss]] [+format]



Let me guess, you are on a Mac.


Probably on an older Mac or FreeBSD. On my macOS 12.6, `man date` says:

    The -I flag was added in FreeBSD 12.0.


Yup. macOS Catalina


just report it in epoch seconds


I will pass you the address of a struct timespec, please fill it in.


Or stardate ¯\_(ツ)_/¯


Just make sure not to confuse it with epoch milliseconds!


Or Swatch Internet time (.beat time). No time zones, it's always UTC+1, with the day divided into 1000 beats.


better:

time encoded as a float of trecenti-seconds since year -8435 of the Gregorian Calendar

why?

'cause it hurts to even just think about implementing that anywhere


Ignoring leap seconds?


Julian Date


that's based on UTC, so just use UTC?


based on is not the same thing though is it?

UTC is human readable even if it is not calculated correctly. Yes, I'm saying that if you can read epoch seconds, you're not human. 1970-01-01 00:00:00 is always a giveaway that something is afoot


"anchored on" then? I might be wrong but we're both talking about showing time as distance from the same starting point are we not? One's just more human readable so that's why I say why not just use that? Seconds since can be miscalculated too, especially if current time isn't known/reliable


I wish it was based on TAI though.


Nothing worse than people who say "9 AM my time". I suppose it's OK if it's Pacific vs Mountain, but even there Arizona doesn't observe Daylight, and parts of Eastern Oregon are Mountain, not Pacific.

Never mind dealing with India, Australia, etc etc.

OK to use local time in your statement, just say what that time is.


The inconsistency with timezones across different services in the AWS console has always baffled and annoyed me. Some places show a time without a timezone and I can never tell right away if it's UTC, local time, or region time.


> The inconsistency [of everything, everywhere] in the AWS console

ftfy

AWS is powerful and very popular, but for the console, "it functions" must be the only condition the UI has to satisfy. Should every page use a unique table and sorting widget and UI language? Yes, please!

I'm assuming this helps them move fast, not having to coordinate with anybody or wait for a UI designer to tell them how it should look. But it's striking when compared to GCP.


I've been told that each service is responsible for their own UI.


> AWS can do so many things, reporting critical outage updates in UTC is not one of those things.

Thank you for reminding me about one of my biggest mildest annoyances from working at AWS.


Technically PDT is always 7 hours behind UTC. PST is always 8 hours behind. We just change which one we use twice a year. Pacific time makes sense when you realize Fremont is the center of the universe.


Yup, this is why I always say “US ET” (I'm on the east coast). I don't trust myself or anyone else to get it right, and if the other party is converting anyway, their conversion tool (google?) should be able to handle that. (Of course, the date is necessary but implicit, but that's usually fine too.)


True in theory; in practice people often get it wrong and use the incorrect one.


Indeed. There are Americans who will tell me PST, when they meant PDT but forgot to mention that. Now I have to track the American DST calendar as well as European DST calendar to do the conversion.

There are also people who tell me GMT (because they think that term means "the time in London") when they meant BST (because in summer, London doesn't operate on GMT).


Most Americans, at least non-engineers, will incorrectly say PST, EST, etc year-round when they actually mean PT, ET, etc.

I have found this site very helpful for linking people to when they are confused or using them incorrectly:

https://time.is/PT

https://time.is/PST

https://time.is/PDT

The time zone comparison feature is nice as well:

https://time.is/compare/0800AM_14_June_2023_in_Cincinnati/Lo...


The outage is in Virginia so PDT isn't even local time. On their status page they are asking users to access the console via a region specific endpoint like https://us-west-2.console.aws.amazon.com. Wonder if the PDT timestamp is because they have to serve the status page from US West right now.


The fact that the complaint is about which timezone the announcement uses is a sign of progress... AWS announced it pretty quickly, gave nice updates, and seems to have fixed the problem quickly enough. I'm interested to see the postmortem...


When I was with AWS I advocated for ISO 8601 "Z" whenever I could or needed to influence things, say internal systems.

If all systems talked this way we'd save tens of thousands of man-hours; just do the conversion for us mortals where necessary. The tech side of incidents is definitely "system", and I'd argue more often than not consumers of AWS are also on the tech side with systems in UTC, so health dashboards should also be a UTC-first system. Doubt this could get prioritized though
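A minimal sketch of what "UTC first" looks like in code (purely illustrative): systems exchange the ISO 8601 "Z" string, and only the presentation layer converts for humans.

    // Internal systems emit and store ISO 8601 timestamps with an explicit "Z".
    const stamp = new Date().toISOString();             // e.g. "2023-06-13T19:08:00.000Z"

    // Only at the edge (a dashboard, an email) is it converted for mortals.
    const forHumans = new Date(stamp).toLocaleString(); // viewer's locale and time zone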


https://aws.amazon.com/about-aws/whats-new/2022/09/aws-healt...

If you log in, you can specify which timezone to use for timestamps, and the text will be parsed into your timezone preference.


I've set the option on the below page to UTC

https://health.aws.amazon.com/health/status#settings

As I'm logged in, it persists across browser sessions.


It's worse.

Imagine you have two browsers opening the same page, one showing UTC, another showing your local time.

There is no indicator showing which time zone is used. You have to mentally correlate "this browser window is logged in..." with the time shown on screen.


After spending some time in the Canary Islands I realized how nice it was to be in UTC all the time and now I have my laptop clock set to UTC. Still contemplating whether I should set Google Calendar and my smartwatch to UTC as well. 8-)


I thought it uses your browser time zone, is it not?


No. It's all PDT.


Says the same here and I'm on the other coast.


It doesn't matter if your infra is in another region, because there will almost always be transitive dependencies on us-east-1. IAM, for example, is deployed in us-east-1, so there will always be a transitive dependency on it.


I have never had a production issue in other regions due to a us-east-1 outage. The worst that ever happened was I had to wait to update a Cloudfront distribution because the control plane (based in us-east-1) was down, but the existing configuration continued working fine throughout.

I don't know what the architecture of IAM looks like, but somehow it's never suffered a global outage.

AWS is really, really good at regional isolation.


>I don't know what the architecture of IAM looks like, but somehow it's never suffered a global outage.

Authentication possibly, but the control plane has gone down preventing changes.


I have.

Not being able to update your existing resources is still an outage from a DevOps perspective.

It might be an API level outage vs an end-user level outage from your customer's perspective, but if the functionality is down, it's an outage.


I think the data plane is regional


Control plane will almost always be impacted, I agree.

Our data plane was fine (for example, ec2 instances and s3 buckets in other regions were fine).


Usually it only prevents changes, but the runtime isn't affected.


I thought there was some recent shift on making IAM multi-region?


So much for redundancy I guess.


https://health.aws.amazon.com/health/status reports:

  Increased Error Rates and Latencies
  Jun 13 12:08 PM PDT We are investigating increased error rates and latencies in the US-EAST-1 Region.
They list Lambda as the only affected service


I suppose "increased error rates and latencies" is technically true when the error rate is 100% and the latency is "until we fix it"


status page won't load for me. are they still hosting their status page on their own infrastructure?


Perhaps they should host it on GCP


Ouch


It is a perfectly cromulent practice to host a replica of your status page at your competitor.


They've added a dozen or so more as potentially down now. Anything that uses IAM, which I suspect is the core of the issue.


doesn't every service use IAM?


They have 41 services listed now.


certificate manager also down (I know because I tried to update an ssl cert for cloudfront which only allows US-East-1 ssl certs, maybe someone will eventually fix that to allow any region to have the ssl cert for cloudfront)


> cloudfront which only allows US-East-1 ssl certs

This seems like an odd limitation. Do you know the technical reason?


CloudFront is a global service according to AWS (I believe you pay more if you want your content in CDNs in more/different regions' edges).


the status page doesn't even open.


My Whole Foods grocery pickup order was affected by this outage. They couldn’t check me in. Groceries were packed in the fridge but they told me to come back later. What a waste of time.


Are you just guessing it's related or is there a reason to be sure? Either way, it does seem silly to have no manual fallback 'check-in' (or policy-level ability to bypass the need for it) mechanism.


Most retail stores simply stop if "the system" is down.


I've known 'just take your shopping and go' and 'let me make a note of your details', but never 'put your bags down and bugger off but please do come again'.

And why would ordered and paid for shopping be so inaccessible? It's not going to be locked away in a vending machine type thing right, you just need to show up with ID/log-in and collect it. Or so you should.


I’ve only known “sorry, the till is down” and line ups as far as the eye can see.

It's ridiculous that everything stops, but I do understand.

The pickup is even more ridiculous. Like you say, it's already through the system. They would just need to mark it as delivered later.


The Whole Foods employee working the pick-up area told me.


>They couldn’t check me in.

Are you an Amazon Flex driver? Or do you even need to check in to pick up your order as a customer?


Worth knowing: today and tomorrow are AWS re:Inforce 2023 https://reinforce.awsevents.com/.


I wonder if this is a coincidence or if us-east-1 is simply down enough that I'm just experiencing selection bias; but I posted a poll on twitter earlier today: https://twitter.com/dijit/status/1668678588713824257

Contents:

> Has anyone ever actually had customers accept an outage because AWS was down; or is this just cloud evangelicalism copium?

> [ ] Yeah, outages free pass

> [ ] No, they say to use AZ's


> No, they say to use AZ's

Using 3 AZs in us-east-1 won't save you.

I guess a demanding customer would have said 'you should have implemented disaster recovery so you could failover to us-east-2' but that's easier said than done. The more regional AWS services you adopt, the bigger the impact is. How does one recover from a regional outage if their pipeline is in that region?


What I did once I was in the position of _having_ to provide that level of support, was to run the pipeline in a third region, different from the "prod" ones. That way, worst case you can't do deployments during the outage...

Another alternative studied was to use a third-party CI/CD service, outside of our network. It was discarded bc you never know where that would actually run


> It was discarded bc you never know where that would actually run

Yep, I considered that switching to GitHub Actions would _theoretically_ eliminate the need for disaster recovery for CI/CD (since the handling of disasters is out of your hands) but in practice their SLA is far worse than just running CodePipeline in a single region.


Yeah, that's why we went with a third region instead. But, at the end of the day, if _only_ changes are affected for a couple of hours, that wouldn't impact the service that much


I’ve worked for several systemically important megacorps where certain things had to not only run cross region but also cross provider. It’s absurdly difficult, and only should be done if you need five or more 9’s of availability. Almost nothing actually does.


It's important to inform customers about the resiliency of their systems and let them pick how far they want to invest in it.

then you get to eat popcorn when stuff explodes.

  * single server event.   $
  * multi server event.    $$
  * single az event.       $$$
  * multi az event.        $$$$
  * global provider event. $$$$$
  * cross provider event.  $$$$$$
  * alien invasion.        $$$$$$$$$$$$$$


Back when we had servers in an onsite DC we lost a RAID card and the system I was developing went down. We had the fancy support so a tech was out with the card replaced in a couple hours, then we had to restore from tape backup. All in all, a non-critical system was down for most of a business day. My boss's boss stormed in, upset he couldn't pull a report, and asked how we could prevent this in the future. I responded that at a minimum we'd have to double the cost for a hot standby, and he said 'never mind' and walked out.


That sounds nice. My boss's boss is usually the one storming in, and he usually says "okay let's do it", and then I have to implement it in a week...


That’s why you always BofH the estimates to include some fun toys for yourself, too


"Briefly describe the '$$$$$$$' through '$$$$$$$$$$$$$' situations. Can't leave money lying on the table."

- memo from Enterprise Sales Dept.


Alien invasion resistant is spy novel / agents of shield level of multiple, redundant, isolated, off the normal books, safehouses + bases.

Short of alien invasion level are strategic military resistance levels to global/regional wars with differing levels of weapons and devastation.


Just need to deploy your service on Mars AND Earth. Duh


Multi-planetary-AZ DB cluster deployments. Putting the emphasis on "Eventual" in eventual consistency. Go for a walk before retrying reading from this replica!


And you thought the current time zone confusion was bad. Now you have two sets of time zones, and a varying delay of about 5 to 21 minutes between them. Oh, the joy!


note to self: synchronous replication may be a problem


Always be prepared for alien invasion


This should be logarithmic


The nice thing is that any graph without a unit can be log-scale - so in a way, it already is.


My employer lets customers choose which of our supported regions to run in and exempts cloud provider outages from our SLA (we’re on the hook for staying up for single AZ outages, but not multi AZ or region outages). We provide tools to help customers replicate their data so they can be multi-region or even multi provider if they want to.


AZs don't really help when it's AWS' own services across the entire region that break. Anecdotally, we have had customers accept outages that were out of our control without penalty.


Wild, that wouldn't have flown with datacenter providers having issues for my previous companies.

AWS really does have an easier time than old school datacenter providers. I guess the complexity is higher but it's shocking that they can charge so much yet we hold them to a lower standard.


DCs are pretty static and offer way fewer services than AWS or any other public cloud.

I worked for one for some time and whenever we had issues, some people would call and ask if we were going bankrupt. It gave me a feeling they also have way smaller customers that might not understand the underlying stack.


If all you use in AWS is static EC2 instances you would have to go back a looooong time to find an outage which affected their availability. Even in us-east-1.


December 22, 2021 was the last partial impact we had in us-east-1 for EC2 instances. They had power issues in USE1-AZ4 that took a while to sort out.


Outage rates are also wildly different. When you're using dozens of managed services and have a few prod-impacting outages with any reasonable (cross-AZ) design, customers are less sensitive than when they are dependent on dozens of products that have independent failure modes with potentially cascading impact.


AZs also don't help with natural disasters at all. I believe AWS is the only one doing geographically distributed AZs, for the others it just means different connections and placed somewhere else in the building.

edit: turns out AWS is the one with geo distribution, not Azure


AWS AZs are also distributed geographically within a region, with separate power and network lines. From the docs: "Availability Zones are distinct locations within an AWS Region that are engineered to be isolated from failures in other Availability Zones."


Ah, you are probably right. I was thinking of the incident a few weeks back where the fire suppression took out multiple AZs, but that was actually GCP.


Depends on your customers.

If your customers are tech, they're too busy running around with their hair on fire too.


> Has anyone ever actually had customers accept an outage because AWS was down...

Whether customers "accept" it or not just comes down to what's in your SLA, if you have one in the first place, and if they are on a contract tier that it applies to. [Many services provide no SLA for hobby / low tiers, beta features, etc.]

Firebase Auth, for instance, offers no SLA at all [1].

I would be curious to see statistics across a range of SLAs for what % include a force majeure or similar clause which excludes responsibility for upstream outages. I would expect this to be more common with more technical products / more technical customers.

[1]: https://stackoverflow.com/a/60500860/149428


I can think of more times where a whole AZ has had issues than times where just one AZ went dark and failover happened seamlessly.


s/whole AZ/whole region/


Maybe cheaper regions have more users and have higher outage rates


Mysterious lack of "AWS is bad for the internet because it is so centralized" dialog up in here.

edit: for those that would downvote: HN _just_ yesterday: https://news.ycombinator.com/item?id=36295352 https://news.ycombinator.com/item?id=36295305


Ok fine. Running your own datacenter in 2023 is incredibly risky. There's the upfront server cost and the ongoing maintenance cost. There's patches and staffing and disaster planning and all the other things that goes into it. Plus there's the cyberinsurance and protections and security components too.

Do you really think other (smaller) orgs can do a better job at hosting a datacenter than Amazon / Google / Microsoft / Cloudflare? They have some of the brightest minds in the industry working there, and they can price things at a much better price than anything you can build yourself.

Yes, I get it. All the computer processing power in a handful of actor's hands is probably not the most fantastic thing. However with the price of some cloud vendors compared to the DIY approach, it's hard for organizations to ignore.

If you really want to combat this, make the cost of running your own data center less. Reduce risk. Reduce the amount of money it costs for hiring good people or MSP's. Reduce the cost of acquiring and installing hardware.

Organizations pay attention to dollars so if you want the trend to shift, come up with a less costly alternative to the current cloud offerings.


Professional data center hosts host data centers very very well.

Amazon et al even contract with them.

But cloud isn’t selling rented rackspace; it’s selling APIs for billions of things. Much different.


> Running your own datacenter in 2023 is incredibly risky.

There are middle grounds.

But let's be honest: 99% of companies have never done the napkin math, because nobody ever got fired for choosing IBM^W AWS.

We joked about this in my company: we had a variable-load thing that we used autoscaling in the cloud for, but it had a baseline load that purchasing a real machine might have made a lot of sense for. The napkin math probably checked out. We never suggested it more than jokingly, though, because even when we suggested it jokingly, we got shut down: "You don't understand the cost of that." No, actually, we jokingly did enough math that we do understand, better than the people criticizing us did. We never did it.

Whenever the "own it" argument comes up, eveybody is real quick to hop on the "but maintenance cost" train. But as I perceive it, those who believe in the cloud budget exactly $0 for maintenance of managed cloud resources. As someone who's only done cloud, that number is unadulterated bullshit: the number of hours I've had to spend chasing cloud vendors to do the job that we're paying for is just silently flying under the budget radar. In the minds of the finance books, I'm 100% SWE, but in reality, I'm 75% SWE, and 25% support ticket monkey.

At least with a real machine, it'd be interesting, and I'd have some agency to actually solve the problem. As a support ticket monkey, I'm utterly powerless. I'm tired of having to beg.

That's not to say I'd move everything off cloud; I actually think the vast majority of what we do is well-suited for cloud, mostly because upper management can't make up their mind about product direction enough to be able to say "yes, we can purchase this and we'll use it." But those nuggets of stability do happen from time to time.

> disaster planning

"Disaster planning" is something every org wants, because they're trying to tick the box with the regulator. But the requirements that get passed down border on absurd: "what if a meteor hit AWS and they were never able to recover from it?" … we're literally never going to plan for that, because the $ needed for that level of eng. work is not going to happen. A sane scenario would be "can we handle an AZ outage?" (or, let's start there, and maybe, maybe if we can get that down pat, then we can graduate to regional outages.)

> cyberinsurance

… you don't get out of this via being in the cloud, if you need it. (I wish we did, because ours pushes some utter inane requirements.) I can mismanage a machine in a DC just as easily as I can mismanage a VM in the cloud.

> Organizations pay attention to dollars

No they don't. This oft-repeated mantra is nonsense. Finance dept. get an invoice that has a total; even were they to have access to the finer billing information, they're not technical, and cannot understand it. I've yet to be at a company that's dedicated sufficient resources towards infra eng such that we could do the legwork necessary to present a sane organizational view of what cloud infra dollars go to what high-level objectives or teams. The resource tagging isn't there, and even if it were, some things cannot be tagged, and you still have to aggregate bills from a dozen different vendors, and then figure out what weights to apply to shared resources across OUs. I'm on employer #4? and have yet to see anyone scratch the surface of that.

Which is why you see articles about cloud $ waste all the time.

What happens far more often in my life is someone from management descending with "why are we spending $X on Y?", where $X is usually an order of magnitude wrong, or Y is … something we're not even doing anymore? And then you have to go round the mulberry bush of "how did you arrive at that figure?" "okay so here's what those numbers mean" "here you're adding $/mo and $/yr and you can't do that"

> Do you really think other (smaller) orgs can do a better job at hosting a datacenter than Amazon / Google / Microsoft / Cloudflare?

Than Microsoft? Absolutely yes. The others, probably not.

> Yes, I get it. All the computer processing power in a handful of actor's hands is probably not the most fantastic thing.

The long-term end state of not investing money into R&D is that it is centralized into those who do, and you become beholden to them. You get what you pay for, here. It's not good, and I think there's discussion to be had around that, but my real problem is the cognitive dissonance that follows. If you want to centralize on one of the cloud duopoly, then you also need to acknowledge that your own eng cannot be held responsible for the cloud's reliability: they have no control over it.


Excellent response, thank you. To expand on the cybersecurity aspect - think of services like Cloudflare WAF and DDoS protection. These services are very easy for orgs to implement and do a really good job at covering 95% of threats quickly.

Could you imagine a 1,000 person org with a 20 person IT department rolling their own DDoS solution?

But yes you are also right, cyber insurance is still required, and even AWS touts an expectation of a “shared responsibility” model.

I’m still skeptical that cloud hosted offerings are a bad thing. For a long time there were only Ford, Chrysler and Chevy in America, then foreign imports became popular, then a few years ago Tesla became a contender.

I still think new entrants can come into the cloud space, particularly in Europe, but they need to do their due diligence and understand their competitors offerings very well.


It's just tired at this point.

Everyone knows; nobody seems to care.

Another comment of mine in this thread asks the question if you can excuse downtime of your service due to AWS outages.

Consensus seems to be: yes

which is a pretty huge deal, well worth the insane cost increase of AWS by itself. No other hosting provider would grant you such an excuse.

I would weep for the centralised future of the internet, but it's already here, so there's no point.


Even if people _do_ care, there isn't much to do about it.


If people do care they could use other hosting providers such as Hetzner or OVH, no?


Why did you come to the conclusion that Hetzner or OVH is more reliable? At least their SLA credits don't say that.


I don’t think any individual provider would be significantly more reliable. But it would make the landscape more diverse if people did not put all their eggs in the big three providers.


The economics of the situation make it so that at your job that's not very realistic.


It's a mob mentality. Safety in numbers. "Oh well, my site is down but so is my neighbour's so nobody will be that mad about it."


which is legitimate - if only you're down then you're losing business to your competitors and failing those who rely upon you; if everyone's down it's a wash. And frankly it's not like you're going to have significantly better uptime by going against the crowd.


Too techie and doing things the right way, so CF shouldn’t be successful? Therefore… jealousy? That’s my guess as to why all the hacker news hate.


Why a throwaway for this post? Not like this is some deep whistleblowing or career risk.


Maybe they work for Amazon.


I "love" it when my vacuum stops working because an online book sellers servers went down. #modernlife

This is a good reminder to avoid cloud-centric products, but they are getting harder and harder to avoid.


Did this actually happen? The vacuum part


Yes. This morning I found my Roomba (i9 something IIRC) sitting idle in the middle of the kitchen. When I launched the iOS app, it apologised that it couldn't connect due to AWS being down (I regret not taking a screenshot).


Why is it always us-east-1 though?

I have always stayed away from that region because it seems significantly less reliable than other regions.


It's the:

* Largest (DDoS'd most, most complex, scaling issues etc)

* Oldest (More time for weird idiosyncrasies to take hold)

* Where most testing happens

* Where new products are deployed first


1) and 2) certainly apply. 3) and 4) don't. Testing in the largest region is one of the biggest anti-patterns.


4 is still generally true. Most new features drop in us-east-1 on launch day.


Usually us-east-1 is deployed to after several smaller regions. Usually it'll fall in the middle of the week depending on the pipeline.

Just because a feature is there on launch day doesn't mean it was deployed to first. Features are often hidden behind flags that are switched for launch.


I'm well aware of that, but the point is that when the feature is ungated to the public, it's in us-east-1 and gets all that load, and more load than the rest because of the fact that a lot of big customers are based in us-east-1, including much of Amazon itself.


AWS doesn't test there last I checked, they roll out to smaller regions first.


Most AWS engineering is closest to (and tested in) us-west-2 (PDX) or us-east-2 (Ohio)


It's also the home of single region services...

IAM, Cloudfront ACM certs, etc


Those are not single-region services. Changes must be executed there, but the data is replicated globally. If you don’t need to make changes in the context of those services, they will keep working in the other regions even during an incident in the primary region.


It's also

* The only place where the IAM dashboard can be accessed from. I need to access it NOW. I can't.


Looking forward to Auckland coming online, which should be the opposite to most of these factors, and will make game streaming bearable (for me)


us-east-1 is the largest region, so it is where changes meet scale.

It is also a massively complex beast in itself spanning dozens of datacenters with massive amounts of fiber between them. Much more fragile than having everything in a single building and as you scale up the number of components you increase the rate of failure.


No AWS region is in a single building, they aren't amateurs like Azure. Each region is at least 3 AZs, which is at least one physical DC.


And yet it's AWS that's down.


Touché. Still I'd rate the overall reliability of AWS higher than Azure; and even if that weren't the case, security issues make Azure look like a very poor choice.


I actually just wrote about this very thing. It's not just that it SEEMS less reliable, it absolutely is:

https://statusgator.com/blog/is-north-virginia-aws-region-th...


I don't think this article has any value. Are you only counting region-wide outages? US East is probably 10x the size of any other region, with more AZs than any other region.


I suspect it's where they concentrate a lot of their control plane.


us-east-1 is AWS's oldest region, and has the most legacy infrastructure, in ways that many other regions do not.


I thought I read that this is where they deploy new changes first. Can anyone confirm?


No definitely not. Usually pipelines deploy over 1-2 week periods, and they don't deploy on Fridays/holidays/high-traffic periods like December.

Deployments start off very conservative, maybe 1-2 small regions on the first day of deployments. As you gain confidence, the pipeline deploys to more regions/bigger regions.

A pipeline that deploys to 22 regions over one week might go from 2 small regions on monday, 4 small/medium regions on tuesday, 8 medium/large regions on wednesday, 8 regions on thursday.

us-east-1 is usually going to be deployed to on the wednesday/thursday in this example, but that isn't always the case because sometimes deployments are accelerated for feature launches (especially around re:invent), or retried because of a failure.

There are best practice guides within Amazon that very closely detail how you should deploy, although it is up to the teams to follow them, which they usually do an okay job of.


I don't believe it's true. I was working on one of the biggest AWS services and we always deployed to small regions first.

@dijit is right: https://news.ycombinator.com/item?id=36315736


I have a suspicion that AWS uses some regions as canaries. Because we control both ends of things, I have personally noted that certain AWS functions clearly break in Australia first.


When I worked there, there were few hard and fast rules. Every team had its own release processes, so there was a lot of variance. It has been a couple of years, so this may have changed.

Typically, a team would group their regions into batches and deploy their change to one batch at a time. Usually they follow a geometric progression, so the first batch has one region, the second batch has two regions, the third batch has four regions, and so on. This batching was performed for the sake of time; nobody wants to wait a month for a single change to finish rolling out.

One reason not to deploy to us-east-1 in the first batch is so you don't blow up your biggest region. The fewer customers you break, the better.

One reason not to deploy to us-east-1 in the last batch is that there are a lot of batches. If a problem is uncovered after deploying the last batch, then someone has to initiate rollbacks for every single region.

Some teams tried to compromise and put us-east-1 in one of the earlier batches.
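A rough sketch of that batching scheme (illustrative only, not Amazon's actual tooling): waves that roughly double in size until the region list is exhausted.

    // Group regions into deployment waves of size 1, 2, 4, 8, ...
    function deploymentWaves(regions: string[]): string[][] {
      const waves: string[][] = [];
      let size = 1;
      for (let i = 0; i < regions.length; i += size, size *= 2) {
        waves.push(regions.slice(i, i + size));
      }
      return waves;
    }

    // deploymentWaves(["r1", "r2", "r3", "r4", "r5", "r6", "r7"])
    // -> [["r1"], ["r2", "r3"], ["r4", "r5", "r6", "r7"]]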


When I worked at AWS, IIRC, us-east-1 was one of the last regions we deployed to. So this is very confusing to me


From observing my wife's teams over the years, they deploy new _products_ early to that region, but deploying code changes starts in smaller regions.


Because it was one of the first, and it shows its age and less than rigorous rollout compared to the other zones.


My son delivers part-time for Amazon and all the drivers at his warehouse were sent home. So if your delivery is late or non-existent today....


Amazon Flex?


Yep


I heard they are paying Flex drivers for that day. Well, most of them.


They did.

Say what one wants about Amazon or how they treat delivery drivers. My son told me of some experiences he's had--including having his car totaled on an icy river bridge at four in the morning--and Amazon was on top of it, empathetic, and did more than what was expected.


I guess I'll use the downtime to see what's new on Reddi... oh... yeah.


this happened less than an hour after I altered our prod scheme, thought I brought down production, what a relief


No, you brought down us-east-1 instead! Thanks!


This is why you don't give prod credentials to the new guy!


I had a similar reaction. Oh no WTF, how did i break that?! Then my buddy texted me about us-east-1 being down. Then i thought "Oh thank god, this shitshow is someone else's fault."


If you're having problems accessing the console, the workaround is just to use a different region, eg:

https://ca-central-1.console.aws.amazon.com/console/home

This assumes you don't actually need anything from us-east-1, though :)


Why does everyone keep deploying their products to this one region when it always seems like the one that fails?

We don't use big cloud where I work, so maybe I'm missing something. Does East-1 offer something others don't?


You generally want to use a region close to your users, so right off the bat, us-east-1 and us-east-2 are the obvious choices for most East Coast companies. If I were starting a new project, I'd probably go us-east-2, but if your company has been on the cloud long enough, us-east-2 might not have existed when your foundational infrastructure was created. And for most companies, going multi-region is an expensive, difficult proposition that might not be worth it.

Plus, as others have noted, there are critical AWS services in their control plane that only run in us-east-1 behind the scenes. So you're kind of out of luck.


There are some services (CloudFront for example) which require this region. It's not that much harder to have multiple regions in your deployment, but putting everything in one is simpler for smaller startupy orgs.


It’s not required during an incident. The data for these services is globally replicated. It’s only if you need to make changes that you might be impacted if you’re already successfully operating out of another region.


There's a lot of software, iirc even Amazon's own dashboards, that simply defaults to us-east-1.


That's my favorite part of Amazon's console. That miniature heart attack you have when you ask "WHERE ARE ALL MY LAMBDAS AND DYNAMO INSTANCES?"

Then you realize that they just switched you back to us-east-1 for some reason and a wave of familiar relief washes over you.


Out of curiosity have you been building with the serverless framework? I'm curious what drives people to use dynamodb in particular


No, I really don't like any of the serverless or amplify type of frameworks that AWS produces.

What I needed was a distributed store for managing a large number of small configuration "files" and other state "links" across the infrastructure I built. In particular, I needed the ability for write conflicts to be detected and managed immediately, and for consistent reads to be available in some cases.

Looking at the size of data and number of transactions, DynamoDB on a per-request model was going to be significantly cheaper and easier than standing up a bunch of DB instances or other "fixed" infrastructure.

The ultimate model was to actually implement something like a POSIX filesystem on top of the "nosql" layer. In practice, I've been really happy with that decision, and it makes working on the code that interacts with this system very familiar and easy to understand. It's even got symlinks, acls, and automatically expiring advisory locks.
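The conflict-detection piece maps naturally onto DynamoDB's conditional writes. A minimal sketch, with a hypothetical table and versioning scheme (not necessarily the design described above):

    import {
      DynamoDBClient,
      PutItemCommand,
      ConditionalCheckFailedException,
    } from "@aws-sdk/client-dynamodb";

    const ddb = new DynamoDBClient({});

    // Overwrite a config "file" only if the stored version still matches the
    // version we read, so concurrent writers produce an immediate conflict.
    async function putConfig(path: string, body: string, expectedVersion: number) {
      try {
        await ddb.send(new PutItemCommand({
          TableName: "config-store",                     // hypothetical table
          Item: {
            pk:      { S: path },
            body:    { S: body },
            version: { N: String(expectedVersion + 1) },
          },
          ConditionExpression: "attribute_not_exists(pk) OR version = :v",
          ExpressionAttributeValues: { ":v": { N: String(expectedVersion) } },
        }));
      } catch (err) {
        if (err instanceof ConditionalCheckFailedException) {
          throw new Error(`write conflict on ${path}: re-read and retry`);
        }
        throw err;
      }
    }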


FWIW, mine defaults to Ohio. Has happened multiple times. IDK if it is geographic or what.


I used it because early on in the project I wanted to use features for IoT that were only available on us-east-1 initially, as well as lambda@edge which was on us-east-1 only at the time.


Instance types, for one. us-east-1 has all the latest instance types and more of them. We could not run some of our workloads in any other region.


Best latency from where almost 90% of my users are?


Features.

For example, do you want your Cloudfront CDN to have a custom (secure) domain?

Then you have to host your ACM cert in us-east-1.


If you’ve already set up your ACM cert, you won’t be impacted during an incident in us-east-1. It’s only if you need to make changes that you could be impacted.


It has the best latency from Chile and I think from other countries of South America


It's dirt cheap.

But I still prefer EU region =).


Anecdotal with n=2 sample, but GPU availability seems better in us-east-1.


Because $$$


Toast POS is down 100%, don't go out to lunch.


As a side note, I wonder if businesses won't even accept cash if they can't go through their POS system. If not, it's a shame that these modern internet connected POSs lock out stuff like that.


Depending on country and exact POS setup, they might not be able to take cash if POS is down.

For example, in Poland, your typical restaurant or shop needs to generate tax receipt (as well as properly calculate the tax), and uses either a separate receipt POS device, or POS with appropriate receipt printer (the devices are certified and for example do simultaneous two prints - one for client one for seller - or use digitally-signed storage for seller copy).

If the POS isn't designed properly to operate in case of network failure... welp, can't take cash either, at least not legally.


some restaurants have their owner or manager run square or stripe on their cell phone.


I was at a swim meet last week, and one of the food trucks was using Apple Tap to Pay because Toast didn't have a solution that worked for them, on site. After they finish up at an area, they then enter a single transaction for all of the day's business into Toast.


I don’t know about Stripe but Square used to offer the CC swipe dongle you can connect to your phone. Then process payments through their app


I can see some restaurants just comp’ing the tickets out and having toast foot the bill in lost sales


You say that as if it's as easy as sending Toast the bill and Toast just going "Yeah okay we'll pay".

When the POS system goes down, restaurants take down credit card numbers, and then charge them later when the POS comes back up.


Toast offline mode captures the credit cards and processes them later, no reason to turn away sales, it is just a hassle.


they're in a single region?


Netlify is down as well. https://www.netlifystatus.com


It's fun watching each service fail sequentially while the aws service dashboard just updates them to "Informational" status, whatever that means.

Even the management console is down, and their suggested region-specific workaround does not work, at least for us-east-1. I can see some processes via the API, but I don't have code prepared for monitoring every service from my local machine.


And now the service health page is down.


Finally an opportunity to test a full deploy from scratch, and restore from backup, in a new region.

I wonder if it will work first try? The true test of devops culture.


Good luck! My own attempt failed because SSO is down.


https://health.aws.amazon.com/health/status is showing it now. A lot of services are impacted.


It appears to be an outage in IAM which is trickling down to every service which relies on IAM auth.


But IAM is supposed to be Global, not us-east-1


All the regions are equal, but some regions are more equal than others.


In that it globally depends on us-east-1.


As I’ve said elsewhere, this is not accurate for existing IAM resources.


us-east-1 seems to be very 'special' compared to the other regions - I wonder if they will ever align it with the rest of them.


Yep, it has issues so frequently. I wonder how many companies/teams start using AWS and blindly choose us-east-1 without realizing what they're getting into.

<rant>

It's also quite annoying sometimes that some things _need_ to be in us-east-1, and if e.g. you are using Terraform and specify a different default region, AWS will happily let you create useless resources in regions that aren't us-east-1 that then mysteriously break stuff because they aren't in this one blessed region. AWS Certificate Manager (ACM) certificates are like this, I believe.

</rant>


ACM certificates themselves can be had in any region (and you can use them for stuff like ELBs), but since the Cloudfront control plane is in us-east-1, if you want Cloudfront (and IIRC, also if you want custom domain names for an S3 bucket, but don't quote me on that) you'll have to create an additional certificate in us-east-1.

Sigh.


I think a lot of companies just do everything there and pinky promise one day they'll go multi-region.


I worked with a devops person who moved everything we had set up in other regions _to_ US-East-1 because that is where you are supposed to run stuff. According to him, the other regions were just for DR stuff.


Surely not an AWS certified devops person? I don't think they teach mythology!


What's interesting is that I can still access my EKS cluster, but none of the deployments are "ready" that have LBs attached to them. Pods can create fine though!


I kicked off a Redshift cluster in every region, they've all run and completed, except for `us-east-1`, which is stuck creating the cluster. Been about an hour now.


Seems like it took IMDB with it. Surprised that Amazon is not able to keep their own property up when one of their zones goes down. Not a great example.


We're on use1 and haven't seen any degradation


I'm not sure it's the case here, but the issue with these cloud providers is they use their own services to maintain their infrastructure - that's why when something like Lambda gets degraded (which I wouldn't be shocked if they're using everywhere), you start to see random crap like the console and IAM go down as well.


Weird, I didn't notice the actual outage at all except high ping to a non-AWS server/IP on the west coast. Normally the latency is ~85ms, today it has been >170ms. SSH is basically unusable at that latency and even bandwidth is very low (not sure how that could be).


You'd think the largest cloud would be resilient to a zone going down, but I guess not.

I find it stupid that clients get to know about regions at all. They should only notice latency hits and batch-job queuing latency if something bad happens underneath, but no services should go down.


> At that time, we began processing the backlog of asynchronous Lambda invocations that accumulated during the event, including invocations from other AWS services. As of 3:37 PM, the backlog was fully processed

pretty cool that stuff was stored in a backlog and eventually processed!


I don't want to diminish anything, but in terms of dealing with queues, backlogs are pretty much a basic requirement from the beginning.


I am guessing they mean an SQS queue… so basically SQS is doing what it is supposed to:
- try to process the event
- send it to the DLQ if it fails
- there's a redrive button in the DLQ to… well, redrive the events after the lambda is fixed


Both Vercel and Netlify went down with it.

I wonder what % of the internet went down because of the us-east-1 today.


We are seeing console, codebuild, etc. access issues. Possibly all using Lambdas, foundationally?


the last big us-east-1 outage was ... DNS - and it's usually DNS or software-defined core networking causing these cascading failures

Loss of DNS causes inter-service API calls to fail, then IAM and all other services fail. Anything not built to handle those situations with backoff causes a 'stampeding herd' of failure/retry and exacerbates the outage

Review the AWS statements about outages here - https://aws.amazon.com/premiumsupport/technology/pes/
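A minimal sketch of that kind of backoff, capped exponential with full jitter (the constants and the retried call are illustrative):

    // Retry with capped exponential backoff plus full jitter, so thousands of
    // clients retrying at once don't stampede a recovering service.
    async function withBackoff<T>(call: () => Promise<T>, maxAttempts = 5): Promise<T> {
      const baseMs = 100;
      const capMs = 10_000;
      for (let attempt = 0; ; attempt++) {
        try {
          return await call();
        } catch (err) {
          if (attempt + 1 >= maxAttempts) throw err;
          const bound = Math.min(capMs, baseMs * 2 ** attempt);
          const delayMs = Math.random() * bound;              // full jitter
          await new Promise((resolve) => setTimeout(resolve, delayMs));
        }
      }
    }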


Other submission linking to AWS status page: https://news.ycombinator.com/item?id=36315441

And what a surprise it's US-EAST-1 again...


Is there an estimate for the time they will take to solve this?


2.5 hours, from start of incident.


I love how their status says that services are just degraded


They've just acknowledged degradation with lambda


Has anyone checked if Washington, DC is still there?


How many folks actually use multi-region deployments with automatic failover (e.g. latency based routing in route 53)?


The difficulty of making this work in practice is pretty high. It also isn't cheap, so I would guess not many.


Some parts are expensive (EC2, NAT and VPC endpoint proliferation) but others are simple and inexpensive (regional API Gateways, latency based routing, DDB global tables, lambdas, state machines).


Guesses, which one is it this time:

  * DNS
  * Misconfigured switch
It's always one of those two.


* Cert expiry


You can't even access the tools (web or CLI) in order to put your own system into maintenance mode...


The AWS status page is now down as well


https://health.aws.amazon.com/health/status seems to be working fine for me (been refreshing it every couple of minutes).


I posted a comment about AWS Media Convert down below, but it's back working for me.


Is this possibly why my Ring doorbell emitted a phantom chime about two hours ago at 4am?


Does anyone know if the entire region suffered the outage, or one/some AZs?



Can we find stats somewhere, something like number of outages by regions?


Maybe azure was hosting some services there, getting ddos'd and all.


It's always during the demos to the stakeholders, isn't it?


I'm so glad my demo today was specifically about local inference on... Windows. I guess I finally found an upside to doing ML outside Linux; we don't have Windows VMs on AWS!! :)


Most stuff is back up, but AWS MediaConvert still seems to be down.


I can't even update my billing information right now :/


Anyone have a link to the AWS summary / postmortem? Saw it on their status page the other day, but cannot find it anywhere now.


It appears to have been fixed and resolved for us.


s3 is down for us


Looks like it is a problem with lambdas.


anyone getting Gateway Time-out?


yep can't even get into console to diagnose / troubleshoot / fix


CLI works OK still.


eu-west-1 having some issue as well for me


can you elaborate?


getting Gateway Time-out


def broken for us


issues with STS and IAM galore #thisIsFine


b-but muh five nines of reliability...


Yes


Seems like it's time for end of the quarter AWS demos and someone got a little too eager for launch.


Tried to log on to OkCupid and it was down, I guess I should thank Jeff for taking me off the Skinner Box for a while.


No issues on our Linodes .


I'm sure aws re:Inforce starting has nothing to do with it... https://reinforce.awsevents.com/livestreams/


Had our login and other features go down less than an hour after I altered our prod scheme and thought I did something wrong, what a relief it was to see this


The second law of thermodynamics guarantees that there will always / eventually be downtime… and that’s okay. Design for downtime. Shameless self-plug for https://heiioncall.com/ for free website / HTTP endpoint monitoring and cron job monitoring if you want to know about your own app’s downtime.

EDIT: thanks to those of you who have signed up in the past few minutes! Let us know if you have any feedback.


"eventually" is doing a lot of work in that sentence. Our sun will last another 5 billion years, and the heat death is something like 100 trillion years away.


Sure :)

Though a lot of practical thermal-related causes of electronics failure seem to operate on timescales of years to decades, like electromigration https://en.m.wikipedia.org/wiki/Electromigration or even just cooling fan bearing failure. And I don’t think it would be a huge stretch to point to electromigration as a case of diffusion, a natural entropy increasing process, re-randomizing the arrangement of atoms within a transistor (and therefore making it fail eventually).



