"7:52 AM PST We are investigating Internet connectivity issues to the US-WEST-1 Region."
Edit: Found root cause, maybe?
"8:01 AM PST We have identified the root cause of the Internet connectivity to the US-WEST-1 Region and have taken steps to restore connectivity. We have seen some improvement to Internet connectivity in the last few minutes but continue to work towards full recovery."
"8:01 AM PST We have identified the root cause of the Internet connectivity to the US-WEST-2 Region and have taken steps to restore connectivity. We have seen some improvement to Internet connectivity in the last few minutes but continue to work towards full recovery."
someone tripped over the fiber run i bet. Or a cleaning person unplugged a router to plug in a vacuum (that actually happened, but to a minicomputer iirc)
nah man, it's never the digger that's the idiot. it's always the project manager that told the digger where to dig. just like it's never the dev's fault as the PM made them do it. /s
It's interesting that west-2 was quicker to create the incident (despite the issue starting a bit later there, at least by our experience), and while they both "identified" at the same time, west-2 also waited longer to call it resolved.
I assume there are different teams responsible for each, is the west-2 team just more on top of things?
2. They don't really have much "legacy" stuff to deal with, since they likely turn over racks quickly across their whole fleet and software deployments should be standardized, so any us-east-1 flakiness has to do with the fact that it's where Amazon often houses its control planes.
Yes, it could be network peering related. But there's definitely a lot of us-west-1 and us-west-2 users complaining and people saying that us-east-1 seems fine.
It's still there now, on the top of the page, just marked resolved:
us-west-1:
7:52 AM PST We are investigating Internet connectivity issues to the US-WEST-1 Region.
8:01 AM PST We have identified the root cause of the Internet connectivity to the US-WEST-1 Region and have taken steps to restore connectivity. We have seen some improvement to Internet connectivity in the last few minutes but continue to work towards full recovery.
8:10 AM PST We have resolved the issue affecting Internet connectivity to the US-WEST-1 Region. Connectivity within the region was not affected by this event. The issue has been resolved and the service is operating normally.
us-west-2:
7:43 AM PST We are investigating Internet connectivity issues to the US-WEST-2 Region.
8:01 AM PST We have identified the root cause of the Internet connectivity to the US-WEST-2 Region and have taken steps to restore connectivity. We have seen some improvement to Internet connectivity in the last few minutes but continue to work towards full recovery.
8:14 AM PST We have resolved the issue affecting Internet connectivity to the US-WEST-2 Region. Connectivity within the region was not affected by this event. The issue has been resolved and the service is operating normally.
That is a shame. Anyone coming in after the fact to investigate an outage or glitch with their systems will need to look harder to find a known AWS outage. We can’t assume everyone looks at HN.
I thought that sounded ridiculous so I did the math: 99.9999999% uptime allows for about 31.5 _seconds_ of downtime every 1,000 years. It would take roughly 114,000 years to accrue just an hour's worth of allowable downtime. Within a single quarter of a year, the budget is about 7.9 ms, a small fraction of a single blink of an eye [2] and comparable to a single ping round-trip to my ISP, let alone Amazon's servers.
So yeah, having done that I now understand that it was probably a joke but it really puts into perspective just how ridiculous things can get with a few 9's.
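The arithmetic is a couple of lines to check for any number of nines:

```python
# Downtime budget for N nines of availability, e.g. N=9 -> 99.9999999%.
def allowed_downtime_seconds(nines: int, period_seconds: float) -> float:
    unavailability = 10 ** -nines   # 9 nines -> 1e-9
    return period_seconds * unavailability

YEAR = 365.25 * 24 * 3600           # seconds in an average year

print(allowed_downtime_seconds(9, 1000 * YEAR))  # ~31.6 s per millennium
print(allowed_downtime_seconds(9, YEAR / 4))     # ~7.9 ms per quarter
```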
User reports — i.e. the number of people who google “is X down” and then click a Down Detector link.
It’s a clever way of getting reasonably accurate data very quickly and easily, though it does have its flaws — the data is pretty noisy and users often attribute outages to the wrong service (e.g. blaming their ISP or Microsoft or something when YouTube is down, or vice versa).
I would guess the user is asking what Down Detector's dependencies are... E.g. can their website function if us-east-2 goes down? Or a GCP equivalent? Or are they on a self-hosted server? What would cause the metrics to be "off"?
They really need to stop requiring SVPs or higher to show non-green status on the status page, as other HNers have revealed in last week's AWS post. It's effectively not a status page, and they could probably be sued if it can be demonstrated that X service was down but the status page showed green (since the SLA is based on status page). Should be automated and based on sample deployments running in every region and every service. And they should use non-AWS instances to do the sampling, so they can actually sample when, say, we experience the obligatory black friday us-east-1 outage every year.
I think SVP / GM approval is only needed for yellow / red status. From my time in AWS Support, the Support Oncall and Call Leader / GM delegate worked to approve green-i posts.
If my app won't run for reasons that are not my fault for longer than the SLA guarantees, the affected services should be at least yellow status and I should be accumulating free AWS credits.
With some lame ass tiny blue "connectivity issues" informational text. Surely broken routing to two entire DCs is full red for all services available therein?
Like what, the networking is broken but if you could send packets, the services would still work so they are green?
I was still able to reach our service running in us-west-1 when the connectivity issue was still on-going, so I don't know if it was a full interruption.
I thought status pages and health pages were designed to automate reporting and check status automatically. That was my impression when I came across those pages. Apparently it's not automated and they only update it manually. What is the point of having a status page if it isn't automated? I'm sure FAANG and other tech conglomerates don't want it automated because of SLAs.
I'm surprised FAANG companies host their stuff on competitors' cloud services without providing a fallback cloud service in case the primary goes down. Sure it costs money, but it would be more effective than putting all their eggs in one basket.
As stated earlier, AWS has financial incentive to not update the status page. Nobody is willing to call them on the conflict of interest in a meaningful, market-changing way.
What if everyone has a financial incentive to lie? (They do.) Where do we go then? Also, saying "everyone just leave" is a lot easier than everyone just leaving, but that's a tired and repeated point. There's a huge mess and tangle of incentives and drawbacks, and I don't know if we'd ever get enough support to weed out a service that gets us above the nth percentile of greatness. As one falls, the other will begin to abuse its power; I don't trust any megacorp to do otherwise. Do you?
Any public communication is handled by people, not machines. No one wants to make an automated status page because there's a shit ton of real noise that users don't need to hear about, and there are a lot of outages that automation won't accurately catch.
True, if it is down, then that means AWS is down (not necessarily, obviously). :D But honestly, if they want to monitor AWS, they gotta pick something else for this reason, something that is not down when AWS is.
But we're talking about a status page, which should be basically static. In its simplest form you need a rack in 2+ random colos and a few people to manage the page update framework. Then you make teams submit the tests that are used to validate SLA. Run the tests from a few DCs and rebuild the status page every minute or two.
Maybe add a CDN. This shit isn't rocket science and being able to accurately monitor your own systems from off infrastructure is the one time you should really be separate.
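A sketch of what that loop could look like, with hypothetical endpoints and file paths (nothing here is AWS's actual tooling):

```python
import urllib.request

# Hypothetical SLA probes; in practice each team would submit its own tests.
ENDPOINTS = {
    "api": "https://api.example.com/health",
    "web": "https://www.example.com/",
}

def probe(url: str, timeout: float = 5.0) -> bool:
    """Return True if the endpoint answers with HTTP 2xx within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except OSError:
        return False

def render_status(results: dict) -> str:
    """Rebuild the static status page body from the latest probe results."""
    rows = [f"<li>{name}: {'UP' if ok else 'DOWN'}</li>"
            for name, ok in sorted(results.items())]
    return "<ul>\n" + "\n".join(rows) + "\n</ul>"

def rebuild_once(path: str = "status.html") -> dict:
    results = {name: probe(url) for name, url in ENDPOINTS.items()}
    with open(path, "w") as f:
        f.write(render_status(results))
    return results

# In practice: run rebuild_once() every minute or two (cron/systemd timer)
# from a couple of independent colos, with a CDN in front of status.html.
```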
They have a bazillion alexa and kindle devices out there that they could monitor from, heh heh. At least let that phone-home behaviour do something useful, like notice AWS is down.
AWS wouldn't monitor itself from a competitor, of course
Why not? The big tech companies use each other all the time.
For example, set up a new firewall on macOS and you can see how many times Apple pulls data from Amazon or Azure or other competitors' APIs and services.
But the idea that Amazon or Microsoft or Google would host anything at apple is pretty out there.
Apple uses their competitor's services because they can't build their own cloud and host their own shit. The big boys don't use competitors for services they are capable of building themselves.
A similar reason drives businesses to host `status.product.bigcorp` on a different server. And if your product is a cloud then your suggestion makes sense.
Considering the sea of bright green circles, reds might stand out but blues get lost in a fast scroll. Perhaps fade or mute the green icon to improve visibility of non-green which is the interesting information?
Now that they're back up they're not reporting any problems, how is it supposed to work? It looks like it is just repeating the status reported on the Amazon status page.
I wonder if AWS will make more or less money from these outages?
Will large players flee because of excessive instability? Or will smaller players go from single-AZ to more expensive multi-AZ?
My guess is that no-one will leave and lots of single-AZ tenants who should be multi-AZ will use this as the impetus to do it.
Honestly, having events like this is probably good for the overall resilience of distributed systems. It's like an immune system, you don't usually fail in the same way repeatedly.
We (Netflix) begged them for years to create a Chaos Monkey that we could pay for. There were things we just couldn't do ourselves, like simulate a power pull or just drop all network packets on the bare metal. I guess not enough people asked.
CMaaS sounds amazing for resiliency engineering. There's so much I want to be doing to perturb our stack, but I don't know all the ways stuff can go wrong. Sure I can ddos it, kick services and servers offline, etc, but that's what, a few dozen failure modes? Expertise in chaos would be valuable by itself. Not to mention being able to shake parts of the system I normally can't touch.
Side note: terraform is pretty good for causing various kinds of chaos, deliberately or otherwise.
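The simplest in-process flavor of this kind of fault injection can be sketched in a few lines; this is a toy illustration, not how Chaos Monkey (which kills instances, not function calls) actually works:

```python
import functools
import random

def chaotic(failure_rate: float = 0.05, exc: type = ConnectionError):
    """Wrap a function so it randomly fails, simulating a flaky dependency.

    Toy illustration only: real chaos tooling perturbs instances, networks,
    and power, which (as the parent notes) you often can't do yourself.
    """
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            if random.random() < failure_rate:
                raise exc(f"chaos: injected failure in {fn.__name__}")
            return fn(*args, **kwargs)
        return wrapper
    return decorator

# Hypothetical service call with a 5% injected failure rate.
@chaotic(failure_rate=0.05)
def fetch_profile(user_id: int) -> dict:
    return {"id": user_id, "name": "example"}
```

Running regular traffic through wrappers like this forces retry and fallback paths to get exercised continuously instead of only during real incidents.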
If my company is any indication, they're going to make more money since everyone will simply check the multi-AZ or multi-region checkboxes they didn't before and throw more money at the problem instead of doing proper resiliency engineering themselves.
It doesn't matter how much resiliency engineering you do: having everything in a single AZ is a risk. If that's acceptable, fine; if not, you need to think multi-AZ from day one.
Auth0 ran in six AZs in two regions[1] and went down today[2], because they picked the wrong two regions. How many regions and AZs should someone pay for before they get reliability?
At a minimum they should have chosen regions not in the same time zone or general geographic area. us-west-1 and us-west-2 might well safeguard against a server failure, but that is not a disaster plan. If your customers are global, choosing multiple continents is probably prudent.
No one just "moves off" AWS. Once your apps are spaghetti coded with lambdas, buckets and all sorts of stuff, it's basically impossible to get off. More than likely, as you noticed, it will increase spending since multi-AZ/multi-region will become the norm.
No -- if they needed to, they already would have migrated to a multi-region setup. If they don't need it, they won't have. The reason is simple -- it's expensive, as you say. I'm not a fanboi or evangelist of AWS either -- I do have pet theories that they gave their products shit names in order to make more money by making AWS skills less transferable to Google Cloud etc. S3 should be Amazon FTP, RDS should be Amazon SQL, etc.
Not at all the case. It was a regional outage that got Netflix to more than double our AWS spend going multi-region, so that outage netted them millions of extra dollars per year just from Netflix.
You’re underestimating the ability of eng leadership to not take these issues seriously. Only when there's sufficient pressure from the very top, or even from customers, does it take priority.
> There is no possibility that outages are good for AWS.
Do you know how many non-technical CEOs/boards/bosses have told their tech people that they need to go multi-region/cloud because that's what the one-paragraph blog and/or tweet told them to do in response to last weeks event?
This outage is extremely frustrating to me. My company hosts all our apps in gov cloud. Gov Cloud West 1 is also down, but the AWS Gov Cloud status page indicates that everything is healthy and green. I thought AWS's incident response to the East outage last week was that they'd update the status page to better reflect reality.
Down Detector doesn't really detect anything other than people saying "Is [service X] down?" on Twitter. If you believe them, Xbox Live is permanently offline, because the typical Xbox Live user will declare that anything from tripping over their ethernet cable to a tornado levelling their house counts as Xbox Live being down.
It’s still useful if you remove units from the graph and treat it as a sparkline. If there are reliably ~100 Xbox Live complaints on Twitter per hour, then suddenly there are 3000, that’s an outage.
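That heuristic is easy to sketch; the baseline counts and the sigma threshold below are made-up illustrations, not anything Down Detector publishes:

```python
from statistics import mean, stdev

def is_outage(history: list, current: int, sigma: float = 4.0) -> bool:
    """Flag an outage when the current report count is far above the baseline.

    `history` is recent hourly complaint counts; `sigma` is an assumed
    sensitivity threshold chosen for illustration.
    """
    if len(history) < 2:
        return False
    baseline, spread = mean(history), stdev(history)
    # max(..., 1.0) keeps a flat history from making the detector hair-trigger.
    return current > baseline + sigma * max(spread, 1.0)

# ~100 complaints/hour is the normal noise floor; 3000 is clearly a spike.
normal_hours = [95, 110, 102, 98, 105, 99, 101, 108]
assert not is_outage(normal_hours, 115)
assert is_outage(normal_hours, 3000)
```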
Is it bad that I can almost see that being a quick and dirty MVP to get out the door while you built your own cloud solution?
Raises serious migration and cost issues, but... would be interesting.
I think for some targeted things there might well be "value added" services you could offer to transparently wrap AWS. A "write-through" S3 wrapper was something I was actually looking at. Some clients, back when I was contracting, were very reluctant to trust anything but AWS for durability, but at the same time AWS bandwidth costs were so extortionate that renting our own servers from somewhere like Hetzner broke even at quite a small number of terabytes transferred each month: proxy writes both to a local disk and to S3, serve from the local disk, and fall back to pulling a fresh copy from S3 if it's missing.
The nice part about something like that is that, properly wrapped, you can change your durable storage as needed, and can even selectively pick "cheaper but less trusted" options for less critical data. It also lets you leverage AWS features to ride closer to the wire. To take another example than storage, I've used this to cut the cost of managed hosting by being able to spill over onto EC2 instances, allowing you to run at a much higher utilisation rate than you safely can on managed / colo / on-prem servers alone. Ironically, the ability to spill over onto EC2 makes EC2 far less competitive in terms of cost to actually run stuff on most of the time.
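The write-through pattern described above can be sketched with dict stand-ins for the local disk and S3 (a real version would use the filesystem and boto3):

```python
class WriteThroughStore:
    """Write to cheap local storage and a durable backend; read locally,
    falling back to the durable copy on a miss.

    `local` and `durable` are any dict-like stores; in the scenario above
    they would be a rented server's disk and S3 respectively.
    """
    def __init__(self, local, durable):
        self.local = local
        self.durable = durable

    def put(self, key: str, value: bytes) -> None:
        self.durable[key] = value   # durability first
        self.local[key] = value     # then the cheap serving copy

    def get(self, key: str) -> bytes:
        try:
            return self.local[key]
        except KeyError:
            value = self.durable[key]   # miss: pull a fresh copy
            self.local[key] = value
            return value

store = WriteThroughStore(local={}, durable={})
store.put("a.txt", b"hello")
store.local.pop("a.txt")               # simulate losing the local copy
assert store.get("a.txt") == b"hello"  # transparently refilled from durable
```

Because the durable backend is behind an interface, swapping S3 for a cheaper provider (or for per-key tiering) stays a local change.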
That's not how DownDetector works. It just relies on reports from users. The real failure case is users not understanding why they can't access whatever end service. Maybe they blame that service, maybe they blame their ISP, maybe they blame something else.
AWS seems to be working for me, but I’ve worked with clients in the US and spectrum internet tended to drop connections to us sporadically, which looks like an outage to our clients but is something we obviously can’t control.
If it's a network issue, it's on their side. I've verified from centurylink, comcast, cogent, he.net, at&t, and verizon - all of them are having issues. This isn't like: Cox is having an outage and just can't get to AWS.
(This is two similarly spec'd boxes on us-east-2 and us-west-2). Looking at GeoIP of connecting clients, the only pattern I can see is the region itself.
I'm wondering the same thing. We have stuff hosted in us-west-2 and multiple people across the US are reporting that our systems are down, however our system is working fine for me here, which is near Toronto.
The east-1 downtime was the interconnection between AWS hosted services, including the control plane, so most resources not dependent on AWS APIs stayed up (eg. non-autoscaled EC2 instances).
I wonder if you really dug into most company's tech stacks, how many of their support tools (e.g., PagerDuty) are reliant on overlapping cloud providers.
Oh man, it is insane. During the aws incident last week we couldn't build software because bitbucket pipelines were all down, due to them running lambdas in us-east-1 only haha.
We've taken a massive turn away from a "decentralized" internet.
Yeah, a number of people got hit by that. Louis Rossmann found out that every form of contact to his business was reliant on AWS us-east-1. https://www.youtube.com/watch?v=DE05jXUZ-FY
I'm so glad that I'm not still the CTO of a startup. I would be getting dozens of e-mails from people without engineering backgrounds asking "Are we multi-cloud", "why didn't you make us multi-cloud"?
The response is that this actually works well enough that no one has been pushed to make the investment required (meaning building the core infrastructure to make multi-cloud easy).
There was a brief period of time back in the early 90's where I felt I understood how Linux worked -- the kernel, startup scripts, drivers, processors, boot tools, etc... I could actually work on all levels of the system to some degree. Those days are long gone. I am far removed from many details of the systems I use today. I used to do a lot of assembly programming on multiple systems. Today I am not sure how most of the systems works in much detail.
To an extent, this is one of the goals, to free up engineers to work on higher level things. Whether it meets that goal in some cases is debatable, and it’s certainly not ideal for us engineers who like to get to the bottom of things.
“working on higher level things” currently implies that depending on many layers of opaque and unreliable lower level hardware and software abstractions is a good idea. I think it is a mistake.
The best conclusion I can come to is "sometimes it works, sometimes it doesn't". Depends on the context. I've seen cases where it works great and other times where it's a huge hassle.
Funny, I feel the exact opposite way. The low level stuff is where all the magic happens, where performance improvements can scale by orders of magnitude rather than linearly with a CTO’s budget. I’d much rather figure out how to condense some over-engineered distributed solution down to one machine with resources to spare.
Every time GitHub went down, multiple people posted on HN saying "ever since they were bought by Microsoft, ...". As annoying as those Rust evangelists on every single memory corruption bug.
I could have written the OP message a year ago -- I used to feel the same way.
Plz don't disparage Rust evangelism!
Rust is awesome. Yes, it is complex, frequently annoying, easy to learn, difficult to master. I'm speaking from a 30-year dev career.
A few months ago I intended to do a quick investigation into Rust to validate my "I really don't need to learn this", specifically for an embedded project. Within a few hours I found I had become a zealot. Rust has so many "omg, I should tell everybody about this" behaviors that I can't even pick my favorite aspect yet.
It's equivalent to a lost soul finding Christianity and accepting the lord's blessing and forgiveness! The weight that is lifted when your sins are forgiven == no more guilt, it's all forgiven! An immediate reduction of cognitive dissonance. In this example with Rust, it's pointer tracking and memory management, but it's basically the same thing. Rust is for the pious developer.
Those people who are still using C++ for fresh starts are the same folks who love to do things the hard & wrong way, or at least those who don't know any better, infidels, unwashed heathen.
While I'm not sure whether you're serious ;), to be clear: what annoys me is they don't really understand why we are having "someone pwned your phone via a series of memory corruption bugs" daily.
Until those Rust evangelists manage to rewrite the world in Rust (and I promise you there will still be a lot of security bugs), we still have to fix our shit in a low-cost way, and their evangelism does not help at all and is pure annoyance.
Taking badges out of the cloud reduces points of failure by several orders of magnitude.
Cloud-based badges make sense if you have locations with small staffs and no HR people or managers. Like if you're controlling access to a microwave tower on the top of a mountain.
But badges-in-the-cloud for an office building full of people who are being supervised by supposedly trusted managers, and all of whom have been vetted for security and by HR, is just being cheap.
Like the 1980's AT&T commercials used to say: "You get what you pay for."
> Taking badges out of the cloud reduces points of failure by several orders of magnitude.
I'm not convinced that's true, or at least certainly not an order of magnitude. Wouldn't a badge system hosted on-prem also need a user management system (database), a hosted management interface, have a dependency on the LAN, and need most of the same hardware? Such a system would also need to be running on a local server(s), which introduces points of failure around power continuity/surges, physical security, ongoing maintenance, etc.
All of those things would also be needed by the cloud provider, too. Just because it's on-prem doesn't mean it doesn't need servers, power conditioning, physical security, etc. "Cloud" isn't magic fairies. It's just renting someone else's points of failure.
In addition, you're forgetting the thousands of points of failure between the building and the cloud provider. Everything from routers being DDOSed by script kiddies to ransomware gangs attacking infrastructure to Phil McCracken slicing a fiber line with his new post hole digger.
The remote solution requires all of those same things, plus in addition it requires internet connectivity to be up and reliable, the cloud provider be available and the third party company be up and still in business.
Adding complexity and moving parts never reduces points of failure. It can reduce daily operating worries as long as everything works, but it can't reduce points of failure. It also means than someday when it breaks, the root causes will be more opaque.
Within the building’s on premise hosted infrastructure, are they going to buy multiple racks and multiple servers spread far enough apart so that there aren’t many single points of failure that will bring the badge machine down if they fail?
Many logical people have decided to abstract away their soul-crushing anxieties and legal gray area during outages to incredibly stable and well-staffed cloud infrastructure providers.
If you and your team are better at taking care of hardware than an entire building full of highly paid engineering specialists, then that's cool for you, but also, no you're not.
That's not to say you're not capable of running on-prem hardware that is stable.
I'm just saying that the high-handed swiping away of everyone else who's made an incredibly safe and logical decision to host their stuff in the cloud makes me question your general vibe.
> If you and your team are better at taking care of hardware than an entire building full of highly paid engineering specialists
The trade offs aren't quite that simple. Those specialists are necessary because they're building and maintaining infrastructure that's extremely complex since it has a crazy scale and has to be all things to all people. When you're running in-house, your infrastructure is simpler because it's custom tailored to your specific requirements and scale.
There are tradeoffs that make cloud vs local make sense in different contexts and there's no one right answer.
If you plan to replicate all of AWS I'd agree with you. But if all you need is a handful of servers, you could end up with better uptime doing it in-house just because you don't have all the moving parts that make AWS tick, reducing the chance for something to go wrong.
My bare-metal servers stayed up during both of the recent outages, not because I'm some kind of genius that's better than the AWS engineers but just because it's a dead simple stack that has zero moving parts and my project doesn't require anything more complex.
There is absolutely no reason for a local device (like a door lock or dishwasher as per OP) to depend on any external connectivity. Not to the company on-prem hardware, not to AWS.
Yep, it's broken again. I was trying to install some Thunderbird extensions, and stuff started breaking halfway through. Never thought of an AWS outage borking my mail client I guess...
I hear they do get people who want to be able to get experience at AWS's scale; there are only a few places for that.
The thing that really gets me is the reports from the last major outage a few days ago about how pervasive lying inside the company is. This really doesn't work well for engineering and we're possibly seeing the results of that. We should certainly expect to see that becoming visible the more time goes on without a major cultural shift. Which given that the guy who ran AWS now runs all of Amazon.com....
Multi-cloud is such an odd idea to me. You're either building abstractions on top of things like cloud-provider specific implementations of CDNs, K8S, S3, Postgres, etc...or using the cloud just for VMs. The latter would be cheaper with just old-school hosting from Equinix, Rackspace, etc. The former feels like a losing battle.
It’s prompted discussions of building multi regional services in my org but not multi cloud. They would have to really really really screw up for that to happen… maybe be down for like a week or something.
Reminder that the internet was literally invented to survive this kind of centralized failure (nuclear attack, originally). But I guess people are herdish animals and prefer to die as a group.
More like ultimately all these companies buy into a certain form of vendor lock-in, and they have no competence or willingness to migrate or even consider the competition. It starts with "oh, I'm just renting a remote virtual server" and in no time it's "oh, all my stack is tied to AWS proprietary products", because convenience. That's what Amazon wants.
Takes minutes to update a CloudFront distribution (they say around 5 minutes in their blog post from last year when speed was improved [1]). I think they might want to be able to change it to "everything's back to normal" in an instant, based on the SLA argument I've seen thrown around last time an AWS region was down.
I'm seeing outages on us-west-2 too. Customer facing traffic being served through Route53 -> ALB -> EC2 is down and CLI tools are failing to connect to AWS too.
I'm adding "reliable" into that mix. Too bad they're too expensive and hard to set up for side projects, but HN is probably one of the most stable sites I frequently visit, and I don't even think about it.
I disagree that they're expensive. Expensive to own maybe, but you can rent them on a monthly basis from something like Hetzner or OVH for a fraction of the cost of AWS (especially when you include bandwidth which is free and unmetered in this case) and they handle hardware maintenance for you.
Hard to set up is relative. It all depends on what you're doing and how much reliability you need. For a side project or a dev server you can just start with Debian, stick to packaged software (most language runtimes and services such as Postgres or Redis are available) as much as possible, and call it a day. You can even enable auto-updates on such a stable distro.
The knowledge you'll gain by dealing with bare-metal is also going to be useful in the cloud even in container environments.
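As a sketch, that whole setup is a handful of commands on a fresh Debian box (package names as in current Debian stable; adjust to taste):

```shell
# Install services straight from Debian's repos; they ship with sane
# defaults and systemd units, so they're running after install.
apt-get install --yes postgresql redis-server nginx

# Enable unattended security updates so the box patches itself.
apt-get install --yes unattended-upgrades
dpkg-reconfigure -plow unattended-upgrades
```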
HN definitely gets overloaded at times, including during big outages when everyone stampedes here. I got a bunch of "sorry, we can't serve your request" a little while back.
No; a copy of the HN database is synced regularly to Firebase (https://github.com/HackerNews/API), but IIRC the site itself runs on a single process on a single machine with a standby ready.
Tangentially related: On Friday Backblaze and B2 were down for 10+ hours to update their systems for the log4j2 vulnerability. Seemed noteworthy for the HN crowd and I posted a link to their announcement when the outage began. However, the post was quickly flagged and disappeared. Genuinely curious, why is announcing some outages ok and others not?
What would be the ratio of HNers who are Backblaze customers vs those who are AWS customers? I bet the Backblaze number is small enough that Backblaze employees on HN can downvote you enough for it to matter.
Multi-region is difficult and expensive, and a lot of projects aren't that important. Most of our infrastructure just isn't that vital; we'd rather take the occasional outage than spend the time and money implementing the sort of active-active multi-region infrastructure that a "correct" implementation would use. We took the recent 8 hour us-east-1 outage on the nose and have not reconsidered this plan. It was a calculated risk that we still believe we're on the right side of. Multi-AZ but single-region is a reasonable balance of cost, difficulty, and reliability for us.
I have some services which can cope with a 98.5% downtime, as long as they are available the specific 1.5% of the time we need them to run, as such "the cloud" is useless for that service
Right when you really want your thing to be up and can't amortize hours of continuous downtime, the cloud has no solution for this. That's something that often gets left out of the sales pitches tho =)
Depends on how critical they are to your stack. IME if you use more than a few products and any one of them can take you down, yeah, it's less than three nines. Just something to ponder, but if S3 didn't meet 99.9 for the month you get a whopping 10% back. Other cloud vendors aren't much better at this (actually worse). Not to mention that you need to leave some room for your own fuckups.
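For scale, the credit math looks like this; the tiers below mirror commonly published object-storage SLAs but are assumptions here, so check your actual agreement:

```python
def monthly_uptime_pct(downtime_minutes: float, days_in_month: int = 30) -> float:
    """Uptime percentage for the month, given total downtime in minutes."""
    total = days_in_month * 24 * 60
    return 100.0 * (total - downtime_minutes) / total

def service_credit_pct(uptime_pct: float) -> int:
    """Assumed credit tiers: <99.9% -> 10%, <99.0% -> 25%, <95.0% -> 100%."""
    if uptime_pct < 95.0:
        return 100
    if uptime_pct < 99.0:
        return 25
    if uptime_pct < 99.9:
        return 10
    return 0

# An 8-hour outage in a 30-day month:
uptime = monthly_uptime_pct(8 * 60)
print(round(uptime, 2), service_credit_pct(uptime))  # -> 98.89 25
```

Even a full workday of downtime only crosses the 25% tier, and the credit applies to that one service's bill, not to the business you lost while it was down.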
IDK, don't you end up with a bunch of extra costs? Like you're going to literally pay more money because now you have cross region replication charges, and then you're going to pay a latency cost, and then you may end up needing to overprovision your compute, etc.
All to go from, idk, 99.9% uptime to 99.95% (throwing out these numbers)? The thing is when AWS goes down so much of the internet goes down that companies don't really get called out individually.
You're saying that as if it's a walk in the park to set up and not cost prohibitive, in terms of opportunity cost and budget, especially for smaller companies.
Right. Downtime (or perception of downtime) is bad for business, so AWS is surely working to improve reliability to avoid more black eyes on their uptime. But at the same time, an AWS customer might be considering multi-region functionality in AWS to protect themselves ... from AWS making a mistake.
As a customer, it's unclear what the right approach is. Invest more with your vendor who caused the problem in the first place, or trust that they'll improve uptime?
An honest question. Why do you guys use AWS instead of dedicated servers? It's terribly expensive in comparison, nowadays equally complex, scalability is not magic and you need proper configuration either way, plus now the outages become more and more common. Frankly, I see no reason.
Once you have committed to a certain way of doing things, the transition costs can be very high.
Let's consider RockCo and CloudCo. They both provide a B2B SaaS that is mostly used interactively during the working day, and mostly via API calls for the rest of the working week. Demand is very much lower on weekends. Both RockCo and CloudCo were founded with a team of six people: a CEO who does sales, a CTO who can do lots of technology things, three general software developers, and one person who manages cloud services (for CloudCo) or wrangles systems and hosting (for RockCo).
In the first year, CloudCo spends less on computing than RockCo does, because CloudCo can buy spot instances of VMs in a few minutes and then stop paying for them when the job is done. RockCo needs a month to significantly change capacity, but once they've bought it, it is relatively cheap to maintain.
In the second year, they are both growing. CloudCo buys more average capacity, but is still seeing lots of dynamic changes. RockCo keeps growing capacity.
In the third year, they're still growing. CloudCo is noticing that their bills are really high, but all of their infrastructure is oriented to dynamic allocation. They start finding places where it makes sense to keep more VMs around all the time, which cuts the costs a little. RockCo can't absorb a dynamic swing, but their bills are now significantly lower every month than CloudCo's bills, and the machines that they bought two years ago are still quite competitive. A four year replacement cycle is deemed reasonable, with capacity still growing. And bandwidth for RockCo is much cheaper than the same bandwidth for CloudCo.
Who's going to win?
Well, you can't tell. If they both got unexpectedly sudden growth surges, RockCo might not have been able to keep up. If they both got unexpected lulls, CloudCo might have been able to reduce spending temporarily. RockCo spent more up front but much less over the long term. CloudCo could have avoided hiring their cloud administrator for several months at the beginning. RockCo's systems and network engineer is not cheap. And so on, and so forth.
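The comparison can be turned into a toy cost model; every figure below is invented for illustration, not real pricing:

```python
# Toy three-year cost model for the two hypothetical companies above.
# All figures are assumptions for illustration only.

def cloudco_cost(year: int) -> float:
    """Pay-as-you-go: low up-front, bills grow with usage."""
    base = 60_000                       # assumed first-year cloud bill
    return base * (1.8 ** (year - 1))   # assumed growth as usage scales

def rockco_cost(year: int) -> float:
    """Buy hardware up front, then cheaper steady-state operation."""
    capex = 150_000 if year == 1 else 30_000  # big initial buy, then expansion
    opex = 40_000                              # colo, bandwidth, engineer time
    return capex + opex

# Year 1: cloud is cheaper; by year 3 the curves have crossed hard.
for year in (1, 2, 3):
    print(year, cloudco_cost(year), rockco_cost(year))
```

Under these made-up numbers the answer flips partway through year two, which is exactly why "who wins" depends on growth surprises rather than on either model being universally right.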
https://downdetector.com/status/aws-amazon-web-services/