Tell HN: AWS appears to be down again
879 points by thadjo 41 days ago | 468 comments
Anyone else seeing this?

I checked their health status page. All is good. /s


They did add an update, faster than last time:

"7:42 AM PST We are investigating Internet connectivity issues to the US-WEST-2 Region."


Edit: They added US-WEST-1:

"7:52 AM PST We are investigating Internet connectivity issues to the US-WEST-1 Region."

Edit: Found root cause, maybe?

"8:01 AM PST We have identified the root cause of the Internet connectivity to the US-WEST-1 Region and have taken steps to restore connectivity. We have seen some improvement to Internet connectivity in the last few minutes but continue to work towards full recovery."

"8:01 AM PST We have identified the root cause of the Internet connectivity to the US-WEST-2 Region and have taken steps to restore connectivity. We have seen some improvement to Internet connectivity in the last few minutes but continue to work towards full recovery."

Too bad I am unable to load the status page due to connection timeouts, so I can't see the updates.

someone tripped over the fiber run i bet. Or, a cleaning person unplugged a router to plug in a vacuum (that actually happened but to a minicomputer iirc)

Unfortunately the vacuum, a shiny IoT connected appliance, didn't work because AWS was down

Usually the problem is "an idiot with a digger".

nah man, it's never the digger that's the idiot. it's always the project manager that told the digger where to dig. just like it's never the dev's fault as the PM made them do it. /s

No way a cleaning person can do that in a datacenter.

I hope that their infra is not that unstable

It's interesting that west-2 was quicker to create the incident (despite the issue starting a bit later there, at least by our experience), and while they both "identified" at the same time, west-2 also waited longer to call it resolved.

I assume there are different teams responsible for each, is the west-2 team just more on top of things?

West-2 also launched many years after us-east-1, so less legacy to deal with.

1. US-East-1 wasn't involved today.

2. They don't really have much "legacy" stuff to deal with, since they likely turn over racks quickly across their whole fleet and software deployments should be standardized, so any US-East-1 flakiness has to do with the fact that it's where Amazon often houses its control planes.

There's at least one AZ in East-1 that doesn't support nitro, and that's been around for 4ish years now...

I agree in principle, but clearly something is hobbling them because of (probably) legacy stuff

The issue is not specific to the US, same issues in Europe. Also, it seems not only AWS experiencing issues. Unless Google is hosted on AWS haha...

Yes, it could be network peering related. But there's definitely a lot of us-west-1 and us-west-2 users complaining and people saying that us-east-1 seems fine.

Seems to be resolved now. And it seems they hid / took away any mention of possible issues. Sigh.

It's still there now, on the top of the page, just marked resolved:


7:52 AM PST We are investigating Internet connectivity issues to the US-WEST-1 Region.

8:01 AM PST We have identified the root cause of the Internet connectivity to the US-WEST-1 Region and have taken steps to restore connectivity. We have seen some improvement to Internet connectivity in the last few minutes but continue to work towards full recovery.

8:10 AM PST We have resolved the issue affecting Internet connectivity to the US-WEST-1 Region. Connectivity within the region was not affected by this event. The issue has been resolved and the service is operating normally.


7:43 AM PST We are investigating Internet connectivity issues to the US-WEST-2 Region.

8:01 AM PST We have identified the root cause of the Internet connectivity to the US-WEST-2 Region and have taken steps to restore connectivity. We have seen some improvement to Internet connectivity in the last few minutes but continue to work towards full recovery.

8:14 AM PST We have resolved the issue affecting Internet connectivity to the US-WEST-2 Region. Connectivity within the region was not affected by this event. The issue has been resolved and the service is operating normally.

That is a shame. Anyone coming in after the fact to investigate an outage or glitch with their systems will need to look harder to find a known AWS outage. We can’t assume everyone looks at HN.

Practice makes perfect

So it is down again.

Ok, so it can't be down then. This is proof!

Yep, when it loads, it's all green. "nine nines!!!"

I thought that sounded ridiculous so I did the math and 99.9999999% uptime allows for 1.314 _seconds_ of downtime every 1000 years. It would take approximately 2.7 million years to acquire just an hour's worth of allowable downtime; that's how long it takes light to reach us from the second nearest spiral galaxy, the farthest object visible to the eye in perfect conditions [1]. Within a single quarter of a year, that's 328.5 μs (microseconds), or about 1/1200 of a blink of an eye [2], or about 3 times faster than a typical electric capacitor camera flash [3], and also, interestingly enough, less than 1% of my current ping to my ISP, let alone Amazon's servers.

So yeah, having done that I now understand that it was probably a joke but it really puts into perspective just how ridiculous things can get with a few 9's.
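Redoing that arithmetic as a quick sketch, assuming a 365-day year and reading "nine nines" as an unavailability of 1e-9 (the function names are made up for illustration). Under those assumptions the budget is roughly 31.5 ms per year, so about 31.5 s per 1000 years and on the order of 114,000 years to bank an hour. The 1.314 s figure quoted above is what you get if the 24 hours per day are left out of the seconds-per-year conversion. The conclusion doesn't change: a few extra 9's get ridiculous fast.

```python
# Downtime budget implied by an availability target.
# Assumptions: a 365-day year; "N nines" means an unavailability of 10**-N.

SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # 31,536,000


def downtime_seconds(nines: int, years: float = 1.0) -> float:
    """Allowed downtime in seconds over `years` for a given number of nines."""
    unavailability = 10.0 ** -nines  # e.g. 9 nines -> 1e-9
    return unavailability * SECONDS_PER_YEAR * years


def years_to_accumulate(nines: int, budget_seconds: float) -> float:
    """Years needed before the allowed downtime adds up to `budget_seconds`."""
    return budget_seconds / downtime_seconds(nines)


if __name__ == "__main__":
    print(f"per year:       {downtime_seconds(9) * 1e3:.3f} ms")
    print(f"per quarter:    {downtime_seconds(9, 0.25) * 1e3:.3f} ms")
    print(f"per 1000 years: {downtime_seconds(9, 1000):.3f} s")
    print(f"years for 1 h:  {years_to_accumulate(9, 3600):,.0f}")
```

Swapping in `nines=3` or `nines=4` gives the more familiar budgets of hours or minutes per year.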

[1] https://earthsky.org/clusters-nebulae-galaxies/triangulum-ga...

[2] https://www.verywellhealth.com/why-do-we-blink-our-eyes-3879...

[3] https://en.m.wikipedia.org/wiki/Flash_(photography) (Wikipedia won't let me deep link on my phone; it's in the electronic flash section under types)

emphasis on when

60% of the time, it's all-green 100% of the time

Down detector is just a statistical page, it does not actually detect downtime, and is in no way aws's status page.

What does downdetector run on?

User reports — i.e. the number of people who google “is X down” and then click a Down Detector link.

It’s a clever way of getting reasonably accurate data very quickly and easily, though it does have its flaws: the data is pretty noisy and users often attribute outages to the wrong service (e.g. blaming their ISP or Microsoft or something when YouTube is down, or vice versa).

I would guess the user is asking what Down Detector's dependencies are... E.g. can their website function if us-east-2 goes down? Or a GCP equivalent? Or are they on a self-hosted server? What would cause the metrics to be "off"?

They really need to stop requiring SVPs or higher to show non-green status on the status page, as other HNers have revealed in last week's AWS post. It's effectively not a status page, and they could probably be sued if it can be demonstrated that X service was down but the status page showed green (since the SLA is based on status page). Should be automated and based on sample deployments running in every region and every service. And they should use non-AWS instances to do the sampling, so they can actually sample when, say, we experience the obligatory black friday us-east-1 outage every year.

I think SVP / GM approval is only needed for yellow / red status. From my time in AWS Support, the Support Oncall and Call Leader / GM delegate worked to approve green-i posts.

If my app won't run for reasons that are not my fault for longer than the SLA guarantees, the affected services should be at least yellow status and I should be accumulating free AWS credits.

They were much faster than usual about updating the AWS Status page.

With some lame ass tiny blue "connectivity issues" informational text. Surely broken routing to two entire DCs is full red for all services available therein?

Like what, the networking is broken but if you could send packets, the services would still work so they are green?

I was still able to reach our service running in us-west-1 when the connectivity issue was still on-going, so I don't know if it was a full interruption.

Our ~four person ops team shouldn't be able to have our status page updated 15 minutes before the upstream status page...

I thought status pages or health pages were designed to check and report status automatically. That was my impression when I first came across them. Apparently they are not automated and are only updated manually. What is the point of a status page if it isn't automated? I'm sure FAANG and the tech conglomerates don't want it automated because of SLAs.

I'm surprised that FAANG companies host their stuff on their competitors' cloud services without a fallback cloud provider in case the primary one goes down. Sure, it costs money, but it would be more effective than putting all their eggs in one basket.

As stated earlier, AWS has financial incentive to not update the status page. Nobody is willing to call them on the conflict of interest in a meaningful, market-changing way.

Perhaps someone could produce an alternate, Patreon-supported status page that accurately reports on the status of AWS services.

Would love to see them called out via new regulations or a lawsuit, however :)

Why is new regulation the answer here? Let everyone move to Azure, if they care that much about status pages and SLAs.

What if everyone has a financial incentive to lie? (They do) Where do we go then? Also, saying "everyone just leave" is a lot easier than everyone "just leaving", but that's tired and repeated. There's a huge mess and tangle of incentives and drawbacks and I don't know if we'd ever get enough support to weed out a service that gets us above the n'th percentile of greatness. As one falls the other will begin to abuse its power, I dont trust any mega Corp to do otherwise. Do you?

Any public communication is handled by people, not machines. No one wants to make an automated status page because there's a shit ton of real noise that users don't need to hear about, and there are a lot of outages that automation won't accurately catch.

> we experience the obligatory black friday us-east-1 outage every year.

Is this a thing?

I tried to monitor service status using https://stop.lying.cloud, but they are also hosted on AWS, and down too.

If they're monitoring AWS downtime they might want to rethink this.

How come? It's accurate.

True, if it is down, then that means AWS is down (not necessarily, obviously). :D But honestly, if they want to monitor AWS, they gotta pick something else for this reason, something that is not down when AWS is.

I guess it depends on whether you like your FALSE's encoded as timeouts :)

Well... Yes. Hahahah

Work smarter, not harder

AWS should monitor itself from Azure or GCP, even DO or Linode makes more sense.

Eating your own dog food shows confidence, but monitoring is a different dimension: there you need to use anything but your own dog food.

It's the only realistic multi-cloud provider scenario I can ever come up with that I would consider actually implementing...

AWS wouldn't monitor itself from a competitor, of course, but they could just as well silo a team and isolate DCs to do independent self-auditing.

I don't know about AWS, but I know a lot of us uptime monitoring makers use (and pay for) competitors' products to know if we're down.

Rightly so. My point is a company can self-audit without having to pay a competitor.

I think that is inherently riskier because you never know on what axis you will have a failure and it is difficult to exclude all shared axes.

But we're talking about a status page which should be basically static. In its simplest form you need a rack in 2+ random colos and a few people to manage the page update framework. Then you make teams submit the tests that are used to validate SLA. Run the tests from a few DCs and rebuild the status page every minute or two.

Maybe add a CDN. This shit isn't rocket science and being able to accurately monitor your own systems from off infrastructure is the one time you should really be separate.
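To make that concrete, here is a rough sketch of the "run the tests from a few DCs and rebuild the page" step. Everything in it is an assumption for illustration: the `ProbeResult` shape, the vantage and service names, and the green/yellow/red thresholds are invented, and the probes themselves (the SLA tests teams would submit) are left out. A timed-out probe is deliberately counted as a failure rather than as "no data".

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class ProbeResult:
    vantage: str          # hypothetical off-infrastructure probe site, e.g. "colo-a"
    service: str          # hypothetical service key, e.g. "ec2/us-west-2"
    ok: Optional[bool]    # True = passed, False = failed, None = probe timed out


def summarize(results: List[ProbeResult]) -> Dict[str, str]:
    """Collapse per-vantage probe results into one colour per service."""
    by_service: Dict[str, List[ProbeResult]] = {}
    for r in results:
        by_service.setdefault(r.service, []).append(r)

    status: Dict[str, str] = {}
    for service, probes in by_service.items():
        # A timeout counts as a failure: an unreachable service is not "up".
        failures = sum(1 for p in probes if p.ok is not True)
        if failures == 0:
            status[service] = "green"
        elif failures < len(probes):
            status[service] = "yellow"  # some vantage points cannot reach it
        else:
            status[service] = "red"     # no vantage point can reach it
    return status


def render(status: Dict[str, str]) -> str:
    """Render a static plain-text page, one line per service."""
    return "\n".join(f"{svc}: {colour}" for svc, colour in sorted(status.items()))
```

The output is a static string, so it can be regenerated on a timer and served from anywhere that does not share fate with the thing being measured.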

That applies when you use competitors too.

They could have a related outage, or even a coincidentally timed one

Absolutely. And even if it’s cheaper to use the competition, an expensive custom solution will be found.

They have a bazillion alexa and kindle devices out there that they could monitor from, heh heh. At least let that phone-home behaviour do something useful, like notice AWS is down.

> AWS wouldn't monitor itself from a competitor, of course

Why not? The big tech companies use each other all the time.

For example, set up a new firewall on macOS and you can see how many times Apple pulls data from Amazon or Azure or other competitors' APIs and services.

Apple is not a competitor to AWS or Azure in any way. They offer no infrastructure/platform as a service that I am aware of.

Apple and Amazon are competitors. Apple and Microsoft are competitors.

The postulation was that Apple and Amazon weren't competitors. Not that they're not competitors in a specific niche.

But the idea that Amazon or Microsoft or Google would host anything at apple is pretty out there.

Apple uses their competitor's services because they can't build their own cloud and host their own shit. The big boys don't use competitors for services they are capable of building themselves.

And yet video.nest.com (Google) resolves to an Amazon load balancer.

A similar reason drives businesses to host `status.product.bigcorp` on a different server. And if your product is a cloud then your suggestion makes sense.

Yeah, I homed https://stop.lying.cloud out of us-west-2. Oops.

Considering the sea of bright green circles, reds might stand out but blues get lost in a fast scroll. Perhaps fade or mute the green icon to improve visibility of non-green which is the interesting information?

The brand is strong if you’re really the owner

How does this service work?

It seems to have all the look and feel of AWS, and somehow has more up to date info than the official AWS status page?

It's the same info - it just changes all blues to yellows and all yellows to reds. :)

I had no idea!

Pretty funny actually.

Now that they're back up they're not reporting any problems, how is it supposed to work? It looks like it is just repeating the status reported on the Amazon status page.

It is. It's just the AWS status page run through a transformation function to:

1. Remove all the thousand green services that no one cares about when looking at AWS status

2. Upgrade all yellows to reds because Amazon refuses to list anything as "down" no matter how bad the outage is.

3. Insert a snarky legend
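For the curious, the three steps above could be sketched roughly like this. This is not the site's actual code: the entry format, the colour names, and the legend text are all assumptions made up for the example, and fetching or parsing the real status page is left out. The blue-to-yellow mapping follows the earlier description in the thread.

```python
from typing import Dict, List

# Assumed colour vocabulary; the real page's markup is not shown in the thread.
UPGRADE = {"blue": "yellow", "yellow": "red", "red": "red"}

# Placeholder legend text, invented for this sketch.
LEGEND = "green: not shown | yellow: something is wrong | red: it is down"


def transform(entries: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Drop green entries and bump every remaining status up one severity level."""
    out = []
    for entry in entries:
        if entry["status"] == "green":
            continue  # step 1: nobody looks at the status page for the greens
        bumped = UPGRADE.get(entry["status"], entry["status"])  # step 2
        out.append({**entry, "status": bumped})
    return out


def render(entries: List[Dict[str, str]]) -> str:
    """Step 3: put the legend on top of the transformed entries."""
    lines = [LEGEND]
    lines += [f"{e['service']} ({e['region']}): {e['status']}" for e in transform(entries)]
    return "\n".join(lines)
```

It also explains why the page shows nothing once Amazon marks everything green again: there is no independent measurement behind it, only a stricter reading of the same input.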

I mean, sounds like it's working as intended then?

Funny I didn't know that and assumed it was okay

That’s hilarious

I wonder if AWS will make more or less money from these outages?

Will large players flee because of excessive instability? Or will smaller players go from single-AZ to more expensive multi-AZ?

My guess is that no-one will leave and lots of single-AZ tenants who should be multi-AZ will use this as the impetus to do it.

Honestly, having events like this is probably good for the overall resilience of distributed systems. It's like an immune system, you don't usually fail in the same way repeatedly.

* Free chaos monkey installed in every AZ

> * Free chaos monkey installed in every AZ

Only during this beta period, AWS will start charging for this feature soon enough.

We (Netflix) begged them for years to create a Chaos Monkey that we could pay for. There were things we just couldn't do ourselves, like simulate a power pull or just drop all network packets on the bare metal. I guess not enough people asked.

CMaaS sounds amazing for resiliency engineering. There's so much I want to be doing to perturb our stack, but I don't know all the ways stuff can go wrong. Sure I can ddos it, kick services and servers offline, etc, but that's what, a few dozen failure modes? Expertise in chaos would be valuable by itself. Not to mention being able to shake parts of the system I normally can't touch.

Side note: terraform is pretty good for causing various kinds of chaos, deliberately or otherwise.

If my company is any indication, they're going to make more money since everyone will simply check the multi-AZ or multi-region checkboxes they didn't before and throw more money at the problem instead of doing proper resiliency engineering themselves.

It doesn’t matter how much resiliency engineering you do. Having everything in a single AZ is a risk. If that is acceptable, then it’s fine; if not, you need to think about multi-AZ from day 1.

Auth0 ran in six AZs in two regions[1] and went down today[2], because they picked the wrong two regions. How many regions and AZs should someone pay for before they get reliability?

1: https://auth0.com/blog/auth0-architecture-running-in-multipl...

2: https://twitter.com/auth0/status/1471159935597793290

At a minimum they should have chosen regions not in the same time zone or general geographic area. US-West 1 and US-West 2 might well be safeguarding against a server failure but is not a disaster plan. If your customers are global, choosing multiple continents is probably prudent.

Whelp, I guess you're not using Cognito then. It has no user account syncing feature so you can't have a user group in more than one region. Grrrrr!

No one just "moves off" AWS. Once your apps are spaghetti coded with lambdas, buckets and all sorts of stuff, it's basically impossible to get off. More than likely, as you noticed, it will increase spending since multi-AZ/multi-region will become the norm.

>I wonder if AWS will make more or less money from these outages?

There is no possibility that outages are good for AWS. Nor is there more money to be made from "publicity" of the outages.

I think GP has a point with,

>Or will smaller players go from single-AZ to more expensive multi-AZ?

No -- if they needed to they already would have migrated to a multi-region. If they don't need it -- they won't have. The reason is simple -- it's expensive as you say. I'm not a fanboi or evangelist of AWS either -- I do have pet theories they named their products with shit names in order to make more money by making AWS skills less transferable to Google Cloud etc. S3 should be Amazon FTP, RDS should be Amazon SQL etc.

> S3 should be Amazon FTP

I... don't think you know what S3 is. Or maybe what FTP is.

(Also S3, EC2, RDS, etc. were named long before GCP had competing services)

I mean, lots of people put off doing something expensive but safer just because it’s expensive, but rethink after the consequences show.

S3 is nothing like FTP? RDS stands for Relational Database Service. You have a valid point but picked the worst examples.

S3 is Simple Storage Service

RDS is Relational Database Service

EC2 is Elastic Compute Cloud

All of these make sense.

If you're gonna complain about names, at least pick the really sucky ones, like Athena, Snowball, etc.

You’re saying businesses always make the right decisions and never put them off?

Not at all the case. It was a regional outage that got Netflix to more than double our AWS spend going multi-region, so that outage netted them millions of extra dollars per year just from Netflix.

You’re underestimating the ability of eng leadership to not take these issues seriously. Only when there’s sufficient pressure from the very top or even the customers it takes a priority.

> There is no possibility that outages are good for AWS.

Do you know how many non-technical CEOs/boards/bosses have told their tech people that they need to go multi-region/cloud because that's what the one-paragraph blog and/or tweet told them to do in response to last week's event?

The actual answer?

In the next 5 calendar years the bottom line will still grow.

However, the brand damage means they permanently lose market share. Which impacts their growth ceiling.

I would not go multiple Availability Zone within the same Infra/Cloud provider...

"Or will smaller players go from single-AZ to more expensive multi-AZ"

Yes! When you have a service interruption pay 2x more! With a region down I am sure other regions won't have any interruptions either! /s

This outage is extremely frustrating to me. My company hosts all our apps in gov cloud. Gov Cloud West 1 is also down, but the AWS Gov Cloud status page indicates that everything is healthy and green. I thought AWS's incident response to the East outage last week was that they'd update the status page to better reflect reality.

Gov Cloud Status Page: https://status.aws.amazon.com/govcloud

We are in the same boat. Finally updated "We are investigating Internet connectivity issues to the US-GOV-WEST-1 Region"

i had multiple govcloud hosted salesforce instances down but they appear to be coming back up now.

Everyone who spent the past week migrating from us-east-1 to us-west-2: this joke is on you. :)

"US-EAST-1 or bust" being manifested right now.

It's not just AWS - check the down reports: https://downdetector.com/

Cloudflare having some significant issues as well on certain domains.

It's possible people are reporting the issue as CloudFlare because that's whose error page they see when a box on EC2 is unreachable.

No, we are not. But customers who use AWS are having trouble.

Thanks for clarifying! Things seem to have settled down.

The list of affected services is a bit all over the place, especially since I highly doubt Xbox Live or Halo is running on AWS.

Down Detector doesn't really detect anything other than people saying "Is [service X] down?" on Twitter, which does mean that Xbox Live appears to be permanently offline if you believe them because the typical user for Xbox Live will declare anything from tripping over their ethernet cable to a tornado levelling their house preventing a connection to mean Xbox Live is down.

It’s still useful if you remove units from the graph and treat it as a sparkline. If there are reliably ~100 Xbox Live complaints on Twitter per hour, then suddenly there are 3000, that’s an outage.
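A minimal sketch of that "ignore the units, look for the jump" reading, assuming you already have hourly complaint counts from somewhere. The function name and the 5x threshold are arbitrary choices for the example and have nothing to do with how Down Detector actually decides anything.

```python
from statistics import median
from typing import List


def is_spike(history: List[int], current: int, factor: float = 5.0) -> bool:
    """True if `current` exceeds `factor` times the median of recent hourly counts."""
    if not history:
        return False  # no baseline yet, so nothing to compare against
    baseline = median(history)
    # Floor the baseline at 1 so a quiet service does not alert on a single report.
    return current > factor * max(baseline, 1)


if __name__ == "__main__":
    usual = [95, 110, 100, 102, 98]  # the "reliably ~100 per hour" case
    print(is_spike(usual, 120))      # a slightly noisy hour
    print(is_spike(usual, 3000))     # the outage case
```

Using the median rather than the mean keeps one earlier bad hour from inflating the baseline.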

If that were true, the line should be flat-ish, but it and playstation's show the same extreme spike at the same time as aws etc.

lol imagine if azure was just AWS in the backend

Is it bad that I can almost see that being a quick and dirty MVP to get out the door while you built your own cloud solution? Raises serious migration and cost issues, but... would be interesting.

I think for some targeted things there might well be "value added" services you could offer to transparently wrap AWS. E.g. a "write-through" S3 wrapper was something I was actually looking at because some clients when I was contracting were very reluctant to trust anything but AWS for durability but at the same time AWS bandwidth costs were so extortionate that renting our own servers from somewhere like Hetzner and then proxying writes both to a local disk and to S3 and serve up from local disk with a fallback to pull a fresh copy from S3 if missing broke even at a quite small number of terabytes transferred each month.

The nice part about something like that is that properly wrapped you can change your durable storage as needed, and can easily even selectively pick "cheaper but less trusted" options for less critical data. It also allows you to leverage AWS features to ride closer to the wire. E.g. to take another example than storage, I've used this to cut the cost of managed hosting by being able to spill over onto EC2 instances in the past, allowing you to run at a much higher utilisation rate than what you can safely on managed / colo / on-prem servers alone - as a result, ironically the ability to spill over onto EC2 makes EC2 far less competitive in terms of cost to actually run stuff on most of the time.
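A rough sketch of the write-through idea described above, with the storage backends abstracted away. The `BlobStore` interface and class names are hypothetical, and `MemoryStore` is only a stand-in so the example runs; a real setup would put a local-disk implementation and an S3 client wrapper behind the same two methods.

```python
from typing import Dict, Optional, Protocol


class BlobStore(Protocol):
    """Minimal interface assumed for both the local cache and the durable store."""

    def put(self, key: str, data: bytes) -> None: ...

    def get(self, key: str) -> Optional[bytes]: ...


class MemoryStore:
    """In-memory stand-in for a backend (local disk or S3 in a real deployment)."""

    def __init__(self) -> None:
        self.blobs: Dict[str, bytes] = {}

    def put(self, key: str, data: bytes) -> None:
        self.blobs[key] = data

    def get(self, key: str) -> Optional[bytes]:
        return self.blobs.get(key)


class WriteThroughStore:
    """Writes go to both stores; reads prefer local and fall back to durable."""

    def __init__(self, local: BlobStore, durable: BlobStore) -> None:
        self.local = local
        self.durable = durable

    def put(self, key: str, data: bytes) -> None:
        # Durable first, so an acknowledged write never exists only on local disk.
        self.durable.put(key, data)
        self.local.put(key, data)

    def get(self, key: str) -> Optional[bytes]:
        data = self.local.get(key)
        if data is None:
            data = self.durable.get(key)  # the expensive path: pull from durable
            if data is not None:
                self.local.put(key, data)  # repopulate the local copy
        return data
```

Because callers only see `put` and `get`, the durable backend can be swapped per bucket or per data class without touching application code.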

> a quick and dirty MVP to get out the door while you built your own cloud solution?

Seemed to work for Dropbox.

For the core services? Definitely. But do we really know that some 3rd party API which doesn't fail gracefully isn't causing this?

HN was also (briefly) down around that same time (roughly 1 hour ago from now).

DownDetector is showing everything down during that period, including Google.

I suspect DownDetector itself suffered some outages during this period, which it shows as outages of every service it monitors.

That's not how DownDetector works. It just relies on reports from users. The real failure case is users not understanding why they can't access whatever end service. Maybe they blame that service, maybe they blame their ISP, maybe they blame something else.

downdetector.com uses users complaints so it’s unreliable as people can blame anything

some sort of widescale attack would be the only explanation right?

This looks weird. At the same time all those services had a spike in outage reports.

can confirm i have multiple salesforce instances down.

Is it AWS or could it be an ISP?

AWS seems to be working for me, but I’ve worked with clients in the US and spectrum internet tended to drop connections to us sporadically, which looks like an outage to our clients but is something we obviously can’t control.

If it's a network issue, it's on their side. I've verified from centurylink, comcast, cogent, he.net, at&t, and verizon - all of them are having issues. This isn't like: Cox is having an outage and just can't get to AWS.

I have an outage way over in the southeast, looks to be affecting the major monopoly ISP. Can't get a tech to our data center until 2PM.

Things were working during the event, but connectivity was pretty messed up


(This is two similarly spec'd boxes on us-east-2 and us-west-2). Looking at GeoIP of connecting clients, the only pattern I can see is the region itself.

I'm wondering the same thing. We have stuff hosted in us-west-2 and multiple people across the US are reporting that our systems are down, however our system is working fine for me here, which is near Toronto.

When us-east was down recently, our apps were not affected and we host on east. Maybe a similar issue?

The east-1 downtime was the interconnection between AWS hosted services, including the control plane, so most resources not dependent on AWS APIs stayed up (eg. non-autoscaled EC2 instances).


Currently we're seeing 40k ms response times from CloudFront distributions, we can't hit PagerDuty (probably runs on AWS), etc.

I guess it could be an ISP thing but I guess we're all assuming 80/20.

I wonder if you really dug into most company's tech stacks, how many of their support tools (e.g., PagerDuty) are reliant on overlapping cloud providers.

Oh man, it is insane. During the aws incident last week we couldn't build software because bitbucket pipelines were all down, due to them running lambdas in us-east-1 only haha.

We've taken a massive turn away from a "decentralized" internet.

it's still decentralized...it's just a centralized version of it right?

just like Cavendish bananas are grown in multiple places...

Yea a number of people got hit by that, Louis Rossmann found out that every form of contact to his business was reliant on AWS east 1. https://www.youtube.com/watch?v=DE05jXUZ-FY

It was an AWS networking issue: 90%+ packet loss pinging Google & Facebook.

I'm so glad that I'm not still the CTO of a startup. I would be getting dozens of e-mails from people without engineering backgrounds asking "Are we multi-cloud", "why didn't you make us multi-cloud"?

Well, why didn't you? :)

The response is that this actually works well enough, so the investment required has not pushed anyone to do it (with that meaning building the core infrastructure to make that easy).

We are seeing issues with requests to Auth0, which I believe is hosted on AWS and has historically gone down when AWS has had issues

We see issues with Auth0 too. Other AWS services we use seem to be working fine so far (us-east-1)

AWS is reporting an issue in us-west-2 on their status page.

Auth0 went down for us as well right when AWS did. At least it's not like those two systems run our entire company...

There was a brief period of time back in the early 90's where I felt I understood how Linux worked -- the kernel, startup scripts, drivers, processors, boot tools, etc... I could actually work on all levels of the system to some degree. Those days are long gone. I am far removed from many details of the systems I use today. I used to do a lot of assembly programming on multiple systems. Today I am not sure how most of the systems work in much detail.

To an extent, this is one of the goals, to free up engineers to work on higher level things. Whether it meets that goal in some cases is debatable, and it’s certainly not ideal for us engineers who like to get to the bottom of things.

“working on higher level things” currently implies that depending on many layers of opaque and unreliable lower level hardware and software abstractions is a good idea. I think it is a mistake.

The best conclusion I can come to is "sometimes it works, sometimes it doesn't". Depends on the context. I've seen cases where it works great and other times where it's a huge hassle.

Funny, I feel the exact opposite way. The low level stuff is where all the magic happens, where performance improvements can scale by orders of magnitude rather than linearly with a CTO’s budget. I’d much rather figure out how to condense some over-engineered distributed solution down to one machine with resources to spare.

Seems like ever since Microsoft bought AWS, it's been going down an awful lot.

> Seems like ever since Microsoft bought AWS, it's been going down an awful lot.



Every time GitHub goes down, multiple people post on HN saying "ever since they were bought by Microsoft, ...". As annoying as those Rust evangelists on every single memory corruption bug.

> As annoying as those Rust evangelists on every single memory corruption bug.

First of all, how dare you!

Second, shoulda used rust ¯\_(ツ)_/¯

I could have written the OP message a year ago -- I used to feel the same way.

Plz don't disparage Rust evangelism!

Rust is awesome. yes it is complex, frequently annoying, easy to learn difficult to master. I'm speaking from a 30 year dev career.

a few months ago I intended to do a quick investigation into RUST to validate my "i really don't need to learn this" specifically for an embedded project. Within a few hours I found I had become a zealot. Rust has too many "omg, i should tell everybody about this" behaviors that I can't even find my favorite aspect yet.

It's equivalent to a lost soul finding Christianity and accepting the lords blessing and forgiveness! The weight that is lifted of being forgiven to your sins resulting == no more guilt, it's all forgiven! immediately reduction of cognitive dissonance. in this example with rust, it's pointer tracking and memory management, but it's basically the same thing. Rust is for the pious developer.

Those people who are still using C++ for fresh starts are the same folks who love to do things the hard & wrong way, or at least those who don't know any better, infidels, unwashed heathen.

Join us. join rUSt.

While I'm not sure whether you're serious ;), to be clear: what annoys me is they don't really understand why we are having "someone pwned your phone via a series of memory corruption bugs" daily.

Until those Rust evangelists manage to rewrite the world in Rust (and I promise you there will still be a lot of security bugs), we still have to fix our shit in a low-cost way, and their evangelism does not help at all and is pure annoyance.

> what annoys me is they don't really understand why we are having "someone pwned your phone via a series of memory corruption bugs" daily.

No, I understand it. I started out in vuln research and have been in defense for a decade. It's probably fair to even say I'm an expert on it.

I'm going to keep advocating for rust as one of the highest ROIs for improving security.

Obviously while using Arch btw

Didn't know Tim Dillon was hanging out here on HN.

Haha wtf?

That was fun. Badges weren't working (daily checkin required) so the front desk had to manually activate them.

Slack wasn't sending messages and Pagerduty was throwing 500's.

... because you need to contact a server 1000 miles away to issue badges in your building.

This cloud-for-everything-even-local-devices thing is both hilarious and sad.

I wonder if anyone had trouble doing their dishes or laundry today, because I'm sure someone thought dish washers and washing machines needed cloud.

I don't know if you can say an on-premise badge hosting service would be more reliable than the cloud.

well, at least you have the agency to do something about it yourself.

also, building access systems should be hosted in the building they reside in for security reasons anyways.

This creates some really fun failure cases of the form "I need to enter the building so anybody can enter the building".

Depending on the cloud is certainly a very stupid decision. Keeping everything inside the building is better, but still not ideal.

Any electronic access system like this requires manual backup. As in, some doors with regular locks using physical keys.

it requires an override anyways in case of emergencies like a fire.

Taking badges out of the cloud reduces points of failure by several orders of magnitude.

Cloud-based badges make sense if you have locations with small staffs and no HR people or managers. Like if you're controlling access to a microwave tower on the top of a mountain.

But badges-in-the-cloud for an office building full of people who are being supervised by supposedly trusted managers, and all of whom have been vetted for security and by HR, is just being cheap.

Like the 1980's AT&T commercials used to say: "You get what you pay for."

> Taking badges out of the cloud reduces points of failure by several orders of magnitude.

I'm not convinced that's true, or at least certainly not an order of magnitude. Wouldn't a badge system hosted on-prem also need a user management system (database), a hosted management interface, have a dependency on the LAN, and need most of the same hardware? Such a system would also need to be running on a local server(s), which introduces points of failure around power continuity/surges, physical security, ongoing maintenance, etc.

All of those things would also be needed by the cloud provider, too. Just because it's on-prem doesn't mean it doesn't need servers, power conditioning, physical security, etc. "Cloud" isn't magic fairies. It's just renting someone else's points of failure.

In addition, you're forgetting the thousands of points of failure between the building and the cloud provider. Everything from routers being DDOSed by script kiddies to ransomware gangs attacking infrastructure to Phil McCracken slicing a fiber line with his new post hole digger.

The remote solution requires all of those same things, plus in addition it requires internet connectivity to be up and reliable, the cloud provider be available and the third party company be up and still in business.

Adding complexity and moving parts never reduces points of failure. It can reduce daily operating worries as long as everything works, but it can't reduce points of failure. It also means that someday when it breaks, the root causes will be more opaque.
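
A rough way to put numbers on the "same things, plus more" argument: if every component in a chain must be up for a badge swipe to work, and failures are independent, the chain's availability is the product of the parts. The sketch below assumes that model. The 99.9% figure and the component lists are made up for illustration, not measurements of any real system, and the model says nothing about whether a provider's individual parts are better run than yours.

```python
from math import prod

def series_availability(components):
    """Availability of a chain in which every component must be up.

    Assumes failures are independent, which real outages often are not.
    """
    return prod(components)

def downtime_hours_per_year(availability):
    """Expected downtime in hours over a 365-day year."""
    return (1.0 - availability) * 365 * 24

# Hypothetical figures: every component is given the same 99.9% availability.
on_prem = [0.999, 0.999, 0.999]            # badge server, power, LAN
cloud = on_prem + [0.999, 0.999, 0.999]    # plus ISP link, cloud provider, vendor service

on_prem_avail = series_availability(on_prem)
cloud_avail = series_availability(cloud)

print(f"on-prem: {on_prem_avail:.4%}, about {downtime_hours_per_year(on_prem_avail):.1f} h/year down")
print(f"cloud:   {cloud_avail:.4%}, about {downtime_hours_per_year(cloud_avail):.1f} h/year down")
```

Under these assumptions the longer chain is always less available. The counterargument upthread amounts to saying the per-component figures are not equal in practice, which this sketch deliberately does not model.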

Within the building’s on premise hosted infrastructure, are they going to buy multiple racks and multiple servers spread far enough apart so that there aren’t many single points of failure that will bring the badge machine down if they fail?

Yes, everyone but you is wrong.

Many logical people have decided to abstract away their soul-crushing anxieties and legal gray area during outages to incredibly stable and well-staffed cloud infrastructure providers.

If you and your team are better at taking care of hardware than an entire building full of highly paid engineering specialists, then that's cool for you, but also, no you're not.

That's not to say you're not capable of running on-prem hardware that is stable.

I'm just saying that the high-handed swiping away of everyone else who's made an incredibly safe and logical decision to host their stuff in the cloud makes me question your general vibe.

> If you and your team are better at taking care of hardware than an entire building full of highly paid engineering specialists

The trade offs aren't quite that simple. Those specialists are necessary because they're building and maintaining infrastructure that's extremely complex since it has a crazy scale and has to be all things to all people. When you're running in-house, your infrastructure is simpler because it's custom tailored to your specific requirements and scale.

There are tradeoffs that make cloud vs local make sense in different contexts and there's no one right answer.

> but also, no you're not.

If you plan to replicate all of AWS I'd agree with you. But if all you need is a handful of servers, you could end up with better uptime doing it in-house just because you don't have all the moving parts that make AWS tick, reducing the chance for something to go wrong.

My bare-metal servers stayed up during both of the recent outages, not because I'm some kind of genius that's better than the AWS engineers but just because it's a dead simple stack that has zero moving parts and my project doesn't require anything more complex.

There is absolutely no reason for a local device (like a door lock or dishwasher as per OP) to depend on any external connectivity. Not to the company on-prem hardware, not to AWS.
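
For what it's worth, a local-first door check can be sketched in a few lines. This is a minimal illustration, not how any real badge product works: `DoorController`, `fetch_allowlist`, and the one-week cache limit are all hypothetical names and values. The idea is that the door decides from a locally cached allowlist and only uses the network to refresh it, so an upstream outage doesn't lock anyone out until the cache goes stale.

```python
import time

class DoorController:
    """Hypothetical door controller that decides from a locally cached allowlist."""

    def __init__(self, fetch_allowlist, max_cache_age_s=7 * 24 * 3600):
        # fetch_allowlist: callable returning an iterable of badge IDs.
        # It is assumed to raise OSError (e.g. ConnectionError) when upstream is down.
        self._fetch = fetch_allowlist
        self._max_age = max_cache_age_s
        self._allowlist = set()
        self._synced_at = None

    def sync(self, now=None):
        """Try to refresh the cache; keep the old copy if upstream is unreachable."""
        now = time.time() if now is None else now
        try:
            self._allowlist = set(self._fetch())
        except OSError:
            return False
        self._synced_at = now
        return True

    def may_enter(self, badge_id, now=None):
        """Decide locally; this never touches the network."""
        now = time.time() if now is None else now
        if self._synced_at is None or now - self._synced_at > self._max_age:
            # Cache never filled or too stale: fall back to a physical key.
            return False
        return badge_id in self._allowlist
```

The trade-off is that a revoked badge keeps working until the next successful sync, so the maximum cache age is really a policy decision, not a technical one.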

Yep, it's broken again. I was trying to install some Thunderbird extensions, and stuff started breaking halfway through. Never thought of an AWS outage borking my mail client I guess...

We lost all public IPv6 in the Linode Newark DC.

This appears to be cross-provider.

Edit: We have IPv6 back.

We're having issues connecting to our EC2 bastions and accessing the us-west-1 dashboard too

EDIT: Cognito auth seems down for us too

EDIT2: our ALBs are timing out as well

EDIT3: us-west-1 looks like it's working now!

That's the price of PIP culture and burning out your devs. Now no one wants to work at Amazon and they can only hire new grads.

I hear they do get people who want to be able to get experience at AWS's scale, there's only a few places for that.

The thing that really gets me is the reports from the last major outage a few days ago about how pervasive lying inside the company is. This really doesn't work well for engineering and we're possibly seeing the results of that. We should certainly expect to see that becoming visible the more time goes on without a major cultural shift. Which given that the guy who ran AWS now runs all of Amazon.com....

Looks like it's taken down SendGrid, NPM, Twitch, Auth0 so far

PlayStation Network went down at the same time.

Stripe as well

Notion as well

How much do you guys think these frequent outages will affect their market share in cloud products?

Is this enough of a push for organizations to actually move over their infrastructure to other providers?

Not at all.

The other cloud providers have had their own outages.

Sadly this, people are entrenched with AWS and the... "We're not the only ones down" thing truly has some effect

Organizations can more easily swallow an AWS failure when they aren't the only ones hit. If they move elsewhere, those outages look more unique

Folks may think multi cloud is a good idea... But you're just as likely to suffer from the extra points of failure as you are to benefit

Multi-cloud is such an odd idea to me. You're either building abstractions on top of things like cloud-provider specific implementations of CDNs, K8S, S3, Postgres, etc...or using the cloud just for VMs. The latter would be cheaper with just old-school hosting from Equinix, Rackspace, etc. The former feels like a losing battle.

It’s prompted discussions of building multi regional services in my org but not multi cloud. They would have to really really really screw up for that to happen… maybe be down for like a week or something.

Reminder that the internet was literally invented to avoid this kind of nuclear attack. But i guess people are herdish animals and prefer to die as a group

More like ultimately all these companies buy into a certain form of vendor lock-in and they have no competence or willingness to migrate or even consider the competition. It starts with "oh I'm just renting a remote virtual server" and in no time it's "Oh, all my stack is tied to AWS proprietary products" because convenience. That's what Amazon wants.

Seems like the Internet level networking is quite robust at this point.

We're having troubles in us-west-2.

Discourse is reporting trouble, too. https://twitter.com/DiscourseStatus/status/14711403698992906...

us-west-1 also seems offline, but us-east-1 (ironically) seems fine

AWS status page shows an update:

> AWS Internet Connectivity (Oregon): 7:42 AM PST We are investigating Internet connectivity issues to the US-WEST-2 Region.

Source: https://status.aws.amazon.com

Oh. not again...

It is surprising that their status page is down too:


Their CDN, CloudFront, always works reliably for me. Couldn't they put the status page on CloudFront?

Takes minutes to update a CloudFront distribution (they say around 5 minutes in their blog post from last year when speed was improved [1]). I think they might want to be able to change it to "everything's back to normal" in an instant, based on the SLA argument I've seen thrown around last time an AWS region was down.

[1] https://aws.amazon.com/blogs/networking-and-content-delivery...

It's minutes to update the distribution settings, but that doesn't have to be the case for the content itself. A much lower cache time can be used.
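
To illustrate the distinction: how quickly new content shows up is governed by how long caches treat a copy as fresh, which the origin can signal with a `Cache-Control: max-age` header, assuming the distribution's TTL settings don't override it. The sketch below is only a model of that freshness rule. `status_page_headers` and `is_fresh` are hypothetical helpers, the 30-second value is arbitrary, and nothing here reflects how the AWS status page is actually served.

```python
def status_page_headers(max_age_s=30):
    """Headers a hypothetical origin could send so caches expire the page quickly."""
    return {
        "Content-Type": "text/html; charset=utf-8",
        "Cache-Control": f"public, max-age={max_age_s}",
    }

def is_fresh(fetched_at_s, now_s, max_age_s):
    """True while a cached copy may be served without going back to the origin."""
    return now_s - fetched_at_s < max_age_s

headers = status_page_headers(30)
print(headers["Cache-Control"])
print(is_fresh(fetched_at_s=0, now_s=10, max_age_s=30))   # still within max-age
print(is_fresh(fetched_at_s=0, now_s=45, max_age_s=30))   # expired, cache refetches
```

With a short max-age, an update to the page propagates within roughly that window, independent of how long a change to the distribution's configuration takes to deploy.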

The status page is working great for me. Did they make it multi-region after the last failure? I'm on the east coast.

Central EU here, appears to be down.

Northern EU, down as well. AWS Management Console in eu-west-1 opens up just fine though.

Edit: Hitting refresh a bunch finally got it open.

Western EU here, appears to be up for me. Maybe a peering issue?

It's back up for me, too, right now. Rather slow, though, and traceroute shows 25 hops. So it might really be peering.

Works for me. It's the usual static page with everything green.

Maybe it is just a static website. Do they even have CSS for red? :D

Not working for me either in the UK.

Down for me, as well.
