I'd be happy to share my experience with you. We've solved problems like this for a lot of folks with small or non-existent internal tech teams. I agree with the small team of experts approach, but there are pitfalls there too.
My understanding is that you are legally required to carry third-party liability (TPL) insurance in MX, but credit cards only provide collision/damage insurance.
> 8:22 AM PST We are investigating increased error rates for the AWS Management Console.
> 8:26 AM PST We are experiencing API and console issues in the US-EAST-1 Region. We have identified root cause and we are actively working towards recovery. This issue is affecting the global console landing page, which is also hosted in US-EAST-1. Customers may be able to access region-specific consoles going to https://console.aws.amazon.com/. So, to access the US-WEST-2 console, try https://us-west-2.console.aws.amazon.com/
> This issue is affecting the global console landing page, which is also hosted in US-EAST-1
Even this little tidbit is a bit of a wtf for me. Why do they consider it ok to have anything hosted in a single region?
At a different (unnamed) FAANG, we considered it unacceptable to have anything depend on a single region. Even the dinky little volunteer-run thing which ran https://internal.site.example/~someEngineer was expected to be multi-region, and was, because there was enough infrastructure for making things multi-region that it was usually pretty easy.
Every damn Well-Architected Framework includes multi-AZ if not multi-region redundancy, and yet the single access point for their millions of customers is single-region. Facepalm in the form of $100Ms in service credits.
> Facepalm in the form of $100Ms in service credits.
It was also greatly affecting Amazon.com itself. I kept getting sporadic 404 pages, and one hit during a purchase. Purchase history wasn't showing the product as purchased and I didn't receive an email, so I repurchased. Still no email; this time the purchase didn't end in a 404, but the product still didn't show up in my purchase history. I have no idea if I purchased anything or not. I have never had an issue purchasing before. Normally I get a confirmation email within 2 or so minutes and the sale is immediately reflected in purchase history. I was unaware of the greater problem at that moment or I would have steered clear at the first 404.
They're also unable to refund Kindle book orders via their website. The "Request a refund" page has a 500 error, so they fall back to letting you request a call from a customer service rep. Initiating this request also fails, so they then fall back to showing a 1-888 number that the customer can call. Of course, when I tried to call, I got "All circuits are busy".
> Facepalm in the form of $100Ms in service credits.
Part of me wonders how much they're actually going to pay out, given that their own status page has only indicated five services with moderate ("Increased API Error Rates") disruptions in service.
That public status page has no bearing on service credits; it's a statically hosted page that's only updated when there's significant public impact. A lot of issues never make it there.
Every AWS customer has a Personal Health Dashboard, which is updated much faster and links issues to your affected resources. Additionally, requests for credits are handled by the customer service team, who have even more information.
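For what it's worth, that dashboard is also queryable programmatically through the AWS Health API, so you don't have to rely on either status page during an incident. A minimal sketch with boto3, assuming a Business or Enterprise support plan (the API requires one) and purely illustrative filters:

```python
import boto3

# The AWS Health API is a "global" service whose endpoint has historically
# been served out of us-east-1, so it may itself be degraded during an
# outage like this one.
health = boto3.client("health", region_name="us-east-1")

response = health.describe_events(
    filter={
        "regions": ["us-east-1"],       # illustrative filter
        "eventStatusCodes": ["open"],   # only events still in progress
    }
)

for event in response["events"]:
    print(event["service"], event["eventTypeCode"], event["startTime"])
```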
This point is repeated often, and the incentives for Amazon to downplay the actual downtime are definitely there.
Wouldn't affected companies be incentivized to file a lawsuit over AMZ lying about status? It would be easy to prove and costly for AWS to defend.
I'm guessing Google, on the basis of the recently published (to the public) "I just want to serve 5TB"[1] video. If it isn't Google, then the broccoli man video is still a cogent reminder that unyielding multi-region rigor comes with costs.
It's salient that the video is from 2010. Where I was (not Google), the push to make everything multi-region only really started in, maybe, 2011 or 2012. And, for a long time, making services multi-region actually was a huge pain. (Exception: there was a way to have lambda-like code with access to a global eventually-consistent DB.)
The point is that we made it easier. By the time I left, things were basically just multi-region by default. (To be sure, there were still sharp edges. Services which needed to store data (like, databases) were a nightmare to manage. Services which needed to be in the same region as specific instances of other services, e.g. something which wanted to be running in the same region as wherever the master shard of its database was running, were another nasty case.)
The point was that every service was expected to be multi-region, which was enforced by regular fire drills, and if you didn't have a pretty darn good story about why regular announced downtime was fine, people would be asking serious questions.
And anything external going down for more than a minute or two (e.g. for a failover) would be inexcusable. Especially for something like a bloody login page.
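(For comparison, and not a description of the setup above: on AWS itself, the usual building block for failing a public endpoint over between regions within a minute or two is DNS failover routing with health checks. A rough boto3 sketch, with a hypothetical hosted zone, health check, and IPs; a real setup needs low TTLs and health checks that actually exercise the login path.)

```python
import boto3

route53 = boto3.client("route53")

HOSTED_ZONE_ID = "Z123EXAMPLE"            # hypothetical
PRIMARY_HEALTH_CHECK_ID = "abcd-1234"     # hypothetical

def failover_record(set_id, role, target_ip, health_check_id=None):
    record = {
        "Name": "login.example.com.",
        "Type": "A",
        "SetIdentifier": set_id,
        "Failover": role,                  # "PRIMARY" or "SECONDARY"
        "TTL": 60,                         # keep low so failover is fast
        "ResourceRecords": [{"Value": target_ip}],
    }
    if health_check_id:
        record["HealthCheckId"] = health_check_id
    return {"Action": "UPSERT", "ResourceRecordSet": record}

route53.change_resource_record_sets(
    HostedZoneId=HOSTED_ZONE_ID,
    ChangeBatch={
        "Changes": [
            # Primary answer, served only while its health check passes.
            failover_record("us-east-1", "PRIMARY", "192.0.2.10",
                            PRIMARY_HEALTH_CHECK_ID),
            # Secondary answer, served automatically when the primary is unhealthy.
            failover_record("us-west-2", "SECONDARY", "198.51.100.10"),
        ]
    },
)
```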
YES! Why do they do that? It's so weird. I will deploy a whole config into us-west-1 or something, but then I need to create a new cert in us-east-1 JUST to let CloudFront answer an HTTPS call. So frustrating.
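For anyone else tripping over this: CloudFront only accepts ACM certificates that live in us-east-1, no matter where the rest of the stack runs. A small boto3 sketch of the dance, with a hypothetical domain:

```python
import boto3

# Certificates used by CloudFront must be requested in us-east-1,
# even if every other resource lives in us-west-1 or elsewhere.
acm_us_east_1 = boto3.client("acm", region_name="us-east-1")

response = acm_us_east_1.request_certificate(
    DomainName="www.example.com",   # hypothetical domain
    ValidationMethod="DNS",
)

# This ARN is what gets wired into the CloudFront distribution's viewer
# certificate; the origin, app servers, etc. can stay wherever you deploy them.
print(response["CertificateArn"])
```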
Apparently it's okay for static data (like a website hosted in S3 behind CloudFront) but seeing non-Australian items in AWS billing and overviews always makes us look twice.
> Even this little tidbit is a bit of a wtf for me. Why do they consider it ok to have anything hosted in a single region?
They're cheap. HA is something their customers pay extra for, not something Amazon applies to itself, and Amazon often lies during major outages. They would lose money on HA and they would lose money by acknowledging downtime. They will lie as long as they benefit from it.
I think I know specifically what you are talking about. The actual files an engineer could upload to populate their folder weren't multi-region for a long time. The servers were, because they were stateless and that was easy to make multi-region, but the actual data wasn't until we replaced the storage service.
I think the storage was replicated by 2013? Definitely by 2014. It didn't have automated failover, but failover could be done, and was done during the relevant drills for some time.
I think it only stopped when the storage services got to the "deprecated, and we're not bothering to do a failover because dependent teams who care should just use something else, because this one is being shut down any year now". (I don't agree with that decision, obviously ;) but I do have sympathy for the team stuck running a condemned service. Sigh.)
After stuff was migrated to the new storage service (probably somewhere in the 2017-2019 range, though I'm not sure exactly when), I have no idea how DR/failover worked.
Thank you for the sympathy. If we are talking about the same product, then it was most likely backed by three different storage services over its lifespan: in 2013/2014 a third-party product that had some replication/failover baked in; 2016-2019 on my team with no failover plans due to "deprecated, don't bother putting anything important here"; then 2019 onward with "fully replicated, capable of automatic failover, and cheaper per GB to replicate, but less flexible for the existing use cases".
Yeah, but I still have a different understanding of what "Increased Error Rates" means.
IMHO it should mean that the rate of errors is increased but the service is still able to serve a substantial amount of traffic. If the error rate is above, let's say, 90%, that's not an increased error rate; that's an outage.
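As a strawman for where that line could sit (the thresholds below are made up, purely to illustrate the distinction):

```python
def classify(total_requests, failed_requests,
             outage_threshold=0.90, degraded_threshold=0.05):
    """Toy health classification from an error rate; thresholds are arbitrary."""
    error_rate = failed_requests / total_requests
    if error_rate >= outage_threshold:
        return "outage"
    if error_rate >= degraded_threshold:
        return "increased error rates"
    return "operating normally"

print(classify(1_000_000, 950_000))  # -> outage
print(classify(1_000_000, 80_000))   # -> increased error rates
```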
Some big customers should get together and make an independent org to monitor cloud providers and force them to meet their SLA guarantees without being able to weasel out of the terms like this…
IAM is a "global" service for AWS, where "global" means "it lives in us-east-1".
STS at least has recently started supporting regional endpoints, but most things involving users, groups, roles, and authentication are completely dependent on us-east-1.
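On a new enough SDK you can at least keep STS traffic inside your own region, so token vending doesn't ride on us-east-1. A boto3 sketch; the region is just an example:

```python
import boto3

# Older SDK defaults send STS calls to the global endpoint
# (https://sts.amazonaws.com), which lives in us-east-1. Forcing the
# regional endpoint keeps token vending local. The same effect can be had
# with AWS_STS_REGIONAL_ENDPOINTS=regional, or sts_regional_endpoints = regional
# in ~/.aws/config, on SDKs that support it.
sts = boto3.client(
    "sts",
    region_name="eu-west-1",
    endpoint_url="https://sts.eu-west-1.amazonaws.com",
)

print(sts.get_caller_identity()["Account"])
```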
When I brought up the status page (because we're seeing failures trying to use Amazon Pay) it had EC2 and Mgmt Console with issues.
I opened it again just now (maybe 10 minutes later) and it now shows DynamoDB has issues.
If past incidents are anything to go by, it's going to get worse before it gets better. Rube Goldberg machines aren't known for their resilience to internal faults.
As a user of Sagemaker in us-east-1, I deeply fucking resent AWS claiming the service is normal. I have extremely sensitive data, so Sagemaker notebooks and certain studio tools make sense for me. Or DID. After this I'm going back to my previous formula of EC2 and hosting my own GPU boxes.
Sagemaker is not working: I can't get to my work (the notebook instance is frozen upon launch, with zero way to stop or restart it), and Sagemaker Studio is also broken right now.
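In case it helps anyone in the same spot: when the console UI is wedged, the notebook lifecycle can sometimes still be driven through the API directly, assuming the control plane itself is answering (which today it may well not be). A boto3 sketch with a hypothetical instance name:

```python
import boto3

sagemaker = boto3.client("sagemaker", region_name="us-east-1")

NOTEBOOK = "my-notebook-instance"  # hypothetical name

# Ask what SageMaker thinks the instance is doing...
status = sagemaker.describe_notebook_instance(
    NotebookInstanceName=NOTEBOOK
)["NotebookInstanceStatus"]
print("current status:", status)

# ...and request a stop if it's stuck. During a regional outage this call
# may fail or hang just like the console does.
if status in ("InService", "Pending"):
    sagemaker.stop_notebook_instance(NotebookInstanceName=NOTEBOOK)
```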
You don't use AWS because it has better uptime. If you've been around the block enough times, this story has always rung hollow.
Rather, you use AWS because when it is down, it's down for everybody else as well. (Or at least they can nod their head in sympathy for the transient flakiness everybody experiences.) Then it comes back up and everybody forgets about the outage like it was just background noise. This is what's meant by "nobody ever got fired for buying (IBM|Microsoft)". The point is that when those products failed, you wouldn't get blamed for making that choice; in their time they were the one choice everybody excused even when it was an objectively poor choice.
As for me, I prefer hosting all my own stuff. My e-mail uptime is better than GMail, for example. However, when it is down or mail does bounce, I can't pass the buck.
Identify, or publicly acknowledge? Chances are the technical teams noticed this fairly quickly and had been working on the issue for some time. It probably wasn't until they identified the root cause and had a handful of strategies to mitigate it with confidence that they chose to publicly acknowledge the issue, to save face.
I've broken things before and been aware of it, but didn't acknowledge them until I was confident I could fix them. It allows you to maintain an image of expertise to those outside who care about the broken things but aren't savvy about what broke or why. Meanwhile you spend hours, days, or weeks addressing the issue and then suddenly pull a magic solution out of your hat, looking like someone impossible to replace. Sometimes you can break and fix things without anyone even knowing, which is very valuable if breaking something carried some real risk to you.
This sounds very self-blaming. Are you sure that's what's really going through your head? Personally, when I get avoidant like that, it's because of anticipation of the amount of process-related pain I'm going to have to endure as a result, and it's much easier to focus on a fix when I'm not also trying to coordinate escalation policies that I'm not familiar with.
:) I imagine it went like this theoretical Slack conversation:
> Dev1: Pushing code for branch "master" to "AWS API".
> <slackbot> Your deploy finished in 4 minutes
> Dev2: I can't reach the API in east-1
> Dev1: Works from my computer
It's acting odd for me. Shows all green in Firefox, but shows the error in Chrome even after some refreshes. Not sure what's caching where to cause that.
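One way to take the browsers out of the equation is to fetch the page directly and look at what caching the server asks for; something like this (the header names are just the usual suspects, and some may be absent):

```python
import requests

# Fetch the status page while bypassing local caches, then print the
# cache-related response headers to see how long intermediaries may
# legitimately serve a stale "all green" copy.
resp = requests.get(
    "https://status.aws.amazon.com/",
    headers={"Cache-Control": "no-cache"},
    timeout=10,
)

for header in ("Cache-Control", "Age", "Expires", "Last-Modified", "X-Cache"):
    print(f"{header}: {resp.headers.get(header)}")
```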
Cool project! I'm building something similar at https://apppack.io. It's not Terraform-based, but it also helps set up, manage, and orchestrate AWS resources for devs.
Using AWS managed services is a huge win for maintainability. A lot of host-your-own PaaS tools are spinning up EC2 instances that you're then responsible for maintaining/patching/securing.
Thanks for the link. Honestly, one of my goals is to avoid becoming too dependent on proprietary clouds. If I wanted something that only works on Amazon, there's always ElasticBeanstalk/Amplify & friends.
Feel free to reach out pete AT lincolnloop.com