
AWS IAM is having issues again - rootforce
https://twitter.com/RyanGartin/status/1306352941964701696
======
igetspam
Wherever possible, don't use us-east-1. It's one of the older regions and
parts are aging. Yes, I know there are things that are only available in the
old regions but most services are globally available. I've worked with a few
ex AWS SWEs and SREs. They drink the kool-aid and won't say anything bad about
us-east-1 but they also won't launch net-new services there. YMMV

~~~
jen20
This is good advice, but does not help mitigate IAM issues. IAM is a global
service.

~~~
wiredone
IAM is mastered in us-east-1 with all writes going there.

------
peterthehacker
It’s funny how on Twitter this currently has 3 retweets and 17 likes but on HN
it has 110 points!

~~~
specialp
I started seeing this while running Terraform and immediately went to Twitter
to see if it is down for everyone or just me. Same experience. From now on I
will go to HN :)

~~~
kohtatsu
[https://hckrnews.com](https://hckrnews.com) is an alt-reader that sorts the
front page chronologically which helps.

------
jaaames
I guess I'm going outside today.

------
banana_giraffe
What a perfect time to onboard some new employees.

Blah.

------
s09dfhks
I modified some of our IAM policies earlier this afternoon. Then came the
pages saying some of our teams were having IAM issues, which caused me great
discomfort.

~~~
rootsudo
Congratulations, this time it wasn't you!

Maybe, possibly, hopefully...

------
rootforce
This probably needs a better link, but the AWS status page shows everything
up.

UPDATE: Status page now shows it
[https://status.aws.amazon.com/#](https://status.aws.amazon.com/#)

~~~
QuinnyPig
[https://stop.lying.cloud](https://stop.lying.cloud) if you (like me) keep
getting the order of the aws, amazon, and status confused.

~~~
tgsovlerkhgsel
Is this just a gimmick run by AWS and the "honest" in the logo is just a play
on the host name, or is it some third-party version that adds information that
AWS isn't reporting?

~~~
outworlder
Third party obviously. Check whois.

~~~
tgsovlerkhgsel
Does it add/edit any information, or is it just a proxy?

~~~
tuananh
probably just cname.

------
runawaybottle
What’s everyone’s backup plans? Just a ‘We’ll be right back’ page?

~~~
snazz
Switching to that sort of page is probably the most cost-effective solution
for small businesses. I've heard of larger companies running a completely
redundant hot standby on another independent cloud platform and switching DNS
over to the standby when something goes wrong. With auto-scaling, you're not
paying to have the standby running at full throttle. Of course, you have to
exclusively use services that have equivalents on the other cloud provider.
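
A rough sketch of the "switch over when something goes wrong" decision, assuming you only flip after several consecutive failed health checks so a single blip doesn't trigger it. The class name, the threshold, and the "primary"/"standby" labels are all made up for illustration; the actual DNS update is left out.

```python
from dataclasses import dataclass


@dataclass
class FailoverState:
    """Tracks health-check results and decides which site should be active."""

    threshold: int = 3            # assumed: consecutive failures before switching
    consecutive_failures: int = 0
    active: str = "primary"

    def record(self, primary_healthy: bool) -> str:
        """Record one health-check result and return the site that should serve."""
        if primary_healthy:
            # A healthy check resets the counter. There is deliberately no
            # automatic failback: switching back is left as a manual decision.
            self.consecutive_failures = 0
        else:
            self.consecutive_failures += 1
            if self.consecutive_failures >= self.threshold:
                self.active = "standby"
        return self.active
```

Leaving failback manual is a design choice here, not a requirement: it avoids flapping between sites while the primary is only half recovered.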

~~~
redis_mlc
> What’s everyone’s backup plans? Just a ‘We’ll be right back’ page?

I've done a lot of cloud site HA work on large sites.

Waiting out the cloud outage so far ends up being the best solution for almost
all companies, from both engineering and business standpoints. Eat the outage,
but continue with a known-working site afterwards. You just blame the cloud
provider for the downtime.

> I've heard of larger companies running a completely redundant hot standby on
> another independent cloud platform and switching DNS over to the standby
> when something goes wrong.

In theory, that makes sense. In practice, it almost never works.

If by "independent cloud platform" you mean another AZ or region in the same
cloud, that is often attempted and can work reasonably well. If you mean
failover from AWS to GCP, then that's unlikely, since everything is different.

An example is whenever DynDNS goes down for 2-3 hours, and everybody builds
out flaky failover tools that are less reliable than their original DNS
provider - and have to be maintained forever. Might work with one or two
domains, becomes a huge ongoing problem with dozens of them. Also, DNS mgmt.
APIs are flaky in several dimensions (availability, versioning, parameters,
etc.)

Another is that you can't failover to another location that doesn't have all
your data, current certificates, monitoring, etc., and the failover site needs
the capacity of the original site to work. That costs ongoing money and time,
and you never know how well the failover will work or how it will perform.

An example is Heartland, one of the biggest US payment providers, who failed
over to another location and took a 5 day outage. Or GitLab, who took a one
day outage because of database issues.

I have (automatically) failed over a large site for a publicly-traded company
from one AWS region to another, but that took a year of work to set up, and I
understood almost all aspects of the site. Afterwards, I realized that almost
nobody really has time to organize that either at a conceptual or engineering
level. And organizations don't recognize Herculean efforts like that, so think
twice beforehand.

Key point: always involve your DBA from the beginning when doing a project
like this.

------
dvtrn
Did this link get changed to a twitter post from something else?

~~~
rootforce
Yes, it was a link to nitter.net (an alternative twitter front end) due to HN
guidelines for posting links to original sources.

------
NathanWilliams
Noticed something odd today I think is connected to this.

The other day we started using Access Advisor, and we found some of our KMS
key policies with a Principal of '*'.

It wasn't marked as globally open, so we planned to fix them a little later.

This morning we found that status had changed.

While we were in the wrong to begin with, it was a little surprising to find
the interpretation of the key policy changing overnight.

Of course it became our top priority and is now fixed. Something to look out
for...
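
For anyone wanting to check their own policies, here is a minimal sketch that flags Allow statements with a wildcard principal and no Condition. It only inspects the policy JSON locally; the function name and the example policy are invented for illustration, and this is not how any AWS tool actually evaluates policies.

```python
import json

# Hypothetical key policy used only to exercise the function below.
EXAMPLE_KEY_POLICY = """
{
  "Version": "2012-10-17",
  "Statement": [
    {"Sid": "Open", "Effect": "Allow",
     "Principal": {"AWS": "*"}, "Action": "kms:*", "Resource": "*"},
    {"Sid": "Scoped", "Effect": "Allow",
     "Principal": "*", "Action": "kms:Decrypt", "Resource": "*",
     "Condition": {"StringEquals": {"kms:CallerAccount": "111122223333"}}},
    {"Sid": "Root", "Effect": "Allow",
     "Principal": {"AWS": "arn:aws:iam::111122223333:root"},
     "Action": "kms:*", "Resource": "*"}
  ]
}
"""


def wildcard_statements(policy_json: str) -> list:
    """Return Allow statements whose Principal is '*' and that have no Condition."""
    policy = json.loads(policy_json)
    statements = policy.get("Statement", [])
    if isinstance(statements, dict):  # a single statement may be a bare object
        statements = [statements]

    flagged = []
    for stmt in statements:
        if stmt.get("Effect") != "Allow":
            continue
        principal = stmt.get("Principal")
        is_wildcard = principal == "*"
        if isinstance(principal, dict):
            aws = principal.get("AWS")
            values = aws if isinstance(aws, list) else [aws]
            is_wildcard = "*" in values
        # A Condition may or may not narrow access enough; this sketch simply
        # treats unconditioned wildcards as the ones to look at first.
        if is_wildcard and "Condition" not in stmt:
            flagged.append(stmt)
    return flagged
```

Calling `wildcard_statements(EXAMPLE_KEY_POLICY)` on the example above flags only the statement with Sid "Open".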

------
TheYahiaBakou
I just feel bad for the oncalls...

------
BillinghamJ
Looks to be affecting all regions - at least within the standard aws
partition. Not sure about aws-cn and aws-us-gov
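
If you're unsure which partition a resource lives in, it is the second colon-separated field of its ARN. A minimal sketch; the function name and the account ID in the usage note are placeholders.

```python
def arn_partition(arn: str) -> str:
    """Return the partition field of an ARN (arn:partition:service:region:account:resource)."""
    # Split into at most six fields so colons inside the resource part are preserved.
    parts = arn.split(":", 5)
    if len(parts) != 6 or parts[0] != "arn":
        raise ValueError(f"not an ARN: {arn!r}")
    return parts[1]
```

For example, `arn_partition("arn:aws-us-gov:iam::123456789012:role/example")` gives `"aws-us-gov"`, while the same role in the standard partition gives `"aws"`.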

~~~
ArchOversight
Not seeing any issues in aws-us-gov.

~~~
krrishd
TIL of aws-us-gov

~~~
dijit
usually referred to as gov-cloud.

Has its own version of everything, physically segmented, even for global
systems such as IAM.

You have to be a US Citizen to work on it and you need special security
clearances.

~~~
cperciva
You might be getting GovCloud and AWS Secret Region confused. I've been told
that I can access GovCloud if I cross the border from Canada, despite not
being a US Citizen. (This came up in the context of providing FreeBSD AMIs.)

~~~
BillinghamJ
I believe the secret region is just part of GovCloud. Afaik, GovCloud is just
the marketing name for aws-us-gov

~~~
schwank
GovCloud is a separate partition from Secret. Different regulatory framework
alignment and customer onboarding.

GovCloud customers only need be a US person or entity, beyond that any further
regulatory alignment is up to the customer. AWS does not audit the IAM user
base for nationality or any compliance requirements.

Disclaimer: I am an AWS Public Sector Solutions Architect.

~~~
BillinghamJ
Ah it looks to be aws-iso (c2s.ic.gov, top secret) and aws-iso-b
(sc2s.sgov.gov, secret)?

------
TazeTSchnitzel
Off-topic: I hadn't heard of nitter.net before, it seems pretty cool.

~~~
dang
Sorry to disappoint, but we've changed the URL from
[https://nitter.net/RyanGartin/status/1306352941964701696#m](https://nitter.net/RyanGartin/status/1306352941964701696#m)
to the original source, as the site guidelines ask:
[https://news.ycombinator.com/newsguidelines.html](https://news.ycombinator.com/newsguidelines.html).

~~~
rootforce
Yep, sorry about that. Thanks for the update.

------
holler
my site is down on all environments (us-east-1). oy vey

~~~
leesalminen
Is your site dependent on IAM?

~~~
andrewxdiamond
Pretty hard to not be dependent on IAM. Authentication and authorization are
some of the most core concepts you can have

~~~
cperciva
Amazon says that "The issue continues to affect create, describe, modify or
delete of IAM accounts and roles. [...] Authentication using IAM accounts and
roles are not affected."

So it's entirely possible to depend on IAM but not in a way which this is
breaking.

~~~
sk5t
This makes sense. We observe services that rely on IAM/STS very, very
routinely (but without changing IAM properties), and no alerts popped up
during this outage.

------
sebmellen
I can always rely on HN for finding out why AWS is broken... Is it time to
switch to GCP?

~~~
renewiltord
No, sometimes you see GCP failures posted here too. Your best bet is to move to
Oracle Cloud. I have never seen an Oracle Cloud (not Oracle Data Cloud, which
is the ad-tech product) outage posted here so I think they must have 100%
uptime.

~~~
MattSayar
Sarcasm detected! There's always going to be pros and cons to every provider,
so the obvious choice is to host everything on my home theater PC since I never
turn that thing off.

~~~
dijit
Please don't consider this a counterpoint, but I have had single physically
deployed servers that have significantly higher average uptime than my ec2
instances in AWS us-east-1 in the last 5 years.

(significantly higher == 100% availability, it hasn't gone down... yet)

~~~
viraptor
[https://en.wikipedia.org/wiki/Survivorship_bias](https://en.wikipedia.org/wiki/Survivorship_bias)

~~~
theli0nheart
This doesn't apply. Having a server fail on you doesn't preclude you from
posting to HN.

~~~
viraptor
But it becomes uninteresting, which causes self-censorship. How many times did
you hear about someone's server running for 12 minutes? How many times about a
server running for over a year? Nobody posts about the first one. (Unless it
was a spectacular failure for some reason)

Similar issue comes with estimating how reliable things are. People are more
likely to respond "I had an issue with X too, here's my story" rather than
"all good, nothing to report".

------
gscho
Let's all move to serverless!

~~~
ryanmarsh
Legit not sure if you're being sarcastic or not.

~~~
loopdoend
Maybe your sarcasm detector is broken.

~~~
ryanmarsh
I'm not a robot, I'm a person, please speak to me as anyone you'd meet in
person.

------
tus88
It certainly is. And I came in this morning specifically to create some
Lambda roles to test. Fark.

------
panny
I noticed (pre outage) the IAM console won't work at all if I put
--disable-reading-from-canvas in my launch args to prevent fingerprinting. All
the other service consoles I use work. I have to have a special config for my
browser just for
AWS because of it. Wishful thinking, but maybe they're fixing that just for
me.

