
S3 was down - rabidonrails
http://status.aws.amazon.com/
======
scrollaway
Quick tip for those using static (jekyll, hugo...) sites on s3: If you have
cloudflare in front of it, you can turn on aggressive html caching by creating
a page rule: example.com/* => Cache Level => Cache Everything

[https://support.cloudflare.com/hc/en-us/articles/200172256-H...](https://support.cloudflare.com/hc/en-us/articles/200172256-How-do-I-cache-static-HTML-)

This downtime made me realize we weren't caching any html on cloudflare. I
just turned it on and all our static sites are doing fine now (and our bills
are smaller!).

If you're fancy you can even programmatically purge the cache when you do CI
deploys using the cloudflare API.
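
If you go that route, the purge call itself is small. A hedged sketch using only the stdlib and Cloudflare's public purge endpoint; the zone ID and token below are placeholders you'd pull from CI secrets:

```python
import json
import urllib.request

# Placeholders -- substitute your real zone ID and API token from CI secrets.
ZONE_ID = "your-zone-id"
API_TOKEN = "your-api-token"

def build_purge_request(zone_id, token):
    """Build a Cloudflare 'purge everything' request for one zone."""
    url = f"https://api.cloudflare.com/client/v4/zones/{zone_id}/purge_cache"
    body = json.dumps({"purge_everything": True}).encode()
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )

req = build_purge_request(ZONE_ID, API_TOKEN)
print(req.full_url)
# In a deploy script you would then actually send it:
# urllib.request.urlopen(req)
```

Separating request construction from sending keeps the deploy step testable without hitting the API.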

~~~
kccqzy
Programmatically purging caches is slow and costs extra. Just use normal
cache-busting techniques like a different URL after a deployment.

EDIT: Oops. Misread cloudflare/CloudFront.

~~~
manigandham
Cloudflare and all other modern CDNs do this instantly and for free. Only the
legacy ones take longer (usually minutes at most) and might charge a small
extra fee.

~~~
jsjohnst
Akamai takes far longer than a few minutes if you use all their POPs. It's in
the 6-12hr range.

~~~
jsizzle
This is not true. Purges on Akamai take less than ten minutes for entire CP
codes or specific objects. Deploying new cache rules takes less than an hour
in many cases.

~~~
jsjohnst
How many POPs are you using? I can assure you, I am not wrong here and have
internal knowledge of why if you are truly using all POPs it takes so long.

~~~
notyourday
That's not correct. Site delivery on Akamai is purged globally within several
minutes.

~~~
jsjohnst
Look, I know what I'm talking about. Just because you can purge faster because
of a different deployment doesn't mean that's the case for everyone. "Fast
Purge" is not available for every type of deployment.

~~~
notyourday
You are confused.

"Site Delivery" is purged globally within 7 minutes.

"Fast purge"-enabled products (site delivery) are purged globally in under 20
seconds.

All assets for websites are site delivery. Only media services family of
products are not site delivery. Those purge within a couple of hours. If you
are someone who uses media services products, you know how to force Akamai to
go back to origin globally instantaneously and how to trigger invalidation. In
media services you should never serve stale.

~~~
jsjohnst
As I said, it depends on your deployment. Thanks for the reply and confirming
what I said was correct, that under some deployments it can be a multi-hour
purge.

------
mrb
Every time there is an outage at some cloud provider, I enjoy knowing that my
site has maintained 100% availability since its launch. I run 3 redundant
{name servers, web servers} pairs on 3 VPS hosted at 3 different providers on
3 different continents. Even if the individual providers are available only
98% of the time—7 days of downtime per year—my setup is expected to still
provide five nines availability (details: [http://blog.zorinaq.com/release-of-hablog-and-new-design/#co...](http://blog.zorinaq.com/release-of-hablog-and-new-design/#conclusion))

Edit: It's not about bragging. It's not about the ROI. I want to (1)
experiment & learn, and most importantly (2) show what is possible with very
simple technical architectures. HN is the ideal place to "show and tell" these
sorts of projects.
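
The arithmetic behind that five-nines claim is easy to check, assuming the three providers fail independently (which is the load-bearing assumption):

```python
# Three independent providers, each up 98% of the time. The site is down
# only when all three are down at once.
per_provider_uptime = 0.98
p_all_down = (1 - per_provider_uptime) ** 3  # 0.02**3, about 8e-06
combined_uptime = 1 - p_all_down

print(f"{combined_uptime:.6%}")  # 99.999200%
```

Correlated failures (a registrar issue, a BGP leak reaching all three providers) would of course break the independence assumption.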

~~~
brandon272
I guess the part I'm confused about here is the DNS records and DNS pinning.
If your zone returns 1.1.1.1 and 2.2.2.2 and 3.3.3.3 as IP addresses to use,
and I'm browsing your site while resolving to 1.1.1.1 and 1.1.1.1 goes down --
your site will appear down for me, correct?

My browser won't automatically try 2.2.2.2 or 3.3.3.3, or would it?

~~~
fulafel
Not the OP, but won't browsers fall back to the other IPs? In olden days some
browsers would not handle multiple IPs per DNS record correctly, hopefully it
works now.

~~~
brandon272
Not sure, that's what I'm wondering. Sounds like a good solution if they
automatically do fallback. I also wonder how browsers behave depending on how
the server at a particular IP address is responding, as they can respond in
different ways. (i.e. a server might respond with an error, accept a request
but timeout on a response, or appear totally unreachable)

Any HA solution I've seen that attempts to reliably achieve this five nines
capability relies on network-level things like virtual IPs and whatnot. And
I don't consider it a five nines solution if only some customers can access
it. How the browser behaves in this case could be critical depending on how
your visitors use the site. I would not consider a site "up" if it's only
available to some people and not others.
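
For what it's worth, the fallback behavior in question can be modeled in a few lines. This is a simplified sketch, not what any particular browser guarantees; real clients add timeouts, address sorting, and parallel attempts ("Happy Eyeballs"), and the function names here are made up:

```python
import socket

def resolve_all(host, port=443):
    """Return every address the resolver hands back for a host."""
    infos = socket.getaddrinfo(host, port, proto=socket.IPPROTO_TCP)
    return [info[4][0] for info in infos]

def connect_with_fallback(addresses, connect):
    """Try each address in order until one connects.

    `connect` is injected so the fallback logic can be exercised
    without touching the network.
    """
    last_error = None
    for addr in addresses:
        try:
            return connect(addr)
        except OSError as exc:
            last_error = exc
    raise last_error or OSError("no addresses to try")
```

Whether this helps in practice depends on how the dead server fails: a refused connection triggers fallback quickly, while a silent timeout can stall the user for many seconds first.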

~~~
jsmthrowaway
> And I don't consider it a five nines solution if only some customers can
> access it.

Well, that depends on the SLA/SLO, which is really what "nines" is speaking
to. Intuitively I agree, but it can, realistically, not be the case and be
"valid". Doesn't make it right. Just is.

------
ksenzee
Nice to see this getting reported on the AWS status page during the actual
event, for once.

~~~
mi100hael
It was around 10 mins before anything showed up, actually, and 'status1' for a
while after that.

~~~
ksenzee
For anyone else, I'd say that's letting us know a bit late. For AWS? Huge
improvement.

------
zedpm
Kudos to Amazon for sorting out their status page. The last time this happened,
the status page didn't show anything until hours after the outage began. I
just noticed this one maybe 15 minutes ago, and 5-10 minutes later they
acknowledged the problem.

On the other hand, I'm feeling a strong case of "Not this shit again".
Wondering if US-East-1 is more trouble than it's worth, as these outages seem
to happen mostly there.

~~~
sitharus
us-east-1 is where all the shiny new features are released and where most
customers are. From experience the other regions are much more reliable.

~~~
lightbyte
I thought us-east-1 had the most issues because it is the oldest region and
thus uses the oldest hardware.

~~~
BillinghamJ
It’s primarily just due to the incredible scale of the region. It’s easily
twice as large as any other, so any issues relating to scale appear there
first and are fixed before they affect any of the other regions.

~~~
lozenge
It might also be related to IAM and some other stuff being "based" there.

~~~
yeukhon
What do you mean "based" there?

~~~
BillinghamJ
As in, most global services run primarily within us-east-1, and have only
subsystems running in each region.

For example, it's likely that the actual master data stores for IAM are solely
within us-east-1, but the key data is cached and services run in each region.

Similarly, Cloudfront is theoretically a global service, but only ACM
certificates set up within us-east-1 can be used with Cloudfront.

------
andrewaylett
It's in us-east-1. Don't worry me like that! It's definitely not down _globally_ or
else I'd have been paged by now...

~~~
hsod
Ha, I got paged anyway since one of our monitors hits the S3 API which is
hosted in us-east-1

~~~
gboudrias
Does this mean your infrastructure is actually dependent on multiple regions?
Just curious.

------
mattrjacobs
If you want to understand the impact of an S3 outage (or many other kinds)
ahead of time, Gremlin ([http://gremlininc.com](http://gremlininc.com)) has
built a tool to run these scenarios on your infrastructure.

Happy to answer any questions about the tool, either here or over email.

(Disclaimer: I work for Gremlin)

------
temuze
Now let's see who learned from the last major outage 200 days ago :)

~~~
pilom
This was us! Added CloudFlare in front of every bucket with our own
"cdn.example.com" DNS name and changed every reference to s3 in code. Didn't
even know S3 had problems today until I saw this on HN. (note: This outage
taught us to set up monitoring of the S3 assets separate from the "cdn")

------
alexandros
Lots of people confirming this on Twitter[1], and we're also seeing it at
resin.io

[1]:
[https://twitter.com/AlecSanger/status/908402829349572608](https://twitter.com/AlecSanger/status/908402829349572608)

------
vim_wannabe
The great thing about AWS is that you shift the burden of responsibility to
Amazon.

~~~
notyourday
[https://www.whoownsmyavailability.com/](https://www.whoownsmyavailability.com/)

------
cddotdotslash
Good time to set up cross-region replication with multiple CloudFront origins.
I made a tutorial site to show you how: [https://spwa.io](https://spwa.io)

------
bm1362
They've fixed the status indicator; it's now a static asset on the status
page:

* [http://status.aws.amazon.com/images/status1.gif](http://status.aws.amazon.com/images/status1.gif)

------
discordianfish
I've just launched [https://pub.latency.at/](https://pub.latency.at/) which
gives you free Prometheus metrics for various cloud endpoints like S3. Doesn't
look too bad right now though.

It's also hosted on S3 but still up and running. (The main service is
independent of S3 anyway after setup)

~~~
discordianfish
I've added a play-with-docker button to give you a prometheus+grafana instance
scraping this.

Spun this up a few hours ago; it will be gone soon, but if anyone wants to
check it out without waiting for the metrics to trickle in:

[http://pwd10-0-33-3-3000.host3.labs.play-with-docker.com/das...](http://pwd10-0-33-3-3000.host3.labs.play-with-docker.com/dashboard/db/http-duration-site-per-region?refresh=1m&orgId=1&var-site=s3.ap-northeast-1.amazonaws.com&var-site=s3.ap-northeast-2.amazonaws.com&var-site=s3.ap-south-1.amazonaws.com&var-site=s3.ap-southeast-1.amazonaws.com&var-site=s3.ap-southeast-2.amazonaws.com&var-site=s3.ca-central-1.amazonaws.com&var-site=s3.eu-central-1.amazonaws.com&var-site=s3.eu-west-1.amazonaws.com&var-site=s3.eu-west-2.amazonaws.com&var-site=s3.sa-east-1.amazonaws.com&var-site=s3.us-east-1.amazonaws.com&var-site=s3.us-east-2.amazonaws.com&var-site=s3.us-west-1.amazonaws.com&var-site=s3.us-west-2.amazonaws.com&var-region=All&var-ssl_expires=All)

------
Slippery_John
As always I recommend people practice Chaos Engineering [0] to minimize the
impact of such events. Even if the complexity of cross-region failover is too
expensive, having some sort of graceful failure is preferable to simply dying.
Netflix's Simian Army [1] toolset is particularly useful here, though I don't
see an OSS version of Chaos Kong (which simulates the failure of an entire
region) sadly.

[0]: [http://principlesofchaos.org/](http://principlesofchaos.org/) [1]:
[https://github.com/Netflix/SimianArmy/wiki](https://github.com/Netflix/SimianArmy/wiki)

------
RomanPushkin
It happens pretty often, so I started an awesome list:
[https://github.com/ro31337/awesome-aws-alternatives](https://github.com/ro31337/awesome-aws-alternatives)

Please contribute if you know alternatives.

------
ak217
What do people use for mocking S3 behavior with high fidelity, to test
exponential backoff and other resilience strategies in this situation?

I've seen a handful of test libraries but none of them seem to make realistic
up-to-date error injection a priority.
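
The pattern is small enough to hand-roll. A sketch of the shape such a test harness could take; the `SlowDownError` class and fake client here are made-up stand-ins, not a real S3 API:

```python
import random
import time

class SlowDownError(Exception):
    """Stand-in for S3's 503 SlowDown response."""

def get_with_backoff(fetch, key, retries=5, base=0.1, sleep=time.sleep):
    """Retry `fetch(key)` with exponential backoff and full jitter."""
    for attempt in range(retries):
        try:
            return fetch(key)
        except SlowDownError:
            if attempt == retries - 1:
                raise
            # Full jitter: sleep a random amount up to base * 2^attempt.
            sleep(random.uniform(0, base * 2 ** attempt))

class FlakyFakeS3:
    """Fake client that fails a fixed number of times, then succeeds."""
    def __init__(self, failures):
        self.failures = failures

    def get(self, key):
        if self.failures > 0:
            self.failures -= 1
            raise SlowDownError("Please reduce your request rate.")
        return "body-of-" + key
```

Injecting `sleep` keeps the tests instant, and making the failure count explicit lets you assert on exactly how many retries happened.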

~~~
YawningAngel
We use S3-proxy internally and are pretty happy with it.

~~~
rob-olmos
[https://github.com/andrewgaul/s3proxy](https://github.com/andrewgaul/s3proxy)
? (no hyphen)

------
rhelsing
I'm getting issues as well. Here's hoping this issue is just a hiccup and
not like the last outage.. otherwise, it may be time to seriously consider
alternatives. Anyone know of a good solution for a self-hosted s3-esque
service?

~~~
benwilber0
> it may be time to seriously consider alternatives

they have multi-az support if you use it

~~~
zedpm
Not exactly. S3 doesn't have AZs at all; it's only split by region. Further,
a bucket can exist in only one region. You can set up cross-region
replication, but you of course need to flip the bucket coordinates in all your
applications to fail over. It's not nearly as easy as Multi-AZ support in
things like RDS.

~~~
yeukhon
> split on regions

I think you really meant that S3 objects are stored redundantly within each
region, which actually spans multiple DCs.

~~~
zedpm
I meant that a bucket exists in only one region and that you have to replicate
to a different bucket if you want to do anything to improve S3 availability.

------
rdtsc
Still 99.99% availability, just the averaging window was increased by another
100 years :-)

~~~
ceejayoz
S3 commits to being 99.9% available (~9 hours a year). You may be thinking of
the object _durability_ numbers.

------
ajoy
Most of the recent issues seem to be in this particular colo (us-east-1),
which is also their oldest facility. Maybe it's prudent to move resources to
other newer colos or pay for multi-zone availability.

~~~
twistedpair
Amen. We moved everything to US-West-2 years ago and never looked back. When
we see VA on fire, we just sigh.

~~~
yeukhon
Well, if your users are on the east coast / in London and you want to build
something closer without spinning up in Europe yet, us-east-1 is your choice.
So... YMMV.

------
chosenken
Yup, it's taking all our stuff down now too. We are working on moving our
static site to S3 behind CloudFront, testing it now. It's down.

Our Airflow tasks that use S3 are all down.

Basically, we are down :)

~~~
scrollaway
Unrelated to the downtime but what do you use Airflow for? We've been
considering using it for our ETL, but I'm even wondering if we could replace
our Jenkins instance with it (Jenkins currently only runs cronjobs and one-off
tasks; for CI we use travis).

~~~
chosenken
We use airflow mainly for ETL work, backfills, and batch processing. We do a
lot of work with clickstream type data, be it taking data from analytics.js
and loading it into redshift, or taking analytic data from redshift and
loading it into Google Analytics. We have developed many pipes and connectors for
airflow that allow us to connect to many data sources, both at the source and
sink ends. I mainly work on the DevOps team, running our infrastructure and
working on back end systems, so my knowledge of airflow is more high level. I
just keep the system up so it can run ;P FYI I work for a startup out of
Cincinnati, OH called Astronomer, you can find us at
[https://www.astronomer.io](https://www.astronomer.io)

Also, we weren't all the way down; we just saw lots of timeout issues when
reading/writing to S3.

~~~
scrollaway
Your site's SSL cert is not set up correctly :)

Very interesting re redshift => google analytics though, I've never heard of
it done in that direction.

Do you think airflow is suitable for one-off/cron task management as well?

~~~
chosenken
Actually, yes. Airflow has cron support baked in: when you run a task you can
give it a cron schedule, and it then takes care of running your tasks when
scheduled.
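
A minimal sketch of that, against the Airflow 1.x API current at the time; the DAG name, schedule, and command are hypothetical, and this needs Airflow installed to actually run:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

dag = DAG(
    dag_id="nightly_cleanup",        # hypothetical job name
    start_date=datetime(2017, 9, 1),
    schedule_interval="0 3 * * *",   # standard cron syntax: daily at 03:00
)

# The scheduler fires this task whenever the cron expression matches.
rotate_logs = BashOperator(
    task_id="rotate_logs",
    bash_command="find /var/log/myapp -mtime +7 -delete",  # example command
    dag=dag,
)
```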

As for the SSL, I messed up my link. We are in the process of moving the site,
so some of the redirects are not working perfectly at the moment >.>

------
rabidonrails
New update from AWS: 12:21 PM PDT We can confirm that some customers are
receiving throttling errors accessing S3. We are currently investigating the
root cause.

------
veb
Is Slack using S3? I can't seem to upload anything.

~~~
acejam
Yes, Slack uses S3 for attachments and shared items.

------
jaredstenquist
Unable to deploy with Elastic Beanstalk currently. Internal 500 error. They
put up the Blue Diamond icon, so this must be serious.

~~~
baq
Knowing them, the datacenter must be flooding with lava.

------
klinskyc
Thought I was going crazy. We're seeing 'Please reduce your request rate.' for
a lot of our requests as well

------
imrehg
Definitely seems like that... And we should back off and reduce our request
rates, HN, otherwise we'll make it worse!

~~~
rficcaglia
I tried CTRL+ALT+DEL but it didn't help!! :)

~~~
drusepth
Turning it off and back on again as we speak.

~~~
danso
Back before SSDs were common in the average computer, and given the
theoretically low downtimes for AWS, telling someone to just restart their
computer a few times until the website showed up might actually have "fixed" it.

------
barillax
We're seeing the same thing. Been tracking it via Twitter for the last 15
minutes or so:
[https://twitter.com/search?q=s3%20slow%20down](https://twitter.com/search?q=s3%20slow%20down)

------
philhartmanonic
I've gotten a lot of those errors while using the AWS CLI for S3, but it was
intermittent. I was using the sync command, so I just kept kicking it back off
again as soon as it finished, and eventually got everything up there.

------
fernandopj
From status page:
([https://status.aws.amazon.com](https://status.aws.amazon.com))

11:58 AM PDT We are investigating increased error rates for Amazon S3 requests
in the US-EAST-1 Region.

~~~
fernandopj
12:21 PM PDT We can confirm that some customers are receiving throttling
errors accessing S3. We are currently investigating the root cause.

------
xfax
Does anyone else have some of their buckets missing from the console list?

~~~
quickConclusion
Our buckets are there; we just cannot GET content, with error messages about
reducing the rate.

------
runesoerensen
_" 12:49 PM PDT We are now seeing recovery in the throttle error rates
accessing Amazon S3. We have identified the root cause and have taken actions
to prevent recurrence."_

------
devhead
looks like s3 apis are starting to respond correctly now.

hopefully this raises awareness on how important planning for failure is
before you make a design choice to introduce a dependency.

------
parthdesai
Yup, it is down for our site. Some of the images are not being loaded
(us-east-1).

Console says: Failed to load resource: the server responded with a status of
503 (Slow Down)

------
coreyw
We are seeing increased error rates for GET and HEAD requests. So far we have
not been having any issues with PUT requests. We are in us-east-1.

------
enkay
Not sure if it's related, but we had unusual SES timeouts and "too many
requests" errors around the same time this was posted.

------
toddwprice
Down for us as well. `An error occurred with the message 'Please reduce your
request rate.' when writing an object`

------
hyperanthony
Confirmed this is also impacting CodeDeploy in us-east-1. Makes sense, since
it has an S3 dependency for revision locations.

------
1001101
I can see our us-east-1 buckets, and just pulled a file out. WFM, FWIW. IAM
got me a few weeks ago though. :(

------
crgwbr
We're seeing 'Please reduce your request rate' frequently on read and write
operations.

------
stormcode
It's affecting Heroku deploys that rely on buildpacks hosted on S3 (like the
Ruby buildpack).

~~~
insomniacity
That's really disappointing - why isn't something so critical to heroku
deployments distributed across multiple AZs and regions?

------
maccam912
"Slow Down"s for me. We started seeing some weird S3 behavior at about
11:45 PST.

------
redthrowaway
We're getting 503s on s3 assets. Definitely having issues. Cloudfront is still
good.

------
klinskyc
Seeing the "reduce your requests" error message on our end

------
bdibs
Everything seems to be running normally again, at least for me.

------
scrollaway
[https://news.ycombinator.com/item?id=15251374](https://news.ycombinator.com/item?id=15251374)

------
tuna
[https://www.whoownsmyavailability.com/](https://www.whoownsmyavailability.com/)

------
sotojuan
Happening here too (NYC, us-east-1).

------
sgl75
Seems to be resolved.

~~~
daxorid
Status page claims resolution, but errors are still ongoing for us.

------
pierrebeaucamp
Seems like Github might be affected as well? Can't comment or merge PRs.

~~~
pm90
Github isn't hosted on AWS btw.

~~~
jwilk
At least downloads are hosted on S3.

------
cordite
Can't receive orders :D

------
stuffaandthings
yup

EB interface is affected too

------
vacri
I wish that AWS would settle on a standard timezone (preferably UTC).
Troubleshooting the fallout had me mentally converting their PDT status pages
with console graphs in both UTC and 'local browser' time[1]. All for a region
located in EDT.

[1] I think they even have a graph somewhere whose axes are in UTC, but whose
tooltips are in local browser time, but I can't recall for sure.
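
The mental conversion being described is exactly the kind of thing worth scripting. A small stdlib sketch, using one of the status-page timestamps quoted in this thread (the date is assumed for illustration):

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo  # Python 3.9+

# "12:21 PM PDT" from the status page, pinned to an assumed date.
pdt_update = datetime(2017, 9, 14, 12, 21,
                      tzinfo=ZoneInfo("America/Los_Angeles"))
utc_update = pdt_update.astimezone(timezone.utc)

print(utc_update.strftime("%H:%M UTC"))  # 19:21 UTC
```

Using a named zone rather than a fixed -07:00 offset means the conversion stays correct across DST transitions.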

------
jaysunn
We are seeing errors in US-EAST:

error: S3ServiceException:Please reduce your request rate.,Status 503,Error
SlowDown,Rid

------
lasermike026
Oh boy.

------
CityLims
Russia? China? Jealous of our cat pics much?!

------
Achshar
I am having trouble on my vultr VPN as well. Any chance it's related?

~~~
namidark
No

------
rficcaglia
Definitely affecting S3 bucket ops, static site hosting on s3, and even things
like CloudTrail. Anything that uses S3 East VA my guess. Rather severe impact.
Maybe less focus on avocados is what's called for...

