
Dodging S3 Downtime with Nginx and HAProxy - zeeg
https://blog.sentry.io/2017/03/01/dodging-s3-downtime-with-nginx-and-haproxy.html
======
nodesocket
Heads up: a simple yet production-ready NGINX location block to proxy to a
public S3 bucket looks like:

    
    
        # matches /s3/*
        location ~* /s3/(.+)$ {
            set $s3_host 's3-us-west-2.amazonaws.com';
            set $s3_bucket 'somebucketname';
    
            proxy_http_version 1.1;
            proxy_ssl_verify on;
            proxy_ssl_session_reuse on;
            proxy_set_header Connection '';
            proxy_set_header Host $s3_host;
            proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
            proxy_set_header Authorization '';
            proxy_hide_header x-amz-id-2;
            proxy_hide_header x-amz-request-id;
            proxy_buffering on;
            proxy_intercept_errors on;
            resolver 8.8.4.4 8.8.8.8;
            resolver_timeout 10s;
            proxy_pass https://$s3_host/$s3_bucket/$1;
        }
    

Adding NGINX caching on top of this is pretty trivial.
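For example, a minimal sketch of what that could look like (zone name, cache
path, sizes, and validity times here are illustrative, not a recommendation):

    
    
        # in the http block
        proxy_cache_path /var/cache/nginx/s3 levels=1:2 keys_zone=s3_cache:10m
                         max_size=50g inactive=30d;
    
        # inside the location block above
        proxy_cache s3_cache;
        proxy_cache_valid 200 302 24h;
        proxy_cache_key "$s3_bucket/$1";
    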

Also, heads up: in the proxy_cache_path directive, they should consider
setting "use_temp_path" to off. By default, NGINX first writes responses to a
temporary area and then copies them into place; setting this parameter to off
instructs NGINX to write them directly to the directories where they will be
cached, avoiding unnecessary copying of data between file systems.
use_temp_path was introduced in NGINX version 1.7.10 and NGINX Plus R6.

    
    
        use_temp_path=off
    

Also, they should enable "proxy_cache_revalidate". This saves bandwidth:
when a cached item expires, NGINX revalidates it with a conditional GET, and
the server sends the full item only if it has been modified since the
time recorded in the Last-Modified header.

    
    
        proxy_cache_revalidate on;

~~~
mattrobenolt
In our case, we don't need to ever revalidate. We store things forever since
our file blobs are immutable.

~~~
bpicolo
Immutable blobs are really the right choice with S3, as it's eventually
consistent (when using it as a blob store, anyway; if you're hosting a static
site or similar, it's a bit tricky to make immutable and not necessarily worth
the effort).

~~~
mattrobenolt
We even go a step further, and our blobs are 100% content addressable. :) So
caching is super easy for us.

~~~
bpicolo
Yep, file shas are a great choice. UUIDs are typically fine too.

One sort of weird case is if I have an image key (sha-based) and want to store
thumbnail sizes: 'bae6ff187e4c491e5de9cfa3b039ce7da8255798' makes sense as a
base key, but really I want bae6ff187e4c491e5de9cfa3b039ce7da8255798/400x400
for thumbnails rather than storing individual thumbnail shas, hah.
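A tiny sketch of that key scheme (hypothetical helper names; SHA-1 shown only
because it matches the example key above):

```python
import hashlib

def blob_key(data: bytes) -> str:
    # Content-addressable key: the SHA-1 of the blob's own bytes.
    return hashlib.sha1(data).hexdigest()

def thumbnail_key(base_key: str, width: int, height: int) -> str:
    # Derived assets hang off the base key instead of getting their own sha.
    return f"{base_key}/{width}x{height}"

key = blob_key(b"hello world")
print(key)                           # 2aae6c35c94fcfb415dbe95f408b9ce91ee846ed
print(thumbnail_key(key, 400, 400))  # 2aae6c35c94fcfb415dbe95f408b9ce91ee846ed/400x400
```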

------
newobj
Alternate title: "Replacing S3 downtime for vastly greater amounts of your own
downtime"

What is the name for this phenomenon where folks think they can out-available
a thing that has multiple engineers singularly dedicated to nothing more than
its availability /and/ operation? Is it just hubris? Surely there must be a
more clinical name.

~~~
mattrobenolt
Yeah, one of our goals was to not add a new/weaker point of failure.

We gracefully fall back to S3 directly if our cache server is down without a
hiccup. So there is no operational overhead of this additional cog. If the
server has a failure, we'd go back to slightly degraded performance by talking
across the country until we brought it back online.
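A minimal HAProxy sketch of that kind of fallback (server names, addresses,
and the health check are made up, not Sentry's actual config):

    
    
        backend s3_proxy
            # prefer the local cache box while its health check passes
            option httpchk HEAD /
            server cache01 10.0.0.10:80 check
            # "backup" servers only receive traffic when every non-backup
            # server is down, so requests fall straight back to S3
            server s3_direct s3-us-west-2.amazonaws.com:443 ssl verify none check backup
    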

~~~
newobj
The idea that a local proxy means no operational overhead is a dangerous
fallacy.

If that's an earnestly literal statement from you, then it means you simply
haven't encountered the failure modes that these kinds of setups are inclined
towards.

I've worked at several BigCo's, seen them all implement this pattern, and seen
every single one of them have fleet-wide outages due to these innocent "local
proxies".

Remember FB's 2-3 hour outage 2 years ago?

[https://www.facebook.com/notes/facebook-engineering/more-details-on-todays-outage/431441338919/](https://www.facebook.com/notes/facebook-engineering/more-details-on-todays-outage/431441338919/)

It was /exactly/ this kind of "local proxy for higher availability/caching
over the downstream thing" that caused the outage.

~~~
mattrobenolt
Sure, there's definitely risk. I'm not asserting that it's literally zero
chance. But the risk of this is also tied up with other things that leverage
this proxy. So it's not adding a new dependency or a new point of failure. If this
has a problem, we'll also have problems talking to other services in our
network.

And for what it's worth, I've definitely fucked this up in the past and caused
downtime as a result of a setup like this. The pros still outweigh the cons in
practice.

~~~
newobj
You can't add a new thing without adding a new point of failure. Every point
is a point of failure.

Who deploys the thing?

A human? God knows they can screw it up.

Automated deployment?

Well, that's how you get a simultaneous failure and total outage.

Automated incremental deployment?

Ok, slower road to total outage.

Automated incremental that will halt itself or rollback based on reliability
metrics?

Ok, getting there.

Wait, was the local proxy load tested?

Was it load tested when one of your data centers is down and everything is
doing 30% more work?

And on and on and on. It's all operational overhead, it's all ways to fail.

Can you tell I used to work in monitoring? Maybe I just have PTSD now. :P

~~~
mattrobenolt
> You can't add a new thing without adding a new point of failure. Every point
> is a point of failure.

Correct, but it's an existing process. So you're right, we could ship a
blatantly bad config.

> Who deploys the thing?

We do, humans, yes. We can definitely screw up a config.

> Automated deployment?

We tend to do blue/green deploys on critical pieces of infrastructure just to
sanity check it. We might even pull a node out of production, test on a
staging server, etc.

> Wait, was the local proxy load tested?

Yes. The load we need for this case is not even close to significant.

> Was it load tested when one of your data centers is down and everything is
> doing 30% more work?

Yes, it's literally just a proxy to S3 doing no additional work. For our
traffic, the load is not a concern. Especially since it's running on every
machine, it's distributed pretty well. A single box cannot overload our
haproxy process compared to the CPU needed to run the Python application
itself.

> Can you tell I used to work in monitoring? Maybe I just have PTSD now. :P

DataDog, it's pretty dope. It gives us lots of super good insight into all of
these things, and it's what alerted us when haproxy reported S3 down in the
first place. It'd also tell us the moment a process like this crashes, etc.

~~~
newobj
DataDog was my customer...

------
jlgaddis
Lots of criticism in this thread by folks who are missing the point.

@mattrobenolt -- neat idea and the 70% savings in bandwidth is awesome. The
side effect of helping mitigate the S3 issue for you was a sweet little bonus!

~~~
mattrobenolt
Also, should note, it's about a 98% savings in bandwidth, but cost savings
explicitly was 70% since we have to factor in the cost of running this new
server.

------
koolba
> This proxy service had been running for a week while we watched our
> bandwidth and S3 bill drop, but we had an unexpected exchange yesterday
> morning: [pic showing S3 offline]

Talk about great timing!

------
asher_
There have been a lot of articles and suggestions for mitigating S3's single-
region architecture (design flaw?) since the outage.

One solution that I haven't seen much of is to just use a service that gives
you multi-region without any extra work, such as
[https://cloud.google.com/storage/docs/storage-classes#multi-regional](https://cloud.google.com/storage/docs/storage-classes#multi-regional)

~~~
mattrobenolt
This doesn't solve our performance issues that I originally set out to
address.

------
jaymichael
For object sync between on-premise and S3, AWS offers Storage Gateway. It
supports IAM, encryption, bandwidth optimization, and local cache for hot
objects with availability built in.

[https://aws.amazon.com/storagegateway/details/](https://aws.amazon.com/storagegateway/details/)

~~~
mattrobenolt
I'm afraid to know how much that costs.

~~~
jaymichael
Pricing is here:
[https://aws.amazon.com/storagegateway/pricing](https://aws.amazon.com/storagegateway/pricing)

As with most services, you pay as you go and only for what you use. The prices
are aligned with typical AWS storage, request volume, and data transfer costs.

------
matt_wulfeck
If you're making range requests against S3 objects, look no further than
Varnish as a reverse S3 proxy. More details on accomplishing this here:

[https://moz.com/devblog/how-to-cache-http-range-requests/](https://moz.com/devblog/how-to-cache-http-range-requests/)

------
StreamBright
This is actually a great idea. I like datadoge on the pic. What sort of bot is
that? Does anybody know?

~~~
mattrobenolt
That's DataDog. :)

------
openasocket
This is a neat way to deal with it, and has numerous other benefits, but I
thought I'd add another thing to try: deploying your data in multiple regions.
You can set up a secondary bucket in a secondary region, and configure your
primary bucket to replicate data to the secondary bucket automatically. And
then set up your infrastructure to switch to the secondary bucket for read
operations should there be a problem with the primary. With large amounts of
data the costs could add up, but it has the advantage that you can still serve
all read requests during a regional outage, not just the reads that are in the
cache.

Obviously this method has cost associated with it, so you should probably only
do this if you need complete data availability.
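As a sketch, the replication rule passed to S3's PutBucketReplication API
could look something like this (account ID, role, and bucket names are
hypothetical):

    
    
        {
          "Role": "arn:aws:iam::123456789012:role/s3-replication-role",
          "Rules": [
            {
              "Status": "Enabled",
              "Prefix": "",
              "Destination": { "Bucket": "arn:aws:s3:::secondary-bucket" }
            }
          ]
        }
    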

~~~
Florin_Andrei
> _configure your primary bucket to replicate data to the secondary bucket
> automatically_

Well, when the primary goes down, your write operations would get busted.

If the upstream source allows it, just write to both buckets at once. E.g.
with Logstash this is trivial.

~~~
openasocket
I put this elsewhere in the thread, but you can also set up bi-directional S3
replication. I haven't used it in production, but in theory it would mean you
can continue write operations during failure. And those writes that are
committed to the primary but not replicated when the region goes down wouldn't
be lost, they'd come back up once recovery is complete (S3's SLA for data
integrity is a lot stricter than its uptime). Whether or not that is
acceptable depends a lot on your use case.

------
movedx
Have you considered using S3 bucket replication and having your application's
logic fail over to the replication target in the event of a regional failure?
The former is a checkbox; the latter is 30 minutes of coding (in my
experience).

~~~
mattrobenolt
S3 bucket replication is a bit flawed, and doesn't buy us anything on top of
what we implemented. And would cost double for storage. Plus more complex
application logic which I wanted to avoid. Tools like haproxy are pretty good
at this.

With S3 replication, you still have a primary/replica setup in which only one
of them can accept writes, but you can accept reads from both. So we'd gain HA
between multiple regions, but we wouldn't solve our original goal: speed. The
round trip to S3 was too slow for us.

~~~
movedx
Very cool. I did read the article but I missed that speed was important, so
the custom solution was an excellent choice.

Another idea might be to use Varnish for the caching layer, but I haven't
compared Varnish to NginX in many years so the gap has probably been closed
now?

Good work. I've stuffed this one in the back pocket for future use.

~~~
mattrobenolt
I have tons of experience with Varnish and a long history there. Varnish is
really bad for this since it's memory only. I wanted to use 750GB of disk
space, not the 32GB of RAM we had. 750GB of RAM is significantly more
expensive.

And in our case, performance between reading some bytes from disk vs memory
isn't significant. A disk seek is still many many orders of magnitude faster
than a round trip to Amazon.

With that said, Varnish does offer the ability to use mmaped files, but the
performance is really appalling out of the box, and just not worth it. Varnish
is way better if you want strictly in memory cache.

Another benefit of nginx is the cache won't be dumped if the process restarts,
unlike Varnish.

------
mnutt
One thing that complicates easy replication is the rules around CNAMEing your
S3 bucket to a subdomain you control. You have to name the bucket
subdomain.your-domain.com, then CNAME the subdomain to
subdomain.your-domain.com.s3.amazonaws.com. Your replicated bucket needs a
unique name, so when you want
to fail over it's unfortunately not just a matter of changing a DNS entry to
point to the alternate bucket.
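For illustration, with a hypothetical domain (the bucket itself must be named
assets.example.com for S3 to serve it at this hostname):

    
    
        assets.example.com.  300  IN  CNAME  assets.example.com.s3.amazonaws.com.
    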

Of course, if you have complete control of the client you can just change the
hostname. But if you have references scattered throughout html files, you'll
likely want a reverse proxy in front of S3.

------
c17r
I've built similar proxy/cache setups with just nginx.

Is HAProxy there to get more insight into things, like the slack notification?
Or does it serve another purpose?

~~~
mattrobenolt
It directs traffic conditionally based on if our cache server is up or not,
falling back to S3 directly. In theory, we could use nginx for this, but nginx
isn't as good a generic load balancer, since it doesn't have as rich
insight into the status of upstreams. And yes, there's the nice benefit of being able
to do alerting based on uptimes and whatnot. We already have haproxy running
here, so it was just adding another frontend to support this.

~~~
c17r
Thanks for the info.

Is the next step to locally cache uploads, so that writes work while S3 is down?

~~~
mattrobenolt
That's another problem that'd require much more work than just a few hours of
hacking around. :)

------
sfeng
Another good option is to store your files in a second cloud storage service,
and use CDN or DNS failover.

~~~
mattrobenolt
This would unfortunately be slower, since it's not within our datacenter and
on our network, and significantly more expensive. Probably at least 100x more
expensive for the bandwidth than S3 by itself.

------
Hydraulix989
Cloud services like S3 are making programmers too complacent these days.

Back in the day, every production web site had something like this "HAProxy"
in place.

Now, S3 goes down, and everybody has their thumbs up their asses because that
wasn't covered in the Rails bootcamp they went to.

------
krishnasrinivas
If the client sends an incorrect Authorization header, you will still serve
from the cache. This is insecure.

~~~
mattrobenolt
This is a trusted, internal, private network. The only one who could do this
is the application itself, or something rogue on our network. If something
were running rogue on our network, there'd be worse things it could get access
to.

~~~
krishnasrinivas
I see; if this is a private network, then this is a nice, simple solution for
caching. We plan to implement S3 caching in Minio
[[https://minio.io](https://minio.io)] (i.e. it will authenticate the requests
and also do caching), in case you'd be interested in public-facing caching
proxies.

~~~
mattrobenolt
Yep, it's definitely possible to go this route as well. We just didn't have
to.

------
AlexCoventry
How does authorization and access control interact with the proxy? Does it
first check authorization with S3 and cache the result, use a parallel ACL, or
just allow access to anything by anyone?

~~~
mattrobenolt
Good question. Authorization is passed along upstream to S3, but we don't re-
check authorization when serving a cache hit. In our case, this is a fine
tradeoff since our network is private and trusted.

~~~
mattrobenolt
To be more clear, there are other things we could do here if we didn't
inherently trust our network and the things running there.

~~~
AlexCoventry
Thanks, I understand.

------
cagataygurturk
Neat. You should be saving on bandwidth costs also.

~~~
mattrobenolt
As mentioned, we are saving a lot in S3 transfer cost. :)

------
mixedbit
Can you disclose which provider you use to host the servers? Do you have
fixed bandwidth or pay per GiB of transfer?

~~~
mattrobenolt
SoftLayer. All internal bandwidth is free. Public network is a fixed
allocation of bytes per server, then overages.

------
noway421
Essentially a fluke and not a general case people should follow.

~~~
mattrobenolt
Disagree. Considering the original goal had nothing to do with expecting S3 to
go down, it just happened to be super useful during this incident. We get
plenty of benefits even without S3 breaking.

~~~
cakoose
OP is a little abrupt, but they aren't downplaying the system's usefulness to
you guys; they're just warning other readers that this isn't a general
solution for, as the post is titled, "dodging S3 downtime". It's a "fluke"
precisely because it "just happened" to help with something it wasn't designed
for.

It's not uncommon for applications to have MRU access patterns and be able to
keep partially functioning during partial data availability. For these
applications, a cache will lower costs and mitigate S3 outages. It would have
been nice for the article to give the criteria up front.

~~~
mattrobenolt
Is paragraph 2 of the post not upfront enough? I clearly stated our goals for
the project, and the fact that this also saved us through the incident was a
great side effect. But I never set out to mitigate S3 failures like this.

~~~
cakoose
> Is paragraph 2 of the post not upfront enough? I clearly stated our goals of
> the project

That paragraph says your connection to S3 is slow. Your solution doesn't fix
general S3 slowness. However, it does make S3 slowness less of a problem for
your application.

That's why it would have been useful to describe how your application accesses
S3, so others can quickly determine if their applications would also benefit
from a similar solution.

For example: "90% of our S3 reads are of blobs that were read in the last 30
days."

Many applications might upload data to S3, then quickly download it for
processing. For those applications, this solution won't work.

> But I never set out to mitigate S3 failures like this.

We know. It's just that some people might read the title and think this is a
more general solution than it is. OP was just warning people against that.

