Dodging S3 Downtime with Nginx and HAProxy (sentry.io)
302 points by zeeg on March 1, 2017 | 123 comments



Heads up: a simple yet production-ready NGINX location block to proxy to a public S3 bucket looks like:

    # matches /s3/*
    location ~* /s3/(.+)$ {
        set $s3_host 's3-us-west-2.amazonaws.com';
        set $s3_bucket 'somebucketname';

        proxy_http_version 1.1;
        proxy_ssl_verify on;
        proxy_ssl_session_reuse on;
        proxy_set_header Connection '';
        proxy_set_header Host $s3_host;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header Authorization '';
        proxy_hide_header x-amz-id-2;
        proxy_hide_header x-amz-request-id;
        proxy_buffering on;
        proxy_intercept_errors on;
        resolver 8.8.4.4 8.8.8.8;
        resolver_timeout 10s;
        proxy_pass https://$s3_host/$s3_bucket/$1;
    }
Adding NGINX caching on top of this is pretty trivial.

Also, heads up: in the proxy_cache_path directive, they should consider the "use_temp_path" parameter. Setting it to off instructs NGINX to write temporary files directly into the directories where they will be cached, avoiding unnecessary copying of data between file systems. use_temp_path was introduced in NGINX 1.7.10 and NGINX Plus R6.

    use_temp_path=off
Also, they should enable "proxy_cache_revalidate". This saves on bandwidth, because the server sends the full item only if it has been modified since the time recorded in the Last-Modified header.

    proxy_cache_revalidate on;
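For illustration, here's a minimal sketch of how those caching pieces might fit together with the location block above (the zone name, sizes, and paths are made up):

    # defined at the http level; caches S3 responses on local disk
    proxy_cache_path /var/cache/nginx/s3 levels=1:2 keys_zone=s3_cache:10m
                     max_size=10g inactive=7d use_temp_path=off;

    # added inside the /s3/ location block from above
    proxy_cache s3_cache;
    proxy_cache_valid 200 7d;
    proxy_cache_revalidate on;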


Noo!!!

This is vulnerable to path traversal attacks. If someone passes a URL such as yoursite/s3/../EVIL_BUCKET/EVIL.js, all of a sudden your site is serving someone else's content. Bad idea. Use virtual-host-style buckets instead, i.e. S3_bucket.S3host/content.
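For example, something along these lines (an untested sketch; bucket name and region are made up) pins the bucket into the hostname, so a crafted path can only ever hit keys inside that one bucket:

    # virtual-host style: the bucket lives in the Host header, not the path
    location ~* ^/s3/(.+)$ {
        set $s3_host 'somebucketname.s3-us-west-2.amazonaws.com';

        proxy_set_header Host $s3_host;
        proxy_set_header Authorization '';
        resolver 8.8.4.4 8.8.8.8;
        proxy_pass https://$s3_host/$1;
    }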


In our case, we don't need to ever revalidate. We store things forever since our file blobs are immutable.


Immutable blobs are really the right choice with S3, as it's eventually consistent (when using it as a blob store, anyway; if you're hosting a static site or similar, it's a bit tricky to immutable-ize and not necessarily worth the effort).


We even go a step further, and our blobs are 100% content addressable. :) So caching is super easy for us.


Yep, file shas are a great choice. UUIDs are typically fine too.

One sort of weird case is if I have an image key (sha-based) and want to store thumbnail sizes: 'bae6ff187e4c491e5de9cfa3b039ce7da8255798' makes sense as a base key, but really I want bae6ff187e4c491e5de9cfa3b039ce7da8255798/400x400 for thumbnails rather than storing individual thumbnail shas, hah.


Or... use CloudFront. It will probably be much cheaper to use CloudFront than the instance scaling required as your traffic increases.


Is CloudFront expected to have better uptime than s3?

One argument for self hosting the proxy is that I don't care if s3 is working when my server is down anyway.


CloudFront, as with most CDNs, has very good uptime.

Remove the C in CAP and you can go far.


Good config, but if you're not defining the proxy in an upstream {} block, you can't make use of the keepalive parameter, which keeps a number of connections to the backend alive at any time, reducing the RTT for an actual request.
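Roughly what that might look like (a sketch with placeholder names; note the DNS caveat in the reply below):

    upstream s3_backend {
        server s3-us-west-2.amazonaws.com:443;
        keepalive 32;    # keep idle connections to the backend open
    }

    location ~* ^/s3/(.+)$ {
        # HTTP/1.1 and an empty Connection header are required for keepalive
        proxy_http_version 1.1;
        proxy_set_header Connection '';
        proxy_set_header Host 's3-us-west-2.amazonaws.com';
        proxy_pass https://s3_backend/somebucketname/$1;
    }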


This is bad for stuff like this because nginx doesn't re-resolve DNS records after process startup. So if an IP address behind the hostname changes, things will just hard stop working. Using it explicitly as a variable coerces nginx into actually resolving DNS regularly to pick up changes like a normal client.


Alternate title: "Replacing S3 downtime for vastly greater amounts of your own downtime"

What is the name for this phenomenon where folks think they can out-available a thing that has multiple engineers singularly dedicated to nothing more than its availability /and/ operation? Is it just hubris? Surely there must be a more clinical name.


Caching proxies are old hat. Things like squid have existed pretty much the entire lifespan of the web. Keeping things close to your servers means you don't have the inherent delay of pulling things from a remote server, regardless of how fast amazon makes things.

Additionally, not keeping all your eggs in Amazon's basket means you're not SOL when they have a datacenter hosting all your content go down. It also means that if and when a service that better fits your needs comes along, you are more readily able to migrate without problems.

Finally, site reliability is not something that takes a team the size of Amazon's. A lot of the things that are required for availability on AWS -- redundant systems providing services, standbys, etc -- are things that sysadmins were doing before AWS was extant. AWS' biggest gift to reliability is that its instances are less stable than most dedicated servers; you're taught from day one not to rely on a single server, so you build it right the first time.

So no, it's not hubris. It's calculating price/performance, it's applying things you're probably already doing to a new problem, and figuring out what the best solution really is, which rarely involves just throwing money at Amazon.


Yeah, one of our goals was to not add a new/weaker point of failure.

We gracefully fall back to S3 directly if our cache server is down without a hiccup. So there is no operational overhead of this additional cog. If the server has a failure, we'd go back to slightly degraded performance by talking across the country until we brought it back online.


The idea that it being a local proxy means there is no operational overhead is a dangerous fallacy.

If that's an earnestly literal statement from you, then it means you simply haven't encountered the failure modes that these kinds of setups are inclined toward.

I've worked at several BigCo's, seen them all implement this pattern, and seen every single one of them have fleet-wide outages due to these innocent "local proxies".

Remember FB's 2-3 hour outage 2 years ago?

https://www.facebook.com/notes/facebook-engineering/more-det...

It was /exactly/ this kind of "local proxy for higher availability/caching over the downstream thing" that caused the outage.


Sure, there's definitely risk. I'm not asserting that it's literally 0 chance. But the risk of this is also tied up with other things that leverage this proxy. So it's not adding a new dependency or a new point of failure. If this has a problem, we'll also have problems talking to other services in our network.

And for what it's worth, I've definitely fucked this up in the past and caused downtime as a result of a setup like this. The pros still outweigh the cons in practice.


You can't add a new thing without adding a new point of failure. Every point is a point of failure.

Who deploys the thing?

A human? God knows they can screw it up.

Automated deployment?

Well, that's how you get a simultaneous failure and total outage.

Automated incremental deployment?

Ok, slower road to total outage.

Automated incremental that will halt itself or rollback based on reliability metrics?

Ok, getting there.

Wait, was the local proxy load tested?

Was it load tested when one of your data centers is down and everything is doing 30% more work?

And on and on and on. It's all operational overhead, it's all ways to fail.

Can you tell I used to work in monitoring? Maybe I just have PTSD now. :P


> You can't add a new thing without adding a new point of failure. Every point is a point of failure.

Correct, but it's an existing process. So you're right, we could ship a blatantly bad config.

> Who deploys the thing?

We do, humans, yes. We can definitely screw up a config.

> Automated deployment?

We tend to do blue/green deploys on critical pieces of infrastructure just to sanity check it. We might even pull a node out of production, test on a staging server, etc.

> Wait, was the local proxy load tested?

Yes. The load we need for this case is not even close to significant.

> Was it load tested when one of your data centers is down and everything is doing 30% more work?

Yes, it's literally just a proxy to S3 doing no additional work. For our traffic, the load is not a concern. Especially since it's running on every machine, it's distributed pretty well. A single box cannot overload our haproxy process compared to the CPU needed to run the Python application itself.

> Can you tell I used to work in monitoring? Maybe I just have PTSD now. :P

DataDog, it's pretty dope. It gives us lots of super good insight into all of these things, and it's what alerted us when haproxy reported S3 down in the first place. It'd also tell us the moment a process like this crashes, etc.


DataDog was my customer...


(To be clear, I'm not saying it's an anti-pattern, I'm just saying that calling it "no operational overhead" is naive)


Mince words all you want, they stayed operational when many others did not. That's success in my book.


For an increased risk of going down when everybody else is up. Is that still success? And at what other costs?

It all comes down to risk vs. cost vs. gain.


You are assuming your haproxy server is always operational, no?


It's running on localhost, so it's not its own machine. It's local to the servers running the application code.


Maybe I missed this in the docs, but why isn't HAProxy considered a new point of failure?


It's running on localhost on each server. So the failure event here is that somehow the haproxy process would explode with the rest of the server being fine. It's much, much more likely that a whole machine will die instead, or that there'll be a network issue between machines, etc.


Sure, that's well understood. Being a low-risk point of failure, isn't it still a new one? It does come with setup and maintenance costs, test scenarios etc., so it's only fair to recognise this as a risk.


Technically yes. But we're pretty accustomed to this level of risk. For something like this, the pros far outweigh the cons involved. Yeah, it could fail. The maintenance overhead of this is absolutely minimal and took a handful of hours to have tested and in production.

Also worth noting, that this isn't really a single point of failure as a system wide thing. It'd only be a single point of failure on that single node. So if haproxy decided to explode, only that one machine would have a problem momentarily, while the process got started back up with our process manager.

The worst case scenario is a human error where we ship a bad config and break everything.


Not really true. If you, for example, mistune maxconn, haproxy will stop accepting new connections, and that's likely to happen cluster-wide.


This is equivalent to shipping bad application code that takes everything down. Except the config is only a handful of lines of code and will very likely never change again. Also, we don't blindly roll out changes cluster wide for things like this without testing explicitly on staging or test nodes.


Shipping it with the app, you lose the cluster-wide cached objects. A SPOF is the resolver. It's Google, but it's a SPOF. Is the failover to S3 automatic? Or do you make a code change? What kind of latency does that add?


Haproxy is localhost. Caching nginx is nearby but not local, so the cache is shared.

Haproxy sends to caching nginx if available, else directly to s3.
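A rough idea of how that fallback can be expressed in HAProxy (a sketch with invented names and addresses, not necessarily their exact config):

    frontend s3_local
        bind 127.0.0.1:10000
        mode http
        default_backend s3_cache

    backend s3_cache
        mode http
        option httpchk HEAD /
        # prefer the nearby caching nginx box
        server nginx_cache 10.0.0.10:80 check
        # no check here: used only when the cache server's check fails
        server s3_direct s3-us-west-2.amazonaws.com:443 ssl verify none backup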


Thx. Got it. I don't know enough about the app, but I would have it serve directly from CF to users, instead of hitting this environment for static assets. Good job and good conversation.


Don't worry, they run HAProxy in front of HAProxy in case the HAProxy-to-S3 service goes down.


> Each application server that’s running our Sentry code has an HAProxy process running on localhost.


I don't necessarily agree with it, but the common response to this is all about timing.

With AWS you don't have any control over "higher risk" times. If you have a massive launch coming up, or you are nearing peak usage for the year, or your clients need you to be stable for the next few months, you can't put updates on hold, you don't know when to get a few more people on standby, and you can't choose to not make changes to your system, because it's not your system.

With an in-house solution you can choose to lock it down for a month, or do the risky upgrades/changes at your lowest-traffic time, or even give your customers a heads up if needed. Hell, even just being able to make sure that your best sysadmin isn't out getting hammered when you go to make changes could go a long way.

There is some merit to that idea, but I personally feel the track record of many of these services is so near perfect that the chance of unexpected downtime is still smaller than most could realistically manage.


Maybe I'm reading this wrong, but to me it looks like their solution has a strict positive impact on reliability unless you are concerned about the local HAProxy node causing a problem. (Local to the running service, that is--it looks like it's on the same box?) It caches or falls through as appropriate, does it not?


This is a correct assessment. HAProxy is running on localhost, and it strictly falls back to hitting public S3 directly if our cache is down.


It's called the real world. Our clients running on dedicated hardware at 2 DCs have consistently fewer issues than those on AWS. The AWS control layer and infrastructure is too complex and results in fairly significant outages.


That's it. I'm running my own dedicated servers for my business and had zero (ZERO!) downtime in the last three years. How much downtime had the big cloud players?


Me too. I've been renting servers for over 10 years. By the way, it's fairly common to find DCs with very high uptime (between 99.999% and 100% over 5 years or more), especially in cities with high connectivity like Ashburn, VA or Dallas, TX. Failures on servers with less than 4 years of use are _really_ unusual.


Well, Dyn kinda sucked for some big players.

Do you host your own DNS?


Every time cloud services have an outage, this line of reasoning becomes less and less appealing.

Your analysis assumes all other factors are constant. Change causes downtime; cloud services have a high rate of change (constantly pushing new configs, new code), while many other servers don't.


This incident lasted for three hours. The last incident was in 2015 and was also a significant amount of time. Sentry can likely provision a new HAProxy node in minutes.

Most importantly, while S3 is relatively stable it's a black box. If you really care about high availability, you want to bring all of the points of failure under your direct control. On the other hand if availability is just a nice-to-have, relying on S3 is probably a better use of time.


> If you really care about high availability, you want to bring all of the points of failure under your direct control.

This is exactly what I'm talking about. People can't accept that trying to control it is not in any way guaranteed to make it more highly available. Bringing points of failure under your control makes no innate guarantee of improving anything. It can easily make it worse.

The only thing it can do is let you blame yourself instead of blaming S3.

You can only be hardened for what you have anticipated or experienced before.

There are tens of thousands of wall-clock hours of operational experience w/ S3. Availability is one of the top concerns across all of AWS.

Thinking you can be more available is just fooling yourself. Believing you are more available will entail a willful ignorance or distortion of metrics.

Do your engineers carry pagers and have a <15 minute engagement time? Do your engineers sit at home when they are on-call because they know they can't simply let a page slide because they were in the middle of dinner? Or is your company more lenient than Amazon when it comes to operations?

Do your engineers spend a quarter fixing some failure mode of your infrastructure, or are they too busy working on features?

Is your team's performance measured by availability of your service? Or your actual core business?


Ironically, in our case, our availability is very core to our business. In this exact scenario, if S3 being down had blocked our processing pipeline, we couldn't have alerted users as reliably that they were, in fact, having issues because of S3 being down as well. So in our case, this is massively important to us and worth any additional risk that may be introduced.


Do you realize that risk spread out over time is what creates your downtime average? So what you are saying is that your uptime is so important that it is worth additional downtime. That's an illogical statement.


You should actually read the blog post. We are strictly increasing availability in addition to S3's already amazing availability. Not adding another point of failure and assuming we're better. In fact, I literally assume I'm shitty, so I build defenses so my fuck ups don't cause any issues.


I still think a lot of this depends on your availability requirements.

I don't think I could run an S3 service at the scale AWS does with higher reliability, but I do believe I can run a pool of redundant HAProxies with higher reliability, and in fact we already run pools of HAProxies for other reasons so we have quite a bit of operational experience with it.

If my company had three engineers I certainly would not go this route, but if you are big enough to have a dedicated ops team that already has experience with this sort of thing, you can architect something that is more reliable than just relying on S3 alone.


Exactly. I'm sorry OP, but with all due respect, you're delusional if you think your strategy would improve upon AWS's availability. The only way you would do that is to deploy a solution like you mentioned on infrastructure that has higher availability than AWS. Hint: that doesn't exist. AWS may have low statistical availability for the month, but on any longer timescale they're still the very best in the game. You need to remember that whatever caused this was a black swan event.


I disagree. In the last few years S3 has had multiple of these black swan events, while the reverse proxies I am talking about have pushed through hundreds of billions of responses and have had significantly fewer incidents. (In this case, none)

I think the fallacy here is that you're not comparing apples to apples: I would be the last to argue that I could run a globally distributed S3 competitor better than Amazon. But I can (and have) run a massively simpler service with better overall uptime because it increases our options during upstream black swan events.


"But I can (and have) run a massively simpler service with better overall uptime because it increases our options during upstream black swan events."

I'm not saying it's impossible. But I am saying it's dangerous to omit from this conversation the idea that introducing the very --point of option-- can cause worse reliability than just using the downstream thing in the first place.


Sure, any new piece of infrastructure we add has the possibility for reducing reliability. We only introduce things we think we can support, and that add value for the company. YMMV.

We also have less than 15 minute incident engagement times, and don't let important pages slide through dinner. It's totally standard ops stuff: if one of the servers is down, we'll replace it when we get around to it. If they're all down, pages are going off.


You missed the major point: we're strictly improving, not thinking we're more available. In fact, I assume that we're less available, hence why we have a fallback from our local proxy to S3 directly, since I assume our single server is going to die more often than S3.


Honestly, I think you've taken advice that is generally true ("service providers that focus on a problem can do it better than you") and applied it as dogma, even when there are plenty of cases where it is flat-out incorrect.


This is mostly unrelated to our original goals, which was reliability and performance. This wasn't to protect ourselves in the event of S3 going down, the timing just worked out that it saved us during that as well while also accomplishing the original goals.


Ah, fortuitous timing! :)

If you were looking to protect yourself against S3 going down and had bigger drives, I assume you could use cron to sync the entire bucket and use `try_files` to prefer local and fall back to S3 if the file was missing?
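Something like this hypothetical nginx block could do it, assuming the bucket is periodically synced into /data/s3-mirror (paths and names are made up):

    location /assets/ {
        root /data/s3-mirror;
        # serve the local copy if it exists, otherwise hand off to S3
        try_files $uri @s3;
    }

    location @s3 {
        set $s3_host 's3-us-west-2.amazonaws.com';
        resolver 8.8.4.4 8.8.8.8;
        proxy_set_header Host $s3_host;
        proxy_pass https://$s3_host/somebucketname$uri;
    }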


Yeah, we could do something like this as well if we cared. We are easily caching 95+% of our active data in a small amount of disk space. It's not that valuable for us to have a complete replica of the full data set.


Sorry, and by reliable, I explicitly mean that the network connection from our datacenter to halfway across the country hiccuped more than I liked and was slower than I liked. Not the reliability of the service itself.


If the caching server fails, the infrastructure automatically switches to use S3 directly.


I think you have never run servers in a good datacenter before.

I know of a datacenter in Denver CO with 14 YEARS of uptime, for instance.


I was thinking the same thing... but if your proxy goes down, couldn't you update code quickly at the app level to go direct to S3 then?

Whereas - if you just rely on S3 directly, if they have problems, there's not much you can do unless you also have all assets locally on your servers.


We do this automatically already with HAProxy. So we don't even have to change our application.


Yes! All of the time! When people talk about moving away to their own datacenter, I like to ask how many of the top engineers in the world will be on-call at any time to monitor it?


Fortunately, I am top engineer and am on-call for what I implement and rely on. :) This is also why I build things in a way that don't add more risk to production infrastructure. If you'd read the post, you'd see that this is a strict improvement without risk on our side from introducing a new dependency.


> What is the name for this phenomenon where folks think they can out-available a thing that has multiple engineers singularly dedicated to nothing more than its availability /and/ operation?

S3 is an object storage system, they're adding a proxy. It's pretty easy to make a proxy that has better uptime than S3 because it's far, far less complex.


S3 is an object storage system. EBS is block storage.


Lots of criticism in this thread by folks who are missing the point.

@mattrobenoit -- neat idea and the 70% savings in bandwidth is awesome. The side effect of helping mitigate the S3 issue for you was a sweet little bonus!


Also, I should note, it's about a 98% savings in bandwidth, but the cost savings were explicitly 70%, since we have to factor in the cost of running this new server.


Thank you. <3


> This proxy service had been running for a week while we watched our bandwidth and S3 bill drop, but we had an unexpected exchange yesterday morning: [pic showing S3 offline]

Talk about great timing!


There have been a lot of articles and suggestions for mitigating S3's single-region architecture (design flaw?) since the outage.

One solution that I haven't seen much of is to just use a service that gives you multi-region without any extra work, such as https://cloud.google.com/storage/docs/storage-classes#multi-...


This doesn't solve our performance issues that I originally set out to address.


The problem with solutions like these is you see global outages, like Google's global VM outage last year.

Also, from the public description this sounds like one big system, i.e. the second "region" may not be a public Google Cloud region. Just as much chance of an outage.


For object sync between on-premise and S3, AWS offers Storage Gateway. It supports IAM, encryption, bandwidth optimization, and local cache for hot objects with availability built in.

https://aws.amazon.com/storagegateway/details/


I'm afraid to know how much that costs.


Pricing is here: https://aws.amazon.com/storagegateway/pricing

As with most services, you pay as you go and only for what you use. The prices are aligned with typical AWS storage, request volume, and data transfer costs.


If you're making range requests against S3 objects, look no further than Varnish as a reverse S3 proxy. More details on accomplishing this here:

https://moz.com/devblog/how-to-cache-http-range-requests/


This is actually a great idea. I like datadoge on the pic. What sort of bot is that? Does anybody know?


That's DataDog. :)


This is a neat way to deal with it, and has numerous other benefits, but I thought I'd add another thing to try: deploying your data in multiple regions. You can set up a secondary bucket in a secondary region, and configure your primary bucket to replicate data to the secondary bucket automatically. And then set up your infrastructure to switch to the secondary bucket for read operations should there be a problem with the primary. With large amounts of data the costs could add up, but it has the advantage that you can still serve all read requests during a regional outage, not just the reads that are in the cache.

Obviously this method has cost associated with it, so you should probably only do this if you need complete data availability.


Or we can do what we did and save money instead and gain the guarantees. Plus achieve our original goal of performance by bringing the bytes inside our datacenter instead of going out through public internet.


Yeah, what I mentioned does nothing to improve performance, just thought I'd mention it as something to think about. For your use case S3 replication doesn't really give you extra availability. In practice I would likely use your caching method, and use S3 replication in addition, if I had a large dataset that doesn't cache well. Or if I absolutely needed to maintain write availability, in which case I would use bi-directional S3 replication.


Definitely. If our dataset was absolutely massive and we couldn't hold a reasonable amount on disk, it'd make more sense. Fortunately, we are getting a 90+% hit ratio out of a very small amount of space relative to the size of our bucket.

And yeah, we have 0 resilience for write data here. Again, fortunately, we can afford this tradeoff since the volume of uploads is significantly lower and much less critical for us.


> configure your primary bucket to replicate data to the secondary bucket automatically

Well, when the primary goes down, your write operations would get busted.

If the upstream source allows it, just write to both buckets at once. E.g. with Logstash this is trivial.


I put this elsewhere in the thread, but you can also set up bi-directional S3 replication. I haven't used it in production, but in theory it would mean you can continue write operations during failure. And those writes that are committed to the primary but not replicated when the region goes down wouldn't be lost, they'd come back up once recovery is complete (S3's SLA for data integrity is a lot stricter than its uptime). Whether or not that is acceptable depends a lot on your use case.


Sure, though in many of S3's use-cases you won't care about writes at all, just read availability. E.g. for S3 buckets serving as canonical binary-asset hosts for CDNs to front.


Have you considered using S3 bucket replication and having your application's logic fail over to the replication target in the event of a regional failure? The former is a checkbox, the latter is 30 minutes of coding (in my experience.)


S3 bucket replication is a bit flawed, and doesn't buy us anything on top of what we implemented. And would cost double for storage. Plus more complex application logic which I wanted to avoid. Tools like haproxy are pretty good at this.

With S3 replication, you still have a primary/replica setup in which only one of them can accept writes, but you can accept reads from both. So we'd gain HA between multiple regions, but we wouldn't solve our original goals: speed. The round trip to S3 was too slow for us.


Very cool. I did read the article but I missed that speed was important, so the custom solution was an excellent choice.

Another idea might be to use Varnish for the caching layer, but I haven't compared Varnish to NginX in many years so the gap has probably been closed now?

Good work. I've stuffed this one in the back pocket for future use.


I have tons of experience with Varnish and a long history there. Varnish is really bad for this since it's memory only. I wanted to use 750GB of disk space, not the 32GB of RAM we had. 750GB of RAM is significantly more expensive.

And in our case, performance between reading some bytes from disk vs memory isn't significant. A disk seek is still many many orders of magnitude faster than a round trip to Amazon.

With that said, Varnish does offer the ability to use mmapped files, but the performance is really appalling out of the box, and just not worth it. Varnish is way better if you want a strictly in-memory cache.

Another benefit of nginx is the cache won't be dumped if the process restarts, unlike Varnish.


nit: S3 has bi-directional replication, so both of the buckets can accept writes.


I'll have to look into this. I wasn't aware. Either way, I don't think that'd replace our current setup since the original intent wasn't to increase our availability. But good to know!


Or just use Google; their S3 equivalent has an option where it's replicated worldwide and available in all zones.


Was this setup useful during the S3 outage yesterday? Wanted to double check before I implement it.


Yes. That was, in fact, the entire point of the blog post.


Right, but I was referring to the comment regarding the S3 bucket replication.


Ah, I can't comment about that since we obviously don't use it, but in theory, yes.


One thing that complicates easy replication is the rules around CNAMEing your S3 bucket to a subdomain you control. You have to name the bucket subdomain.your-domain.com, then CNAME the subdomain to subdomain.your-domain.com.s3.amazonaws.com. Your replicated bucket needs a unique name, so when you want to fail over, it's unfortunately not just a matter of changing a DNS entry to point to the alternate bucket.

Of course, if you have complete control of the client you can just change the hostname. But if you have references scattered throughout html files, you'll likely want a reverse proxy in front of S3.


I've built similar proxy/cache setups with just nginx.

Is HAProxy there to get more insight into things, like the slack notification? Or does it serve another purpose?


It directs traffic conditionally based on whether our cache server is up or not, falling back to S3 directly. In theory, we could use nginx for this, but nginx isn't as good a generic load balancer since it doesn't have as rich insight into the status of upstreams. Also yes, there's the nice benefit of being able to do alerting based on uptimes and whatnot. We already have haproxy running here, so it was just adding another frontend to support this.


Thanks for the info.

Is the next step to locally cache uploads so that uploading works while S3 is down?


That's another problem that'd require much more work than just a few hours of hacking around. :)


Another good option is to store your files in a second cloud storage service, and use CDN or DNS failover.


This would unfortunately be slower, since it's not within our datacenter and on our network, and significantly more expensive. Probably at least 100x more expensive for the bandwidth than S3 by itself.


Cloud services like S3 are making programmers too complacent these days.

Back in the day, every production web site had something like this "HAProxy" in place.

Now, S3 goes down, and everybody has their thumbs up their asses because that wasn't covered in the Rails bootcamp they went to.


If the client sends incorrect Authorization you will still serve from the cache. This is insecure.


This is a trusted, internal, private network. The only one who could do this is the application itself, or something rogue on our network. If something were running rogue on our network, there'd be worse things it could get access to.


I see, if this is a private network then this is a nice, simple solution for caching. We plan to implement S3 caching in minio [https://minio.io] (i.e. it will authenticate the requests and also do caching), in case you'd be interested in public-facing caching proxies.


Yep, it's definitely possible to go this route as well. We just didn't have to.


How does authorization and access control interact with the proxy? Does it first check authorization with S3 and cache the result, use a parallel ACL, or just allow access to anything by anyone?


Good question. Authorization is passed along upstream to S3, but we don't re-check authorization when serving a cache hit. In our case, this is a fine tradeoff since our network is private and trusted.


To be more clear, there are other things we could do here if we didn't inherently trust our network and the things running there.
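For instance, one purely hypothetical mitigation (not something described in the post) is partitioning the nginx cache by credential, so a hit is only ever served back to a request carrying the same Authorization header that populated it:

    # include the Authorization header in the cache key
    proxy_cache_key "$scheme$proxy_host$request_uri$http_authorization";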


Thanks, I understand.


Neat. You should be saving bandwidth costs also.


As mentioned, we are saving a lot in S3 transfer cost. :)


Can you disclose which provider you use to host the servers? Do you have fixed bandwidth, or do you pay per GiB of transfer?


SoftLayer. All internal bandwidth is free. Public network is a fixed allocation of bytes per server, then overages.


Essentially a fluke and not a general case people should follow.


Disagree. Considering the original goal had nothing to do with expecting S3 to go down, it just happened to be super useful during this incident. We get more benefits even without S3 breaking.


OP is a little abrupt, but it isn't downplaying the system's usefulness to you guys; just warning other readers that this isn't a general solution for, as the post is titled, "dodging S3 downtime". It's a "fluke" precisely because it "just happened" to help with something it wasn't designed for.

It's not uncommon for applications to have MRU access patterns and be able to keep partially functioning during partial data availability. For these applications, a cache will lower costs and mitigate S3 outages. It would have been nice for the article to give the criteria up front.


Is paragraph 2 of the post not upfront enough? I clearly stated our goals of the project, and the fact that this also saves us through the incident was a great side effect. But I never set out to mitigate S3 failures like this.


> Is paragraph 2 of the post not upfront enough? I clearly stated our goals of the project

That paragraph says your connection to S3 is slow. Your solution doesn't fix general S3 slowness. However, it does make S3 slowness less of a problem for your application.

That's why it would have been useful to describe how your application accesses S3, so others can quickly determine if their applications would also benefit from a similar solution.

For example: "90% of our S3 reads are of blobs that were read in the last 30 days."

Many applications might upload data to S3, then quickly download it for processing. For those applications, this solution won't work.

> But I never set out to mitigate S3 failures like this.

We know. It's just that some people might read the title and think this is a more general solution than it is. OP was just warning people against that.



