Adding NGINX caching on top of this is pretty trivial.
Also, heads up: in the proxy_cache_path directive, they should consider setting use_temp_path to off. With it off, NGINX writes temporary files directly into the directories where they will be cached, which avoids unnecessary copying of data between file systems. use_temp_path was introduced in NGINX 1.7.10 and NGINX Plus R6.
use_temp_path=off
Also, they should enable "proxy_cache_revalidate". This saves on bandwidth, because the server sends the full item only if it has been modified since the time recorded in the Last-Modified header.
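For reference, a minimal sketch of how those two directives fit together (the path, zone name, sizes, and bucket host here are placeholders, not the config from the post):

    # Write temp files straight into the cache directory (use_temp_path=off)
    # to avoid an extra copy between file systems.
    proxy_cache_path /var/cache/nginx/s3 levels=1:2 keys_zone=s3_cache:10m
                     max_size=500g inactive=30d use_temp_path=off;

    server {
        listen 8080;

        location / {
            proxy_cache s3_cache;
            # Cache 200s for 30 days when the upstream sends no cache headers.
            proxy_cache_valid 200 30d;
            # On expiry, revalidate with a conditional GET (If-Modified-Since /
            # If-None-Match) instead of re-downloading the full object.
            proxy_cache_revalidate on;
            proxy_pass https://example-bucket.s3.amazonaws.com;
        }
    }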
This is vulnerable to path traversal attacks. If someone passes a URL such as yoursite.com/s3/../EVIL_BUCKET/EVIL.js, all of a sudden your site is serving someone else's content. Bad idea. Use virtual-host-style buckets instead, i.e. S3_bucket.S3host/content.
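To make that concrete, a rough nginx sketch (bucket name is a placeholder) with the bucket pinned in the hostname instead of being taken from the request path:

    location /s3/ {
        # The bucket is fixed in the hostname (virtual-host style), so a
        # crafted "/s3/../other-bucket/evil.js" path can't escape into
        # someone else's bucket.
        proxy_set_header Host example-bucket.s3.amazonaws.com;
        proxy_pass https://example-bucket.s3.amazonaws.com/;
    }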
Immutable blobs are really the right choice with S3, since it's eventually consistent (when using it as a blob store, anyway; if you're hosting a static site or similar, it's a bit tricky to make everything immutable and not necessarily worth the effort).
Yep, file shas are a great choice. UUIDs are typically fine too.
One sort of weird case is if I have an image key (sha-based) and want to store thumbnail sizes:
'bae6ff187e4c491e5de9cfa3b039ce7da8255798' makes sense as a base key, but really I want bae6ff187e4c491e5de9cfa3b039ce7da8255798/400x400 for thumbnails rather than storing individual thumbnail shas, hah.
Good config, but if you're not defining the proxy in an upstream {} block, you can't make use of the keepalive directive, which keeps a number of idle connections to the backend open at any time, saving the connection-setup round trips on an actual request.
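Roughly something like this (upstream name and bucket host are placeholders); note that keepalive only takes effect if the proxied requests use HTTP/1.1 with a cleared Connection header:

    upstream s3_backend {
        server example-bucket.s3.amazonaws.com:443;
        keepalive 32;  # keep up to 32 idle connections open to the backend
    }

    server {
        listen 8080;

        location / {
            proxy_pass https://s3_backend;
            proxy_set_header Host example-bucket.s3.amazonaws.com;
            proxy_ssl_server_name on;  # send SNI to the upstream
            # Required for upstream keepalive to actually be used.
            proxy_http_version 1.1;
            proxy_set_header Connection "";
        }
    }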
The catch with an upstream block for something like this is that nginx doesn't re-resolve DNS records after process startup. So if an IP address behind the hostname changes, things will just hard stop working. Referencing the hostname through a variable (with a resolver configured) coerces nginx into actually resolving DNS regularly to pick up changes like a normal client.
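A quick sketch of that variable trick (the bucket host is a placeholder; use whatever resolver you normally would):

    # A resolver is required once proxy_pass uses a variable; valid= caps how
    # long nginx trusts a looked-up address before re-resolving.
    resolver 8.8.8.8 valid=60s;

    server {
        listen 8080;

        location / {
            # Going through a variable forces a runtime DNS lookup instead of
            # a one-time resolution at startup.
            set $s3_host "example-bucket.s3.amazonaws.com";
            proxy_pass https://$s3_host;
        }
    }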
Alternate title: "Replacing S3 downtime for vastly greater amounts of your own downtime"
What is the name for this phenomenon where folks think they can out-available a thing that has multiple engineers singularly dedicated to nothing more than its availability /and/ operation? Is it just hubris? Surely there must be a more clinical name.
Caching proxies are old hat. Things like Squid have existed for pretty much the entire lifespan of the web. Keeping things close to your servers means you don't have the inherent delay of pulling things from a remote server, regardless of how fast Amazon makes things.
Additionally, not keeping all your eggs in Amazon's basket means you're not SOL when they have a datacenter hosting all your content go down. It also means that if and when a service that better fits your needs comes along, you are more readily able to migrate without problems.
Finally, site reliability is not something that takes a team the size of Amazon's. A lot of the things that are required for availability on AWS -- redundant systems providing services, standbys, etc -- are things that sysadmins were doing before AWS was extant. AWS' biggest gift to reliability is that its instances are less stable than most dedicated servers; you're taught from day one not to rely on a single server, so you build it right the first time.
So no, it's not hubris. It's calculating price/performance, it's applying things you're probably already doing to a new problem, and figuring out what the best solution really is, which rarely involves just throwing money at Amazon.
Yeah, one of our goals was to not add a new/weaker point of failure.
We gracefully fall back to S3 directly, without a hiccup, if our cache server is down. So there is no operational overhead from this additional cog. If the server has a failure, we'd go back to slightly degraded performance, talking across the country until we brought it back online.
The idea that being a local proxy means there is no operational overhead is a dangerous fallacy.
If that's an earnestly literal statement from you, then it means you simply haven't encountered the failure modes that these kinds of setups are prone to.
I've worked at several BigCo's, seen them all implement this pattern, and seen every single one of them have fleet-wide outages due to these innocent "local proxies".
Sure, there's definitely risk. I'm not asserting that it's literally zero chance. But the risk of this is also tied up with other things that leverage this proxy. So it's not adding a new dependency or a new point of failure. If this has a problem, we'll also have problems talking to other services in our network.
And for what it's worth, I've definitely fucked this up in the past and caused downtime as a result of a setup like this. The pros still outweigh the cons in practice.
> You can't add a new thing without adding a new point of failure. Every point is a point of failure.
Correct, but it's an existing process. So you're right, we could ship a blatantly bad config.
> Who deploys the thing?
We do, humans, yes. We can definitely screw up a config.
> Automated deployment?
We tend to do blue/green deploys on critical pieces of infrastructure just to sanity check it. We might even pull a node out of production, test on a staging server, etc.
> Wait, was the local proxy load tested?
Yes. The load we need for this case is not even close to significant.
> Was it load tested when one of your data centers is down and everything is doing 30% more work?
Yes, it's literally just a proxy to S3 doing no additional work. For our traffic, the load is not a concern. Especially since it's running on every machine, it's distributed pretty well. A single box cannot overload our haproxy process compared to the CPU needed to run the Python application itself.
> Can you tell I used to work in monitoring? Maybe I just have PTSD now. :P
DataDog, it's pretty dope. It gives us lots of super good insight into all of these things, and is what alerted us in the first place because haproxy reported S3 down. It'd also tell us the moment a process like this crashes, etc.
It's running on localhost on each server. So the failure event here is that somehow the haproxy process would explode while the rest of the server is fine. It's much, much more likely that a whole machine dies instead, or that there's a network issue between machines, etc.
Sure, that's well understood. Even as a low-risk point of failure, isn't it still a new one? It does come with setup and maintenance costs, test scenarios, etc., so it's only fair to recognise this as a risk.
Technically yes. But we're pretty accustomed to this level of risk. For something like this, the pros far outweigh the cons involved. Yeah, it could fail. The maintenance overhead of this is absolutely minimal, and it took a handful of hours to get it tested and into production.
Also worth noting that this isn't really a single point of failure system-wide. It'd only be a single point of failure on that single node. So if haproxy decided to explode, only that one machine would have a problem momentarily while the process got started back up by our process manager.
The worst case scenario is a human error where we ship a bad config and break everything.
This is equivalent to shipping bad application code that takes everything down. Except the config is only a handful of lines of code and will very likely never change again. Also, we don't blindly roll out changes cluster wide for things like this without testing explicitly on staging or test nodes.
Shipping it with the app, you lose the cluster-wide cached objects. A SPOF is the resolver; it's Google, but it's a SPOF. Is the failover to S3 automatic, or do you make a code change? What kind of latency does that add?
Thanks, got it. I don't know enough about the app, but I would have it serve static assets directly from CF to users instead of hitting this environment. Good job and good conversation.
I don't necessarily agree with it, but the common response to this is all about timing.
With AWS you don't have any control over "higher risk" times. If you have a massive launch coming up, or you're nearing peak usage for the year, or your clients need you to be stable for the next few months, you can't put updates on hold, you don't know when to put a few more people on standby, and you can't choose not to make changes to your system, because it's not your system.
With an in-house solution you can choose to lock it down for a month, do the risky upgrades/changes at your lowest-traffic time, or even give your customers a heads up if needed. Hell, even just being able to make sure that your best sysadmin isn't out getting hammered when you go to make changes could go a long way.
There is some merit to that idea, but I personally feel the track record of many of these services is so near perfect that the chance of unexpected downtime is still smaller than most could realistically manage.
Maybe I'm reading this wrong, but to me it looks like their solution has a strict positive impact on reliability unless you are concerned about the local HAProxy node causing a problem. (Local to the running service, that is--it looks like it's on the same box?) It caches or falls through as appropriate, does it not?
It's called the real world: our clients running on dedicated hardware at 2 DCs have consistently fewer issues than those on AWS. The AWS control layer and infrastructure are too complex and result in fairly significant outages.
That's it. I'm running my own dedicated servers for my business and have had zero (ZERO!) downtime in the last three years. How much downtime have the big cloud players had?
Me too. I've been renting servers for over 10 years. By the way, it's fairly common to find DCs with very high uptime (between 99.999% and 100% over 5 years or more), especially in cities with high connectivity like Ashburn, VA or Dallas, TX. Failures on servers with less than 4 years of use are _really_ unusual.
Every time cloud services have an outage, this line of reasoning becomes less and less appealing.
Your analysis assumes all other factors are constant. Change causes downtime; cloud services have a high rate of change (constantly pushing new configs and new code), while many other servers don't.
This incident lasted for three hours. The last incident was in 2015 and was also a significant amount of time. Sentry can likely provision a new HAProxy node in minutes.
Most importantly, while S3 is relatively stable it's a black box. If you really care about high availability, you want to bring all of the points of failure under your direct control. On the other hand if availability is just a nice-to-have, relying on S3 is probably a better use of time.
> If you really care about high availability, you want to bring all of the points of failure under your direct control.
This is exactly what I'm talking about. People can't accept that trying to control it is not in any way guaranteed to make it more highly available. Bringing points of failure under your control makes no innate guarantee of improving anything. It can easily make it worse.
The only thing it guarantees is that you get to blame yourself instead of blaming S3.
You can only be hardened for what you have anticipated or experienced before.
There are tens of thousands of wall-clock hours of operational experience with S3. Availability is one of the top concerns across all of AWS.
Thinking you can be more available is just fooling yourself. Believing you are more available will entail a willful ignorance or distortion of metrics.
Do your engineers carry pagers and have a <15 minute engagement time? Do your engineers sit at home when they are on-call because they know they can't simply let a page slide because they were in the middle of dinner? Or is your company more lenient than Amazon when it comes to operations?
Do your engineers spend a quarter fixing some failure mode of your infrastructure, or are they too busy working on features?
Is your team's performance measured by availability of your service? Or your actual core business?
Ironically, in our case, our availability is very core to our business. In this exact scenario, if S3 would have blocked our processing pipeline because it was down, that means we couldn't have alerted users as reliably that they were in fact, having issues because of S3 being down as well. So in our case, this is massively important to us and worth any additional risk that may be introduced.
Do you realize that risk spread out over time is what creates your downtime average? So what you are saying is that your uptime is so important that it is worth additional downtime. That's an illogical statement.
You should actually read the blog post. We are strictly increasing availability in addition to S3's already amazing availability. Not adding another point of failure and assuming we're better. In fact, I literally assume I'm shitty, so I build defenses so my fuck ups don't cause any issues.
I still think a lot of this depends on your availability requirements.
I don't think I could run an S3 service at the scale AWS does with higher reliability, but I do believe I can run a pool of redundant HAProxies with higher reliability, and in fact we already run pools of HAProxies for other reasons so we have quite a bit of operational experience with it.
If my company had three engineers I certainly would not go this route, but if you are big enough to have a dedicated ops team that already has experience with this sort of thing, you can architect something that is more reliable than just relying on S3 alone.
Exactly. I'm sorry OP, but with all due respect, you're delusional if you think your strategy would improve upon AWS's availability. The only way you would do that is to deploy a solution like you mentioned on infrastructure that has higher availability than AWS. Hint: that doesn't exist. AWS may have low statistical availability for the month, but on any longer timescale they're still the very best in the game. You need to remember that whatever caused this was a black swan event.
I disagree. In the last few years S3 has had multiple of these black swan events, while the reverse proxies I am talking about have pushed through hundreds of billions of responses and have had significantly fewer incidents. (In this case, none)
I think the fallacy here is that you're not comparing apples to apples: I would be the last to argue that I could run a globally distributed S3 competitor better than Amazon. But I can (and have) run a massively simpler service with better overall uptime because it increases our options during upstream black swan events.
"But I can (and have) run a massively simpler service with better overall uptime because it increases our options during upstream black swan events."
I'm not saying it's impossible. But I am saying it's dangerous to omit from this conversation the idea that introducing that very point of optionality can cause worse reliability than just using the downstream thing in the first place.
Sure, any new piece of infrastructure we add has the possibility for reducing reliability. We only introduce things we think we can support, and that add value for the company. YMMV.
We also have less than 15 minute incident engagement times, and don't let important pages slide through dinner. It's totally standard ops stuff: if one of the servers is down, we'll replace it when we get around to it. If they're all down, pages are going off.
You missed the major point where we're strictly improving, not thinking we're more available. In fact, I assume that we're less available, hence why we have a fallback from our local cache to S3 directly, since I assume our single server is going to die more often than S3.
Honestly, I think you've taken advice that is generally true ("service providers that focus on a problem can do it better than you") and applied it as dogma, even when there are plenty of cases where it is flat out incorrect.
This is mostly unrelated to our original goals, which was reliability and performance. This wasn't to protect ourselves in the event of S3 going down, the timing just worked out that it saved us during that as well while also accomplishing the original goals.
If you were looking to protect yourself against S3 going down and had bigger drives, I assume you could use cron to sync the entire bucket and use `try_files` to prefer local and fall back to S3 if the file was missing?
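Something along those lines, purely as a sketch with placeholder paths and bucket name: a cron job keeps a local mirror, and nginx prefers the local copy via try_files, proxying to S3 only on a miss.

    # Assumes a cron entry keeps the mirror fresh, e.g.:
    #   0 * * * *  aws s3 sync s3://example-bucket /var/data/s3-mirror

    server {
        listen 8080;
        root /var/data/s3-mirror;

        location / {
            # Serve the locally synced file if it exists, otherwise fall
            # through to S3.
            try_files $uri @s3;
        }

        location @s3 {
            proxy_set_header Host example-bucket.s3.amazonaws.com;
            proxy_pass https://example-bucket.s3.amazonaws.com;
        }
    }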
Yeah, we could do something like this as well if we cared. We are easily caching 95+% of our active data in a small amount of disk space. It's not that valuable for us to have a complete replica of the full data set.
Sorry, and by reliable, I explicitly mean, the network connection from our datacenter to halfway across the country hiccuped more than I liked and was slower than I liked. Not reliability of the service itself.
Yes! All of the time! When people talk about moving away to their own datacenter, I like to ask how many of the top engineers in the world will be on-call at any time to monitor it?
Fortunately, I am a top engineer and am on-call for what I implement and rely on. :) This is also why I build things in a way that doesn't add more risk to production infrastructure. If you'd read the post, you'd see that this is a strict improvement, without the risk on our side of introducing a new dependency.
> What is the name for this phenomenon where folks think they can out-available a thing that has multiple engineers singularly dedicated to nothing more than its availability /and/ operation?
S3 is an object storage system, they're adding a proxy. It's pretty easy to make a proxy that has better uptime than S3 because it's far, far less complex.
Lots of criticism in this thread by folks who are missing the point.
@mattrobenoit -- neat idea and the 70% savings in bandwidth is awesome. The side effect of helping mitigate the S3 issue for you was a sweet little bonus!
Also, I should note, it's about a 98% savings in bandwidth, but the cost savings specifically was 70%, since we have to factor in the cost of running this new server.
> This proxy service had been running for a week while we watched our bandwidth and S3 bill drop, but we had an unexpected exchange yesterday morning: [pic showing S3 offline]
The problem with solutions like these is that you still see global outages, like Google's global VM outage last year.
Also, from the public description this sounds like one big system, i.e. the second "region" may not be a public Google Cloud region. Just as much chance of an outage.
For object sync between on-premise and S3, AWS offers Storage Gateway. It supports IAM, encryption, bandwidth optimization, and local cache for hot objects with availability built in.
As with most services, you pay as you go and only for what you use. The prices are aligned with typical AWS storage, request volume, and data transfer costs.
This is a neat way to deal with it, and has numerous other benefits, but I thought I'd add another thing to try: deploying your data in multiple regions. You can set up a secondary bucket in a secondary region, and configure your primary bucket to replicate data to the secondary bucket automatically. And then set up your infrastructure to switch to the secondary bucket for read operations should there be a problem with the primary. With large amounts of data the costs could add up, but it has the advantage that you can still serve all read requests during a regional outage, not just the reads that are in the cache.
Obviously this method has cost associated with it, so you should probably only do this if you need complete data availability.
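For illustration, one way the read failover could be wired up at a proxy layer; bucket names and regions are placeholders, and doing the switch in application code works just as well.

    server {
        listen 8080;

        location / {
            proxy_pass https://primary-bucket.s3.us-east-1.amazonaws.com;
            # Treat upstream 5xx responses as errors so error_page applies.
            proxy_intercept_errors on;
            error_page 500 502 503 504 = @replica;
        }

        location @replica {
            # Read-only fallback to the cross-region replication target.
            proxy_pass https://primary-bucket-replica.s3.us-west-2.amazonaws.com;
        }
    }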
Or we can do what we did, save money instead, and gain the guarantees. Plus we achieve our original goal of performance by bringing the bytes inside our datacenter instead of going out over the public internet.
Yeah, what I mentioned does nothing to improve performance, just thought I'd mention it as something to think about. For your use case S3 replication doesn't really give you extra availability. In practice I would likely use your caching method, and use S3 replication in addition, if I had a large dataset that doesn't cache well. Or if I absolutely needed to maintain write availability, in which case I would use bi-directional S3 replication.
Definitely. If our dataset was absolutely massive and we couldn't hold a reasonable amount on disk, it'd make more sense. Fortunately, we are getting a 90+% hit ratio out of a very small amount of space relative to the size of our bucket.
And yeah, we have zero resilience for write data here. Again, fortunately, we can afford this tradeoff since the volume of uploads is significantly lower and much less critical for us.
I put this elsewhere in the thread, but you can also set up bi-directional S3 replication. I haven't used it in production, but in theory it would mean you can continue write operations during a failure. And those writes that are committed to the primary but not yet replicated when the region goes down wouldn't be lost; they'd come back once recovery is complete (S3's SLA for data integrity is a lot stricter than its uptime). Whether or not that is acceptable depends a lot on your use case.
Sure, though in many of S3's use-cases you won't care about writes at all, just read availability. E.g. for S3 buckets serving as canonical binary-asset hosts for CDNs to front.
Have you considered using S3 bucket replication and having your application's logic fail over to the replication target in the event of a regional failure? The former is a checkbox, the latter is 30 minutes of coding (in my experience.)
S3 bucket replication is a bit flawed, and doesn't buy us anything on top of what we implemented. And would cost double for storage. Plus more complex application logic which I wanted to avoid. Tools like haproxy are pretty good at this.
With S3 replication, you still have a primary/replica setup in which only one of them can accept writes, but you can accept reads from both. So we'd gain HA between multiple regions, but we wouldn't solve our original goal: speed. The round trip to S3 was too slow for us.
Very cool. I did read the article but I missed that speed was important, so the custom solution was an excellent choice.
Another idea might be to use Varnish for the caching layer, but I haven't compared Varnish to nginx in many years, so the gap has probably been closed now?
Good work. I've stuffed this one in the back pocket for future use.
I have tons of experience with Varnish and a long history there. Varnish is really bad for this since it's memory only. I wanted to use 750GB of disk space, not the 32GB of RAM we had. 750GB of RAM is significantly more expensive.
And in our case, performance between reading some bytes from disk vs memory isn't significant. A disk seek is still many many orders of magnitude faster than a round trip to Amazon.
With that said, Varnish does offer the ability to use mmapped files, but the performance is really appalling out of the box, and just not worth it. Varnish is way better if you want a strictly in-memory cache.
Another benefit of nginx is the cache won't be dumped if the process restarts, unlike Varnish.
I'll have to look into this. I wasn't aware. Either way, I don't think that'd replace our current setup since the original intent wasn't to increase our availability. But good to know!
One thing that complicates easy replication is the rules around CNAMEing your S3 bucket to a subdomain you control. You have to name the bucket subdomain.your-domain.com (so its endpoint becomes subdomain.your-domain.com.s3.amazonaws.com), then CNAME the subdomain to s3.amazonaws.com. Your replicated bucket needs a unique name, so when you want to fail over, it's unfortunately not just a matter of changing a DNS entry to point to the alternate bucket.
Of course, if you have complete control of the client you can just change the hostname. But if you have references scattered throughout html files, you'll likely want a reverse proxy in front of S3.
It directs traffic conditionally based on whether our cache server is up or not, falling back to S3 directly. In theory, we could use nginx for this, but nginx isn't as good a generic load balancer since it doesn't have as rich insight into the status of upstreams. Also, yes, the nice benefit of being able to do alerting based on uptimes and whatnot. We already have haproxy running here, so it was just a matter of adding another frontend to support this.
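As a rough illustration of how that haproxy piece can look (names, addresses, and the health check are placeholders, not the actual config): prefer the local nginx cache and fall back to S3 directly when the cache fails its checks.

    backend s3_objects
        mode http
        option httpchk HEAD /
        # Local nginx cache is preferred while its health check passes.
        server cache nginx-cache.internal:80 check
        # S3 direct is only used when the cache server is down.
        server s3_direct s3.amazonaws.com:443 ssl verify none backup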
This would unfortunately be slower, since it's not within our datacenter and on our network, and significantly more expensive. Probably at least 100x more expensive for the bandwidth than S3 by itself.
This is a trusted, internal, private network. The only one who could do this is the application itself, or something rogue on our network. If something were running rogue on our network, there'd be worse things it could get access to.
I see, if this is a private network then this is a nice, simple solution for caching. We plan to implement S3 caching in minio [https://minio.io] (i.e. it will authenticate the requests and also do caching) in case you'd be interested in public-facing caching proxies.
How does authorization and access control interact with the proxy? Does it first check authorization with S3 and cache the result, use a parallel ACL, or just allow access to anything by anyone?
Good question. Authorization is passed along upstream to S3, but we don't re-check authorization when serving a cache hit. In our case, this is a fine tradeoff since our network is private and trusted.
Disagree. Considering the original goal had nothing to do with expecting S3 to go down, it just happened to be super useful during this incident. We get more benefits even without S3 breaking.
OP is a little abrupt, but they aren't downplaying the system's usefulness to you guys; they're just warning other readers that this isn't a general solution for, as the post is titled, "dodging S3 downtime". It's a "fluke" precisely because it "just happened" to help with something it wasn't designed for.
It's not uncommon for applications to have MRU access patterns and be able to keep partially functioning during partial data availability. For these applications, a cache will lower costs and mitigate S3 outages. It would have been nice for the article to give the criteria up front.
Is paragraph 2 of the post not upfront enough? I clearly stated our goals of the project, and the fact that this also saves us through the incident was a great side effect. But I never set out to mitigate S3 failures like this.
> Is paragraph 2 of the post not upfront enough? I clearly stated our goals of the project
That paragraph says your connection to S3 is slow. Your solution doesn't fix general S3 slowness. However, it does make S3 slowness less of a problem for your application.
That's why it would have been useful to describe how your application accesses S3, so others can quickly determine if their applications would also benefit from a similar solution.
For example: "90% of our S3 reads are of blobs that were read in the last 30 days."
Many applications might upload data to S3, then quickly download it for processing. For those applications, this solution won't work.
> But I never set out to mitigate S3 failures like this.
We know. It's just that some people might read the title and think this is a more general solution than it is. OP was just warning people against that.