From the article: "Elastic Load Balancer (ELB): [...] It scales without your doing anything. If it sees additional traffic it scales behind the scenes both horizontally and vertically. You don’t have to manage it. As your applications scales so is the ELB."
From Amazon's ELB documentation: "Pre-Warming the Load Balancer: [...] In certain scenarios, such as when flash traffic is expected [...] we recommend that you contact us to have your load balancer "pre-warmed". We will then configure the load balancer to have the appropriate level of capacity based on the traffic that you expect. We will need to know the start and end dates of your tests or expected flash traffic, the expected request rate per second and the total size of the typical request/response that you will be testing."
Suffice it to say, I went elsewhere.
10000000000 / (60 * 60 * 24 * 30) = 3,858 req/sec. That's a pretty good clip.
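That figure is easy to re-check. A minimal sketch, assuming a 30-day month and perfectly even traffic (real traffic will peak well above the average); the constant names are just for illustration:

```python
# Average request rate for 10 billion requests spread evenly over one month.
REQUESTS_PER_MONTH = 10_000_000_000
SECONDS_PER_MONTH = 60 * 60 * 24 * 30  # assumes a 30-day month

avg_rps = REQUESTS_PER_MONTH / SECONDS_PER_MONTH
print(f"{avg_rps:,.0f} req/sec")  # prints "3,858 req/sec"
```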
I'm a sysadmin with over 10 years of Linux experience, so setting up and supporting servers is pretty trivial for me.
The agreement I had with the supplier: they managed the network and hardware 24/7, and I managed the setup and support of the servers from the OS up. This arrangement worked well and I had zero downtime.
This doesn't get mentioned as much as it should but there are VPS/dedicated providers who are very close to AWS DCs.
Enough so that for many use cases you should have your database in AWS and your app servers on dedicated hardware. Best of both worlds.
I don't mind. I went with self hosting. I found a supplier which had their own scalable infrastructure.
If you want to know the supplier, they are called Mojohost.
Most people aren't worried about squeezing another 3% performance out of their servers. In fact I would say the slice-and-dice nature of VMs allows for better overall capacity usage because of over provisioning of resources. How many apps do you know that hover at 0.07 load all day long?
"If you're willing to pay up to a 40% premium for the features cloud providers provide, pay them. If not, go bare metal."
I've heard of many start-up companies save plenty of money using dedicated hosting even without any operations / sysadmin pros around scaling to millions of users when the equivalent in AWS with relatively anemic nodes fared much better. In fact, WhatsApp only had a handful of physical servers handling billions of real users and associated internal messaging and they had developers as the on-call operations engineers.
I'm an ops engineer / developer and I'd use dedicated hosting if success depends a lot upon infrastructure costs. For example, if I started a competitor to Heroku at the same time they did, I'd definitely be having a very careful debate between dedicated / colo hosting and using a cloud provider tied intimately with my growth plans. Many companies have shockingly bad operations practices but achieve decent availability (and more importantly for most situations, profitability) just fine, so even the often-cited expectations of better networks and availability zones may be worth the risks of not caring that much.
Unlike many cloud providers, AWS can be set up to serve a billion requests, but you need to think that mess out from start to end. You can't set up an ELB, turn on auto scale and then go out to lunch.
Also, as another comment here says, I believe a billion "users" is more like "requests" as users is vague and undefined. A single person could launch 1 or 100 requests depending on the app.
Why not work with AWS to mitigate such risks now that you know more about ELBs?
I agree HAProxy is solid, but ELBs are wonderful for internal microservices.
If you do decide to use HAProxy for microservices internally, I highly recommend Synapse from Airbnb: https://github.com/airbnb/synapse
So if you expect flash traffic, you just bump up your provisioned throughput. Simple and transparent.
It's not perfect, but works in a pinch.
The issue appeared to have been that the private IP workers all would transit the NAT box to get to the public service and the ELB seemed to act strangely when 99.99% of the traffic was coming from one IP address. The private ELB saw requests from each of the individual IP addresses of the workers and acted a lot better. Or something.
ELB (and most other managed service load balancers) are overpriced and not great at what they do. The advantage with them is easier setup and lack of maintenance.
If you're running a service with hundreds of millions or billions of requests, it's just far more effective in every way to use some small load balancing instances instead. Their Route53 service makes the DNS part easy enough with health checks.
Most people (and apps) likely won't hit this scale so ELB is just fine. If you do though, ELB is just pricey and not really that great.
Their whole position on them is super opaque and prewarming is still an issue.
I'll write more about this later, but so many people have had outages due to aws' inability to properly size these things.
RDS is great, but only to a certain level. You'll still need to pull it off RDS once you reach that service's capacity (much sooner than their 10m user mark). They also keep pushing Aurora, but without telling us what the tradeoffs are for the high availability. Based on the responses so far (MySQL backed by InnoDB), it appears to be based on a technology similar to Galera, which has a lot of caveats for its use, especially with multiple writers.
Don't depend on Elastic Scaling for high availability - when an AZ is having issues, the AWS API will either be down or swamped, so you want to have at least 50% extra capacity at all times, if you want high availability.
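The comment doesn't say where the 50% comes from. One way to arrive at it, as a rough sketch: assume load is spread evenly across three AZs and that nothing new can be launched while one is down. The function name is made up for illustration.

```python
def headroom_needed(az_count: int) -> float:
    """Extra capacity (as a fraction of peak load) needed to survive losing one AZ."""
    if az_count < 2:
        raise ValueError("need at least two AZs to survive an AZ failure")
    # After losing one AZ, the remaining (az_count - 1) zones must carry 100% of peak.
    return az_count / (az_count - 1) - 1


for n in (2, 3, 4):
    print(f"{n} AZs: {headroom_needed(n):.0%} extra capacity")
```

With two AZs the same reasoning asks for 100% extra, so the 50% figure only holds if you are already in three.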
Using their scaling numbers, your costs start spiking at 10 users. Realistically, with intelligent caching (even something as simple as Nginx caching), you can easily support several thousand users just fine with a t2 style instance, either a small or micro. Splitting services onto different hosts not only increases your hosting costs, it increases the workload on your developers/admins and the likelihood of failure.
DR: Don't wait until you have over a thousand users to have multiple instances in different AZs. The cost of duplicating a t2.small across an AZ is small compared to lost users or sales.
Automation: Be prepared for vendor lock-in if you use Amazon's solutions. Also be prepared for their APIs being unavailable during times of high load or during AZ failures.
> Lambda [...] We’ve done away with EC2. It scales out for you and there’s no OS to manage.
The biggest problem with Lambda right now is the huge latency cost of cold Lambda instances. You'll get pretty good 95th-percentile response times, but the other 5% will be off-the-chart bad.
In summary, AWS has a lot of great toys, and can absolutely be used for scaling up to silly levels. However, most who have done this degree of scaling do not do so using AWS tools.
agreed, the article's approach to scalability is to throw silly amounts of money at the problem, instead of going for an architecture that first squeezes every bit of performance out of the app. true, this approach is pretty simple and works for any kind of application, but the RDS will hit its connections cap quite fast if one just throws instances at the problem.
edit: yep, just noticed this comes from an Amazon Web Services Solutions Architect, of course the solution is to throw money at them
Yup. They put out a white paper at one point on surviving DDOS attacks on AWS which amounted to "out-scale the attack". AKA the Wallet based DDOS.
Yep. I've recently load tested (with Locust) a Flask/uWSGI/Nginx webapp I built that does Pandas DataFrame queries based on user input and serves data computed from the query result. I put a bit of effort into profiling and optimizing the Python code^1, and I do caching in uWSGI. Running on the equivalent of a single t2.small instance, it can handle about 70,000 requests per hour, which I figure is the equivalent of a few thousand simultaneous users^2. For just serving a dynamic webpage from Flask it can handle almost a million requests per hour.
^1 (Surprisingly, a Pandas DataFrame lookup like `df[df.alpha == input]` can be almost an order of magnitude faster if you replace `df.alpha` with `df.alpha.values`.)
^2 (The data it serves is input for simulation codes which take hours to run on the user's hardware, so 30 lookups per hour is probably more than a typical user would do.)
Edit: asterisk doesn't work as a footnote symbol here...
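For anyone who wants to try the footnote ^1 trick, here is a small sketch. The DataFrame, column contents and `key` are invented for illustration (the original app's data isn't shown), and the size of the speedup will depend on your pandas version and frame size, so measure it yourself.

```python
import timeit

import numpy as np
import pandas as pd

# Hypothetical data: 100,000 rows with 100 distinct values in `alpha`.
df = pd.DataFrame({"alpha": np.arange(100_000) % 100, "beta": np.arange(100_000)})
key = 42  # stands in for the user input

via_series = df[df.alpha == key]         # comparison on the pandas Series
via_values = df[df.alpha.values == key]  # comparison on the underlying NumPy array

# Both forms select exactly the same rows.
assert via_series.equals(via_values)

# Time only the comparison that builds the boolean mask.
t_series = timeit.timeit(lambda: df.alpha == key, number=200)
t_values = timeit.timeit(lambda: df.alpha.values == key, number=200)
print(f"Series mask: {t_series:.4f}s  ndarray mask: {t_values:.4f}s")
```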
What really kills you on AWS are the insane bandwidth prices. Buying bandwidth elsewhere is often so much cheaper than AWS that the difference in bandwidth costs alone more than finances the servers.
I've heard the non-AWS folks talk of these vendor lock ins or long term costs but aren't those irrelevant in 2016+? eg. microservices to reduce the issue of vendor lock in and long term costs on infrastructure that goes out of date every 2-3 years is a poor planning indicator no?
I know first hand the kind of discounts some companies much, much smaller than Netflix can get, and they are steep. EC2 is still expensive then too, but if you pay, say, a million a year to Amazon without massive discounts, you've not done your job when negotiating.
But yes, someone with the leverage Netflix has will be paying relatively reasonable rates for EC2 services. But pretty much nobody else has the leverage Netflix has.
> I've heard the non-AWS folks talk of these vendor lock ins or long term costs but aren't those irrelevant in 2016+?
Paying far above market rates is never going to be irrelevant, because if you pay above market and your competitor doesn't, chances are they'll have you for breakfast thanks to better margins.
Why in the world would you agree to pay above market rates to get locked in for 1-3 years when you can pay less on a month-by-month contract?
You can run it on AWS or somewhere else but moving is always a problem regardless.
In fact, they don't even use the same hardware or software.
When we looked at S3 for some archiving use cases, that came up as a risk -- if strategically it made more sense for us to adopt Google, Microsoft, etc, we would need to negotiate significant concessions from a new vendor to transition away from Amazon or take a hit during that period. You always need to plan for the exit!
You'll have similar issues on-premises (ie. dealing with EMC/etc), but many people forget that cloud providers have their own gotchas too.
TBH The cost of AWS isn't what concerns me so much as the massive vendor lock-in.
Not only that, but if you invest significantly in any single technology, migrating to another technology is always going to be an extreme effort. Having led migrations from datacenters to AWS, AWS to Digital Ocean, RabbitMQ to NSQ to SNS+SQS, etc., I can say at this point that I do not believe in vendor lock-in as a legitimate reason to disqualify any particular solution.
Outside of large volume S3, it's pretty trivial to beat AWS costs, assuming you have the human capability. S3 is a little different, as the capital investment required to host petabytes of data is very high, and Amazon's economy of scale is pretty compelling.
For most anything else, dedicated boxes at a colo or your own datacenter should be cheaper, assuming you have the people around to do it, etc
1) Use Load Balancer + Autoscaler for all service layers. This effectively makes each layer a cloud of on-demand microservices.
2) Use Cloud Datastore (NoSQL): Maybe I lucked out that I don't have complex relational data to store, but Cloud Datastore abstracts out the entire DB layer, so I don't have to worry about scaling/reliability ever.
... aside from random devops stuff, that's pretty much it. The key point is to "cloudify" each layer of the infrastructure.
Most of Google Cloud is built to operate the same way with 1 user or 1m users. And in many cases, Google doesn't charge you for the "scaling vector", whereas AWS will, and will sometimes even require a separate product (see Firehose).
Things like Load Balancer not requiring pre-warming, PubSub seamlessly scaling, Datastore and AppEngine seamlessly scaling.
This is especially obvious on the product I work on, BigQuery:
- We had a customer who did not do anything special, did not configure anything, didn't tell us, and ingested 4.5 million rows per second using our Streaming API for a few hours.
- We frequently find customers who scale up to 1PB-size without ever talking to us. I can be their first point of contact at Google.. after they're at that scale.
- Unlike traditional databases, BigQuery lets you use thousands of cores for the few seconds your query needs them, and you only pay for the job. If I were to translate this to VM pricing, BigQuery gives you the ability to near-instantly fire up thousands of VMs, shut them down in 10 seconds, and only pay per-second. Customers like that kind of thing :)
Disclosure: Shamelessly biased
You should amend that to say AWS EC2 isn't the best solution. Unless you've got some pretty high utilization (either CPU or bandwidth out) of that conventional VPS host, you can buy a lot of API Gateway/Lambda for the $10/mo you pay for your VPS host and get higher availability and scalability basically free.
As a hypothetical example, let's say I have an API backend that needs 250ms of CPU to generate a 16KB response, and uses 512MB of memory. I can run this on a $9/month VPS  and, at full utilization, handle about 21 million requests per month.
Handling the same volume of requests on AWS Lambda is not just more expensive, but hugely more expensive. You end up paying about $4 in request charges, $73 for the "request gateway", $15 for the computation itself, and $30 for bandwidth. That's more than 13 times the cost, and I haven't even factored in data storage. You could buy two VPSes for fault-tolerance, hugely over-provision both of them, and you'd still end up spending less money than Lambda.
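To make the comparison easy to re-run with your own numbers, here is a sketch that only adds up the figures quoted above. It does not look up current AWS prices; the dictionary keys and variable names are mine.

```python
# Monthly figures quoted in the comment above (USD, for ~21 million requests).
lambda_costs = {
    "request charges": 4,
    "request gateway": 73,
    "computation": 15,
    "bandwidth": 30,
}
VPS_COST = 9           # USD per month, from the comment
REQUESTS = 21_000_000  # requests per month, from the comment
RESPONSE_KB = 16       # response size, from the comment

lambda_total = sum(lambda_costs.values())
ratio = lambda_total / VPS_COST
transfer_gb = REQUESTS * RESPONSE_KB / 1_000_000  # decimal GB, response bodies only

print(f"Lambda total: ${lambda_total}/month, {ratio:.1f}x the VPS")
print(f"Data out: {transfer_gb:.0f} GB/month")
print(f"Implied bandwidth rate: ${lambda_costs['bandwidth'] / transfer_gb:.3f}/GB")
```

The total comes to $122 against $9, which is where "more than 13 times" comes from.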
If your application is lightweight enough that even a single VPS is dramatically more than you need, then yeah, Lambda's pricing model could save you some of those last few dollars. But if you expect to grow, then you probably don't want to lock yourself into an API that will become much more expensive later on.
This is still the most economically sensible infrastructure strategy. Maintain a core group of dedicated servers responsible for a threshold workload. When they can no longer handle all incoming work, they offload the excess to temporarily provisioned cloud workers.
- Guarantee you are only getting price gouged by Amazon for a subset of your traffic
- Force yourself to build software that runs on multiple platforms
- Address scaling requirements up front
Perhaps most importantly, this strategy creates a profit incentive for increasing compute efficiency, regardless of Amazon's pricing structure. Every increase in software efficiency means that the same group of core servers can serve more requests, so you can pay less to Amazon.
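A toy illustration of that incentive, with made-up numbers: the dedicated core serves everything up to its capacity and only the overflow is billed by the cloud provider. `split_load` is a hypothetical helper, not part of any real dispatcher.

```python
def split_load(incoming_rps: float, core_capacity_rps: float) -> tuple[float, float]:
    """Return (rps served by the dedicated core, rps offloaded to cloud workers)."""
    core = min(incoming_rps, core_capacity_rps)
    return core, incoming_rps - core


# Hypothetical peak of 1,500 req/sec against a core sized for 1,000 req/sec.
before = split_load(1500, 1000)
# Same hardware after a 25% software efficiency gain: the core now handles 1,250.
after = split_load(1500, 1250)

print(f"offloaded before: {before[1]} rps, after: {after[1]} rps")
```

In this example the efficiency gain halves the traffic that has to be paid for at cloud rates.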
On the other hand, it looks like Lambda could be pretty great for small personal projects. It would be even better if they added a modest free tier to the request gateway, to match the other services.
That's still less money and about 1000x the performance without the hassle of dealing with the API/Lambda development experience. Just deploy your webapp to both instances without downtime and you'll be serving hundreds of thousands of users.
Amazon doesn't provide any extraordinary high-availability or reliability beyond what you can just do yourself. Their managed services are just running on their own private resources using the same AWS infrastructure, just with more money and people.
With the big advantage of when something goes sideways I can actually debug the problem, for the scale of most of the systems we run one client per VPS with a backup for some is just fine (though we are transitioning the spares onto a different provider from the primary after Linode took a pasting).
Also looking at getting a couple of beefy dedicateds down the line and running Xen for the stuff we really need to not be wiped out.
AWS is excellent for a given set of trade-offs but if you have a good Ops background you can save some money which is nice but (for me) more crucially you can access your entire stack and move wherever you want.
Also frankly I loathe dealing with AWS's web interfaces for anything - frankly they are embarrassingly bad for a company that prides itself on end user experience.
(I'm assuming you're talking metaphorically, as for my part we use onsite repair warranties to deal with failure of new hardware, and just replace old hardware except when it's something very obvious like a failed drive - it's rarely worth the trouble to do a lot of diagnostics at smaller scales; in any case you can still save and avoid this by using a managed hosting provider)
I've just moved a client off EC2 because the premium they were paying would have been a massive problem. The 85% reduction in hosting cost has bought them months of extra runway. Their operational costs related to their hosting also dropped - there have simply been fewer issues to deal with.
I'm sure there are instances where AWS is fine. But there are also plenty of cases where it is a matter of survival to cut those costs.
But I don't live in a world where runway is a consideration so YMMV. At the time I commented, the parent post was getting downvoted. I've seen that knee jerk reaction on HN multiple times, and that is what prompted my comment.
It's kind of hard to get numbers though since most private companies don't trumpet their revenue numbers or engineering headcount.
I see a lot of pessimism about AWS in this thread but it's unfounded.
The sheer number of success stories on AWS at every scale is amazing. This guide demonstrates the diverse set of services AWS offers for customers from zero to Netflix. AWS is world-class engineering and operations that can be summoned by a single API call.
There might be ways to cut monthly costs on other providers, but many people forget to factor in the time to research, design, stand up and operate software. I'd go all in on SQS, with all its design quirks and potential costs, over rolling my own RabbitMQ cluster on Digital Ocean any day.
I'm biased, working full time on open source tools to help beginners on AWS at Convox (https://github.com/convox/rack), but frankly there's not a better time to build and scale your business on AWS. The platform is pure productivity with very little operational overhead.
Are they still doing world-class ICMP filtering, breaking PMTUD?
the rule of thumb is once you hit $20-99k/month, you can cut your AWS bill in half somewhere else. sites in this phase generally only use about 20% of the features of aws.
the other rule of thumb is once you hit six figures/month, you're probably spending someone else's money, are locked in to their stack, or just don't really care to begin with, so there's no point in telling/selling you otherwise.
If I have something in an environment that I would start to consider "production" (i.e. someone relies on my product to do something regularly), then I'd have monitoring regardless of the number of users. Even something as simple as, "Am I returning valid data from GET /"?
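A check like that can be a few lines of standard-library Python run from cron. This is only a sketch: it assumes "valid data" means an HTTP 200 with a non-empty JSON body, and both function names are made up for illustration.

```python
import json
import urllib.request


def looks_valid(status: int, body: bytes) -> bool:
    """True if the response is a 200 with a non-empty JSON body (an assumed definition of 'valid')."""
    if status != 200 or not body:
        return False
    try:
        json.loads(body)
    except ValueError:  # covers malformed JSON and undecodable bytes
        return False
    return True


def check_root(url: str, timeout: float = 5.0) -> bool:
    """Fetch `url` and report whether it returned valid data."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return looks_valid(resp.status, resp.read())
    except OSError:  # HTTP errors, refused connections, timeouts
        return False
```

Calling `check_root` on your own root URL and alerting when it returns `False` is enough to get started; if the page is HTML rather than JSON, swap the JSON parse for whatever "valid" means for your app.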
But are there benefits to using Amazon's more high-level services such as SQS and SNS which, supposedly, replicate their configuration state and data across multiple regions, in terms of reliability?
For instance, on a per-instance basis AWS might be more expensive than a bare-metal provider, and there's nothing to stop you running your own RabbitMQ instance. But SQS messages are replicated across three regions, so if you were building an equivalent service you'd need three instances in different regions and a reliable distributed message queue.
So does that additional complexity/cost make SQS at all worthwhile? Or does it come down to the fact that, while your own hand-rolled service would require more management, your potential message throughput at a given cost would be much higher than with SQS?
In the end, I might just pay a little more for a faster server. Keep things simple, everything on the one app.
It's a "normal" app (in the grand scheme of the Internet), so 10 users at a time would be high traffic already.
I agree with you.
In tribute to the Dead Milkmen, in case you want to sue me, I'm talking about this book - http://www.amazon.com/Magic-Beans-Nutrient-Rich-Disease-Figh...
Once you're at a large enough scale, then yes, engineer costs become a smaller component and it becomes worth it.
More dynamic applications (like a commenting system) might feel better at 10-30 seconds of caching with expiration commands, but it will still help scale up significantly.
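As a rough in-process illustration of short-lived caching with explicit expiration (in practice this would more likely live in Nginx, Varnish or memcached), here is a minimal sketch. The class and method names are invented, and the injectable `clock` is only there to make the behaviour testable.

```python
import time
from typing import Any, Callable, Dict, Hashable, Tuple


class TTLCache:
    """Cache values for a fixed number of seconds, with manual expiration."""

    def __init__(self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self._store: Dict[Hashable, Tuple[float, Any]] = {}

    def get_or_compute(self, key: Hashable, compute: Callable[[], Any]) -> Any:
        now = self.clock()
        hit = self._store.get(key)
        if hit is not None and now - hit[0] < self.ttl:
            return hit[1]  # still fresh: skip the expensive work
        value = compute()
        self._store[key] = (now, value)
        return value

    def expire(self, key: Hashable) -> None:
        """Drop a cached entry, e.g. right after a new comment is posted."""
        self._store.pop(key, None)
```

A commenting system could wrap its page render in `get_or_compute` with a 10-30 second TTL and call `expire` for a thread whenever someone posts to it.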
Am I right in thinking that such caching comes built-in with PHP 5.5+ ?
You can run wordpress (a fairly unoptimized app) on a tiny linux VM and easily serve 50 requests per second. That's 4M requests over 24 hours.
If you need more than that, just upscale your server. 1 midsize server these days can handle 100M requests per day without a problem if it's just running a basic site.
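Converting between per-second and per-day figures makes these numbers easier to compare. A quick sketch, assuming perfectly even traffic over the day (real sites peak well above the average); the variable names are just for illustration:

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400

# 50 req/sec sustained for 24 hours.
wordpress_daily = 50 * SECONDS_PER_DAY
# Average rate implied by 100M requests per day.
midsize_avg_rps = 100_000_000 / SECONDS_PER_DAY

print(f"{wordpress_daily:,} requests/day")  # prints "4,320,000 requests/day"
print(f"{midsize_avg_rps:,.0f} req/sec")    # prints "1,157 req/sec"
```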
Using a pay-as-you-go CDN service would likely be the way you would want to go just so that you aren't tied down to any monthly commitment that you may not end up fully using.
I would suggest taking a look at KeyCDN (https://www.keycdn.com/) which is quite affordable.
CDNs are very cheap and easy to set up. No big contracts or commitments these days. You can use them just for the static assets or for your entire site to make it faster for everyone while also reducing requests to your origin server.
MaxCDN is cheap and effective or you can use CloudFlare and get their security features too and not worry about bandwidth.
I very much like Amazon's AWS. I use it extensively. But apparently some folks go a little crazy and adopt cloud services as the final solution for every use case. They have no idea how much traffic a real high-end server fully loaded with memory and SSD disks can handle these days.
> Put caching in front of the DB
Isn't that a little late?
Start with SQL and only move to NoSQL when necessary.
Users > 10.000.000+:
Moving some functionality to other types of DBs (NoSQL,
Patient records is one I can think of.
(Amazon.com retail website runs on EC2 and AWS since 2010)