A Beginner's Guide to Scaling to 11M+ Users on Amazon's AWS (highscalability.com)
445 points by dsr12 on Jan 12, 2016 | 146 comments

I work in the entertainment / ticketing industry and we've been burned badly before by relying on AWS' Elastic Load Balancer due to sudden & unexpected traffic spikes.

From the article: "Elastic Load Balancer (ELB): [...] It scales without your doing anything. If it sees additional traffic it scales behind the scenes both horizontally and vertically. You don’t have to manage it. As your applications scales so is the ELB."

From Amazon's ELB documentation: "Pre-Warming the Load Balancer: [...] In certain scenarios, such as when flash traffic is expected [...] we recommend that you contact us to have your load balancer "pre-warmed". We will then configure the load balancer to have the appropriate level of capacity based on the traffic that you expect. We will need to know the start and end dates of your tests or expected flash traffic, the expected request rate per second and the total size of the typical request/response that you will be testing."

You'd be surprised about how many people don't know this. I had an expectation to scale past 1B users. I was trialling AWS when I realised through testing that it was this way. It could not deal with sudden spikes of traffic.

Suffice to say, I went elsewhere.

A billion users? Are you Facebook or the Olympics?

Neither. But once you start doing something like serving ads, the paradigm shifts. Of course, what I do is a lot more intensive/complex, but I'll say this to get the basics across.

It doesn't take Facebook. I'm in a small adtech company. Tens of billions of requests a month is not unexpected.

> Tens of billions of requests a month is not unexpected.

10000000000 / (60 * 60 * 24 * 30) = 3,858 req/sec. That's a pretty good clip.
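
A quick sketch of that arithmetic in Python, for anyone who wants to plug in their own numbers. It assumes a 30-day month and perfectly even traffic, which real traffic never is, so treat the result as an average rather than a peak.

```python
SECONDS_PER_MONTH = 60 * 60 * 24 * 30  # 30-day month, as in the comment above


def average_rps(requests_per_month: float) -> float:
    """Average requests per second, assuming traffic is spread evenly."""
    return requests_per_month / SECONDS_PER_MONTH


ten_billion = average_rps(10_000_000_000)
print(f"{ten_billion:,.0f} req/sec")  # 3,858 req/sec
```

"Tens of billions" a month is a multiple of this figure, so the sustained average is correspondingly higher.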

That's a small adtech company. The larger ones do that per day, with some over 50B daily.

Yep. I spent some time working for one of the largest.

We see 10,000 req/sec on a regular basis.

It's not always _users_, but requests. As companies embrace microservices, I think you'll see a moderately sized application pushing tons of requests over HTTP that would normally have used a different protocol.

Where did you go, if you don't mind expanding?

I don't mind. I went with dedicated hosting. I found a supplier which had their own scalable infrastructure. They already had clients with ad-server-type applications that scaled into the billions and could handle traffic spikes. With that type of setup, it was a no-brainer.

I'm a sysadmin with over 10 years of Linux experience, so setting up and supporting servers is pretty trivial for me.

The agreement I had with the supplier was that they managed the network and hardware 24/7, and I managed the setup and support of the servers from the OS up. This arrangement worked well and I had zero downtime.

> I went with dedicated hosting

This doesn't get mentioned as much as it should but there are VPS/dedicated providers who are very close to AWS DCs.

Enough so that for many use cases you should have your database in AWS and your app servers on dedicated hardware. Best of both worlds.

Can you share a list of providers that are close to AWS DCs?

Pretty much any data center in Virginia will be close to US-EAST. If you contact them for setting up direct connect pipes they'll also provide you with a list of locations to check out.

You'll have to compare regions depending on providers. Softlayer has pretty good coverage with matching regions and low latency.

> I don't mind. I went with self hosting. I found a supplier which had their own scalable infrastructure.

That's a little vague. By "self-hosting" you mean Linux VMs, like EC2, right, or something more abstracted than that? What supplier?

Sorry, I just updated the post. I meant dedicated hosting. So bare-metal machines.

If you want to know the supplier, they are called Mojohost.


When you need performance, bare metal is always the way to go.

This saying holds such little value for so many engineers. They want uptime, ease of management, and security.

Most people aren't worried about squeezing another 3% performance out of their servers. In fact, I would say the slice-and-dice nature of VMs allows for better overall capacity usage because of overprovisioning of resources. How many apps do you know that hover at 0.07 load all day long?

Okay, how's this:

"If you're willing to pay up to a 40% premium for the features cloud providers provide, pay them. If not, go bare metal."

Fair enough.

All they say is that it costs $125. $125 for what? They do not mention the hardware specs on their website.

If you hadn't been a sysadmin, would you still have chosen dedicated hosting? (Given that you have serious scaling requirements, of course.) In other words: would it be realistic to say that a service like Elastic Beanstalk saves on hiring a sysadmin?

Sysadmins / operations people should be able to handle anything below an OS better than your usual devops guys, who would be able to build you a variation of EBS. Their value further depends on whether your software has special needs that are not suitable for cloud / virtualized infrastructure.

I've heard of many start-up companies saving plenty of money using dedicated hosting even without any operations / sysadmin pros around scaling to millions of users when the equivalent in AWS with relatively anemic nodes fared much better. In fact, WhatsApp only had a handful of physical servers handling billions of real users and associated internal messaging, and they had developers as the on-call operations engineers.

I'm an ops engineer / developer and I'd use dedicated hosting if success depends a lot upon infrastructure costs. For example, if I started a competitor to Heroku at the same time they did, I'd definitely be having a very careful debate between dedicated / colo hosting and using a cloud provider tied intimately with my growth plans. Many companies have shockingly bad operations practices but achieve decent availability (and more importantly for most situations, profitability) just fine, so even the often-cited expectations of better networks and availability zones may be worth the risks of not caring that much.

We went to Softlayer with their smallest instances running Nginx to load balance everything. Much faster and cheaper.

Why in the world would you assume any off-the-shelf solution would serve a billion users?

Unlike many cloud providers, AWS can be set up to serve a billion requests, but you need to think that mess out from start to end. You can't set up an ELB, turn on auto scale, and then go out to lunch.

Why not? That's exactly the use case, if you don't need to prewarm for bursty loads. It'll just be extremely expensive.

Also, as another comment here says, I believe a billion "users" is more like "requests" as users is vague and undefined. A single person could launch 1 or 100 requests depending on the app.

What other vendor did you go with and now looking back was it worth it from a cost & operational perspective?

Why not work with AWS to mitigate such risks now that you know more about ELBs?

This might be of interest, Netflix pre-scales based on anticipated demand: http://techblog.netflix.com/2013/11/scryer-netflixs-predicti...

After testing ELB and seeing the scaling issues, we ended up going to a pool of HAProxies + weighted Route53 entries. Route53 does a moderately good job of balancing between the HAProxies, and the health checks will remove an HAProxy if it goes bad. HAProxy itself is rock solid. The first bottleneck we came across was HAProxy bandwidth, so make sure the instance type you select has enough for how much bandwidth you expect to use.
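
A toy model of that setup, in Python, may help make the behaviour concrete. It is only a sketch of weighted selection with unhealthy entries removed; it is not Route53's actual algorithm or API, and the proxy names, weights, and health flags are made up for illustration.

```python
import random

# Hypothetical pool: name -> (weight, healthy). Names and weights are made up.
pool = {
    "haproxy-a": (10, True),
    "haproxy-b": (10, True),
    "haproxy-c": (5, False),  # pretend this one failed its health check
}


def pick_proxy(pool, rng=random):
    """Pick a healthy proxy with probability proportional to its weight."""
    healthy = {name: w for name, (w, ok) in pool.items() if ok and w > 0}
    if not healthy:
        raise RuntimeError("no healthy proxies left in the pool")
    names = list(healthy)
    weights = [healthy[n] for n in names]
    return rng.choices(names, weights=weights, k=1)[0]


# Simulate 1000 lookups; the unhealthy proxy should never be returned.
picks = [pick_proxy(pool) for _ in range(1000)]
print({name: picks.count(name) for name in pool})
```

Because selection is random per lookup (and real DNS answers are cached by resolvers), the split is only approximately proportional to the weights, which fits the "moderately good" description above.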

Do health checks work within a VPC? My understanding was they don't, so this only works for externally facing services.

I agree HAProxy is solid, but ELBs are wonderful for internal microservices.

If you do decide to use HAProxy for microservices internally, I highly recommend Synapse from Airbnb: https://github.com/airbnb/synapse

Ruby, High Availability and High Scalability? Despite idempotency, I'm not sure how comfortable I am with that.

Synapse is a service discovery framework. Essentially, it just writes HAProxy config files based on discovered upstreams - it does not receive any requests itself. The scalability is handled by HAProxy.

I was under the impression that HAProxy is what powers Amazon's ELB service.

I wish Amazon would switch to a 'provisioned throughput' model for ELB like they have for DynamoDB, where you say what level of throughput you want to support and you're billed on that rather than actual traffic. Then they keep sufficient capacity available to support that service level.

So if you expect flash traffic, you just bump up your provisioned throughput. Simple and transparent.

You can contact AWS support if needed, and they'll warm up the ELB ahead of time.



It's not perfect, but works in a pinch.

That would be a very cool offering.

Another gotcha is that ELB appears to load balance based on the IP addresses of the requests... We had private VPC/IP workers talking hundreds of requests per second to a non-sticky-session, public ELB fronted service (... don't ask why ...) and experienced really strange performance problems. Latency. Errors. What? Deployed a second private ELB fronting the same service and pointed the workers at it. No more latency. No more errors.

The issue appeared to have been that the private IP workers all would transit the NAT box to get to the public service and the ELB seemed to act strangely when 99.99% of the traffic was coming from one IP address. The private ELB saw requests from each of the individual IP addresses of the workers and acted a lot better. Or something.

ELBs are one of the biggest known weaknesses of AWS...

Their whole position on them is super opaque and prewarming is still an issue.

I'll write more about this later, but so many people have had outages due to AWS's inability to properly size these things.

I went to a meetup about 2 years ago and one of the engineers from CloudMine gave a talk about load balancing on AWS. CloudMine ended up dumping ELB for HAProxy to handle their scaling needs.

How does HAProxy compare to OpsWorks? The HAProxy Wikipedia page mentions OpsWorks is based on it.

Nginx running on a tiny instance can load balance 100k connections at thousands of requests per second. The network bandwidth for the instance will probably be saturated way before the CPU/RAM becomes a problem.

ELB (and most other managed service load balancers) are overpriced and not great at what they do. The advantage with them is easier setup and lack of maintenance.

If you're running a service with hundreds of millions or billions of requests, it's just far more effective in every way to use some small load balancing instances instead. Their Route53 service makes the DNS part easy enough with health checks.

Why do you say they're overpriced? I would say for most apps they're downright cheap. Especially since you spend so little time tinkering/monitoring/worrying about them. Most people just want to work on their app, not manage Nginx configs.

There is absolutely a tradeoff (as with everything in life) but in the context of this thread talking about scale with 100s of millions of requests, gigabytes of bandwidth and large spikes - it's far better to just host your own load balancers.

Most people (and apps) likely won't hit this scale so ELB is just fine. If you do though, ELB is just pricey and not really that great.

Link to the documentation? I thought this was changed over a year ago to not requiring pre warming?

Hoo boy. Here we go. The problem with AWS reps is that they only see everything as working perfectly, with no possibility for downtime of their services.

RDS is great, but only to a certain level. You'll still need to pull it off RDS once you reach that service's capacity (much sooner than their 10m user mark). They also keep pushing Aurora, but without telling us what the tradeoffs are for the high availability. Based on the responses so far (MySQL backed by InnoDB), it appears to be based on a technology similar to Galera, which has a lot of caveats for its use, especially with multiple writers.

Don't depend on Elastic Scaling for high availability - when an AZ is having issues, the AWS API will either be down or swamped, so you want to have at least 50% extra capacity at all times, if you want high availability.

Using their scaling numbers, your costs start spiking at 10 users. Realistically, with intelligent caching (even something as simple as Nginx caching), you can easily support several thousand users just fine with a t2 style instance, either a small or micro. Splitting services onto different hosts not only increases your hosting costs, it increases the workload on your developers/admins and the likelihood of failure.

DR: Don't wait until you have over a thousand users to have multiple instances in different AZs. The cost of duplicating a t2.small across an AZ is small compared to lost users or sales.

Automation: Be prepared for vendor lock-in if you use Amazon's solutions. Also be prepared for their APIs being unavailable during times of high load or during AZ failures.

> Lambda [...] We’ve done away with EC2. It scales out for you and there’s no OS to manage.

The biggest problem with Lambda right now is the huge latency cost of cold Lambda instances. You'll get pretty good 95th percentile response times, but that other 5% will be off-the-chart bad.

In summary, AWS has a lot of great toys, and can absolutely be used for scaling up to silly levels. However, most who have done this degree of scaling do not do so using AWS tools.

> Realistically, with intelligent caching (even something as simple as Nginx caching), you can easily support several thousand users just fine with a t2 style instance

Agreed, the article's approach to scalability is to throw silly amounts of money at the problem, instead of first going for an architecture that squeezes every bit of performance out of the app. True, this approach is pretty simple and works for any kind of application, but RDS will hit its connection cap quite fast if one just throws instances at the problem.

Edit: yep, just noticed this comes from an Amazon Web Services Solutions Architect, of course the solution is to throw money at them.

> of course the solution is to throw money at them

Yup. They put out a white paper at one point on surviving DDOS attacks on AWS which amounted to "out-scale the attack". AKA the Wallet based DDOS.

> you can easily support several thousand users just fine with a t2 style instance

Yep. I've recently load tested (with Locust) a Flask/uWSGI/Nginx webapp I built that does Pandas DataFrame queries based on user input and serves data computed from the query result. I put a bit of effort into profiling and optimizing the Python code^1, and I do caching in uWSGI. Running on the equivalent of a single t2.small instance, it can handle about 70,000 requests per hour, which I figure is the equivalent of a few thousand simultaneous users^2. For just serving a dynamic webpage from Flask it can handle almost a million requests per hour.

^1 (Surprisingly, a Pandas DataFrame lookup like `df[df.alpha == input]` can be almost an order of magnitude faster if you replace `df.alpha` with `df.alpha.values`.)

^2 (The data it serves is input for simulation codes which take hours to run on the user's hardware, so 30 lookups per hour is probably more than a typical user would do.)

Edit: asterisk doesn't work as a footnote symbol here...
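
For anyone wanting to try the `.values` trick from footnote 1, here is a minimal sketch. The DataFrame, column names, and lookup value are made up; the speedup you see will depend on your pandas version and data size, so measure it yourself rather than trusting any particular number.

```python
import timeit

import numpy as np
import pandas as pd

# Made-up data: 1000 rows with an integer key column "alpha".
df = pd.DataFrame({"alpha": np.arange(1_000), "beta": np.arange(1_000) * 2})
user_input = 42

# Comparison on the Series goes through pandas' own machinery.
slow = df[df.alpha == user_input]
# Comparison on the underlying NumPy array skips most of that overhead.
fast = df[df.alpha.values == user_input]

# Both forms select exactly the same rows.
assert slow.equals(fast)

t_series = timeit.timeit(lambda: df[df.alpha == user_input], number=200)
t_values = timeit.timeit(lambda: df[df.alpha.values == user_input], number=200)
print(f"Series compare: {t_series:.4f}s, ndarray compare: {t_values:.4f}s")
```

The boolean mask from `.values` is a plain NumPy array, which `df[...]` accepts just like a boolean Series as long as its length matches the frame.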

Agreed - they are a great solution for small teams that are growing fast and don't have predictability. But, once you have some level of predictability and scale, it makes sense to move off to something much higher performance and lower cost. Until you become a decrepit Fortune 50 company and can't manage an IT department due to bloat, and it's cheaper to outsource.

Curious what you see out there that's higher performance and lower cost than AWS? In my experience it's been a great fit for small apps all the way up to large complicated applications at scale - and once your infrastructure is large enough you're buying reserved instances anyway at anywhere between a 33% and 70% discount.

You can beat AWS on cost with pretty much any hosting provider (with some exceptions - e.g. Rackspace seems almost proud to be expensive). The 33% to 70% "discount" doesn't mean much when you then tie yourself into long term costs that are far more limiting than most managed hosting providers - so much for benefits of being able to scale up and down.

What really kills you on AWS are the insane bandwidth prices. Buying bandwidth elsewhere is often so much cheaper than AWS that the difference in bandwidth costs alone more than finances the servers.

How is Netflix able to manage this so effectively and still serve ~30% of US traffic off AWS?

I've heard the non-AWS folks talk of these vendor lock ins or long term costs but aren't those irrelevant in 2016+? E.g., microservices reduce the issue of vendor lock-in, and long-term costs on infrastructure that goes out of date every 2-3 years are a poor planning indicator, no?

I can guarantee you that Netflix are not paying anything remotely like the advertised rates for EC2.

I know first hand the kind of discounts some companies much, much smaller than Netflix can get, and they are steep. EC2 is still expensive then too, but if you pay, say, a million a year to Amazon without massive discounts, you've not done your job when negotiating.

But yes, someone with the leverage Netflix has will be paying relatively reasonable rates for EC2 services. But pretty much nobody else has the leverage Netflix has.

> I've heard the non-AWS folks talk of these vendor lock ins or long term costs but aren't those irrelevant in 2016+?

Paying far above market rates is never going to be irrelevant, because if you pay above market and your competitor doesn't, chances are they'll have you for breakfast thanks to better margins.

Why in the world would you agree to pay above market rates to get locked in for 1-3 years when you can pay less on a month-by-month contract?

Netflix could even be paying less than cost, as a loss-leader for AWS.

Feels like AWS is less of a vendor lock-in than building it in-house. Doing it all in-house has a high upfront cost that must be realized over X years irrespective of the outcome. On the other hand, if one implemented a microservices architecture, moving off AWS's month-to-month service to another provider is far easier. Did I miss something?

How is microservices related here? They're built in-house too. It's still just services/apps/code that has to run somewhere.

You can run it on AWS or somewhere else but moving is always a problem regardless.

There are no month-to-month costs with Amazon that I'm aware of. There are hour by hour, and 12 month and 36 month commitments.

Netflix does not stream content from AWS.

+1. Netflix.com is only the control plane, all content is served from CDNs.

The majority or all of Netflix's CDN traffic comes from their own CDN, which they do not run on Amazon.

In fact, they don't even use the same hardware or software.


Keep in mind that Amazon (and others) uses the "roach motel" model for networking. Easy to check in, not so easy to check out.

When we looked at S3 for some archiving use cases, that came up as a risk -- if strategically it made more sense for us to adopt Google, Microsoft, etc, we would need to negotiate significant concessions from a new vendor to transition away from Amazon or take a hit during that period. You always need to plan for the exit!

You'll have similar issues on-premises (ie. dealing with EMC/etc), but many people forget that cloud providers have their own gotchas too.

I suspect Netflix is paying something a lot closer to AWS cost price than any of us will get.

TBH the cost of AWS isn't what concerns me so much as the massive vendor lock-in.

Vendor lock-in is an unavoidable cost of doing business. Even if you build literally everything yourself, which you shouldn't, you still have resources, processes, APIs, automation, expertise amassed around a specific set of operating constraints.

Not only that, but if you invest significantly in any single technology, migrating to another technology is always going to be an extreme effort. Having led migrations from datacenters to AWS, AWS to Digital Ocean, RabbitMQ to NSQ to SNS+SQS, etc., I can say at this point that I do not believe in vendor lock-in as a legitimate reason to disqualify any particular solution.

In my mind, it's like leasing a car. Leasing is better for your cash flow, but buying is usually a lower total cost.

Outside of large volume S3, it's pretty trivial to beat AWS costs, assuming you have the human capability. S3 is a little different, as the capital investment required to host petabytes of data is very high, and Amazon's economy of scale is pretty compelling.

For most anything else, dedicated boxes at a colo or your own datacenter should be cheaper, assuming you have the people around to do it, etc.

The other problem with Lambda is that you cannot keep persistent connections in a connection pool. It is, after all, designed for statelessness. This can be a considerable cost for calls to other business services (HTTP connection pools) or infra services like databases that all maintain persistent connections.

This isn't true. I run a Lambda right now that queries a Cassandra connection pool at high volume. In Java, at least, you set up your resources in a static initializer block, as this alludes to. http://docs.aws.amazon.com/lambda/latest/dg/best-practices.h... Problem solved.

Absolutely. The overhead of re-establishing a secure DB connection for every request is hardly trivial.

It would be, if it were necessary, but it's not. (Static initializers or default constructors in Java, for example.)

Question then: How do you omit the overhead of setting up a new socket and all of the SSL handshakes? I'm not concerned about the Java overhead associated with new connections, I'm concerned with the raw connectivity/handshake overhead required with new connections to the DB.

It happens once, on initialization. :) The first execution takes anywhere from 50-70 seconds, for sure, but reusing the connection afterwards means subsequent ones don't have to deal with it (100-200 ms a pop). (Does that make sense?)
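
The same idea sketched in Python rather than Java, since the pattern is not Java-specific: anything created at module scope is created once when the container starts and is then reused by every invocation that lands on that warm container. `connect_to_db` here is a hypothetical stand-in for a real driver call, not an actual library function, and the counter exists only to show how often it runs.

```python
import itertools

_connection_ids = itertools.count(1)


def connect_to_db():
    """Hypothetical stand-in for an expensive connect (socket + TLS handshake)."""
    return {"id": next(_connection_ids)}


# Module scope runs once per container start (the "cold" invocation),
# not once per request, so the handshake cost is paid a single time.
CONNECTION = connect_to_db()


def handler(event, context):
    # Every invocation on a warm container reuses the same connection object.
    return {"connection_id": CONNECTION["id"], "query": event.get("query")}


first = handler({"query": "a"}, None)
second = handler({"query": "b"}, None)
print(first["connection_id"], second["connection_id"])
```

A new container still pays the full setup cost on its first request, which is the cold-start latency discussed elsewhere in this thread.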

50 seconds?

Agreed on Lambda latency costs. I've used it to process API calls and I noticed it can add almost half a sec to the response or sometimes even longer.

This is a bit of a hack workaround, but all you need to do is have the function run at least every ten minutes. So, using the scheduled task feature, just kick off an event every ten minutes that invokes the function with a custom event that you can respond to instantly within the event handler (to minimize costs). Once you set that up, the function will never scale down and you'll always get hot boot times for just a few pennies extra per month.
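
A minimal sketch of the handler side of that workaround, in Python. The `keep_warm` key is a made-up marker that you would put in the scheduled event's payload yourself, and `do_real_work` is a placeholder for the function's actual logic.

```python
def do_real_work(event):
    """Placeholder for the function's actual logic."""
    return {"warmed": False, "echo": event}


def handler(event, context):
    # "keep_warm" is a made-up marker key carried by the scheduled ping event.
    if isinstance(event, dict) and event.get("keep_warm"):
        # Return immediately so the ping costs as little billed time as possible.
        return {"warmed": True}
    return do_real_work(event)


print(handler({"keep_warm": True}, None))
print(handler({"user_id": 7}, None))
```

Note that a single scheduled ping would only keep one container warm; concurrent requests beyond that can still hit cold starts.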

tbh if you're that concerned about the 5% of response times being affected by cold Lambdas, then maybe lambda isn't really the solution to the problem you are trying to tackle.

I went with Google Cloud, and my 1 to 10 user infrastructure is the same as for 1 million+ users:

1) Use Load Balancer + Autoscaler for all service layers. This effectively makes each layer a cloud of on-demand microservices.

2) Use Cloud Datastore: (NoSql) Maybe I lucked out that I don't have complex relational data to store, but Cloud Datastore abstracts out the entire DB layer, so I don't have to worry about scaling/reliability ever.

... aside from random devops stuff, that's pretty much it. The key point is to "cloudify" each layer of the infrastructure.

This story doesn't get told enough.

Most of Google Cloud is built to operate the same way with 1 user or 1m users. And in many cases, Google doesn't charge you for the "scaling vector", whereas AWS will, and will sometimes even require a separate product (see Firehose).

Things like Load Balancer not requiring pre-warming, PubSub seamlessly scaling, Datastore and AppEngine seamlessly scaling.

This is especially obvious on the product I work on, BigQuery:

- We had a customer who did not do anything special, did not configure anything, didn't tell us, and ingested 4.5 million rows per second using our Streaming API for a few hours.

- We frequently find customers who scale up to 1PB-size without ever talking to us. I can be their first point of contact at Google.. after they're at that scale.

- Unlike traditional databases, BigQuery lets you use thousands of cores for the few seconds your query needs them, and you only pay for the job. If I were to translate this to VM pricing, BigQuery gives you the ability to near-instantly fire up thousands of VMs, shut them down in 10 seconds, and only pay per-second. Customers like that kind of thing :)

Disclosure: Shamelessly biased

Wholeheartedly agree! Google Cloud is so severely underrated as a platform for scalable web-apps. If you use the cloud data store and web-app common sense, there is no re-architecting required for users in the range of 100->million+. And it's _much_ cheaper, with less operational overhead, compared to EC2/AWS. The disadvantage is that you have to use the Google stack and APIs, but for new apps this is worth it.

Wonderful problem if you can get it :)

AWS is great and all (especially if you need a lot of CPU cycles), but this should come with the caveat that if you're under 1K users AWS probably isn't the best solution - conventional VPS hosting is usually more cost effective.

> if you're under 1K users AWS probably isn't the best solution - conventional VPS hosting is usually more cost effective.

You should amend that to say AWS EC2 isn't the best solution. Unless you've got some pretty high utilization (either CPU or bandwidth out) of that conventional VPS host, you can buy a lot of API Gateway/Lambda for the $10/mo you pay for your VPS host and get higher availability and scalability basically free.

I think you're dramatically underestimating the cost difference between AWS and other providers. Yes, you gain some reliability, but it's nowhere close to "basically free".

As a hypothetical example, let's say I have an API backend that needs 250ms of CPU to generate a 16KB response, and uses 512MB of memory. I can run this on a $9/month VPS [1] and, at full utilization, handle about 21 million requests per month.

Handling the same volume of requests on AWS Lambda is not just more expensive, but hugely more expensive. You end up paying about $4 in request charges, $73 for the "request gateway", $15 for the computation itself, and $30 for bandwidth. That's more than 13 times the cost, and I haven't even factored in data storage. You could buy two VPSes for fault-tolerance, hugely over-provision both of them, and you'd still end up spending less money than Lambda.
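
The numbers above can be reproduced with a few lines of Python. The per-item Lambda costs are taken straight from this comment rather than from a price list, and the 2-core figure is an assumption: the comment does not state the core count, but a single core at 250ms per request would top out near 10 million requests a month.

```python
SECONDS_PER_MONTH = 60 * 60 * 24 * 30  # 30-day month

# Figures quoted in the comment above.
cpu_seconds_per_request = 0.250
response_kb = 16
vps_monthly_usd = 9
lambda_line_items_usd = {"requests": 4, "gateway": 73, "compute": 15, "bandwidth": 30}

# Assumption: the VPS has 2 cores (not stated in the comment).
vps_cores = 2

max_requests = vps_cores * SECONDS_PER_MONTH / cpu_seconds_per_request
egress_gb = max_requests * response_kb / 1_000_000  # KB -> GB, decimal units
lambda_total_usd = sum(lambda_line_items_usd.values())
ratio = lambda_total_usd / vps_monthly_usd

print(f"{max_requests / 1e6:.1f}M requests/month, {egress_gb:.0f} GB out")
print(f"Lambda ${lambda_total_usd} vs VPS ${vps_monthly_usd}: {ratio:.1f}x")
```

Under those assumptions the VPS handles about 20.7 million requests a month, and the quoted line items sum to $122, roughly 13.6 times the $9 VPS.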

If your application is lightweight enough that even a single VPS is dramatically more than you need, then yeah, Lambda's pricing model could save you some of those last few dollars. But if you expect to grow, then you probably don't want to lock yourself into an API that will become much more expensive later on.

[1]: https://www.hetzner.de/en/hosting/produkte_vserver/cx20

Nit: You have to provision the VPS by peak usage, not dividing monthly usage evenly across the month. So if your peak is 13x the average (very easy, especially if you don't have a worldwide audience) the VPS starts to look bad, and we're not even talking about the risk of unexpected peaks.
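
To put a number on that, here is a small Python sketch of sizing by peak. The 8 req/sec per server is an assumption derived from the parent's example (2 cores at 250ms of CPU per request), and the average load is assumed to be exactly one fully utilized server.

```python
import math


def servers_needed(avg_rps: float, peak_to_avg: float, rps_per_server: float) -> int:
    """Servers required to absorb the peak, not the monthly average."""
    return math.ceil(avg_rps * peak_to_avg / rps_per_server)


avg_rps = 8.0         # assumed: one server's worth of average load
rps_per_server = 8.0  # assumed: 2 cores / 0.25 s of CPU per request

flat = servers_needed(avg_rps, 1, rps_per_server)
peaky = servers_needed(avg_rps, 13, rps_per_server)
print(flat, peaky)  # 1 13
```

At $9 per server, 13 servers come to $117 a month, which is about where the parent's Lambda total lands.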

Absorbing peak traffic was the original selling point of the "elastic" cloud. Sure, the cloud was more expensive, but you only had to pay for it for a few hours while traffic peaked. If traffic peaked multiple days in a row, then maybe it was time to rent a new dedicated server.

This is still the most economically sensible infrastructure strategy. Maintain a core group of dedicated servers responsible for a threshold workload. When they can no longer handle all incoming work, they offload the excess to temporarily provisioned cloud workers.

The benefits:

- Guarantee you are only getting price gouged by Amazon for a subset of your traffic

- Force yourself to build software that runs on multiple platforms

- Address scaling requirements up front

Perhaps most importantly, this strategy creates a profit incentive for increasing compute efficiency, regardless of Amazon's pricing structure. Every increase in software efficiency means that the same group of core servers can serve more requests, so you can pay less to Amazon.
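
A rough sketch of that split, in Python, with made-up hourly demand figures: the dedicated core serves everything up to its capacity, and only the excess is sent to temporarily provisioned cloud workers.

```python
def split_load(demand_rps, core_capacity_rps):
    """Return (served_by_core, overflow_to_cloud) for one time slice."""
    core = min(demand_rps, core_capacity_rps)
    overflow = max(0.0, demand_rps - core_capacity_rps)
    return core, overflow


# Made-up hourly demand in req/sec, with a single spike in hour 4.
hourly_demand = [40, 55, 60, 180, 90, 50]
core_capacity = 100  # assumed capacity of the dedicated servers

overflow = [split_load(d, core_capacity)[1] for d in hourly_demand]
cloud_hours = sum(1 for o in overflow if o > 0)
print(overflow, cloud_hours)
```

In this example the cloud is paid for in only one hour out of six, and any efficiency gain that raises `core_capacity` shrinks the overflow further.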

Yeah, that's a fair point. Even so, I think there's only a short window in the life of a growing webapp where its baseline traffic is small enough for Lambda to make financial sense.

On the other hand, it looks like Lambda could be pretty great for small personal projects. It would be even better if they added a modest free tier to the request gateway, to match the other services.

Buy 2 machines then. Or 4 with 2 Nginx proxy pairs.

That's still less money and about 1000x the performance without the hassle of dealing with the API/Lambda development experience. Just deploy your webapp to both instances without downtime and you'll be serving hundreds of thousands of users.

Amazon doesn't provide any extraordinary high-availability or reliability beyond what you can just do yourself. Their managed services are just running on their own private resources using the same AWS infrastructure, just with more money and people.

You might not be plugging in all inputs into your cost calculus -- namely, the amount of labor you spend reconfiguring your datacenter to accommodate change.

I'm fairly old school (been running Linux since the '90s and servers since not long after). Ansible (or something like it) and clean documentation are way cheaper (for me) than something like AWS in the general case.

The big advantage is that when something goes sideways I can actually debug the problem. For the scale of most of the systems we run, one client per VPS (with a backup for some) is just fine, though we are transitioning the spares onto a different provider from the primary after Linode took a pasting.

Also looking at getting a couple of beefy dedicateds down the line and running Xen for the stuff we really need to not be wiped out.

AWS is excellent for a given set of trade-offs but if you have a good Ops background you can save some money which is nice but (for me) more crucially you can access your entire stack and move wherever you want.

My experience is that the labor involved in maintaining an AWS setup is typically far higher than the labor involved in maintaining a system on leased hardware or managed hosting, because you still need to deal with the fallout of most types of failures, but without insight into what's going on below the hood or ability to set up a system geared specifically towards your workload.

Mine as well, but this is contingent on having people on hand who can open the hood and troubleshoot. If you don't, and you are weak on the ops side or earning so much per customer that hosting is a secondary consideration, then I can see the value in AWS. It's just not my default choice.

Also, I loathe dealing with AWS's web interfaces for anything. Frankly, they are embarrassingly bad for a company that prides itself on end user experience.

If you don't have people on hand who can "open the hood and troubleshoot" I'd argue you don't have people that can run a service on AWS reliably. The number of gotchas I've run into with AWS is far higher than what I've had to deal with with managed hosting or even bare metal hardware.

(I'm assuming you're talking metaphorically. For my part, we use onsite repair warranties to deal with failures of new hardware, and just replace old hardware except when it's something very obvious like a failed drive. It's rarely worth the trouble to do a lot of diagnostics at smaller scales; in any case you can still save money and avoid this by using a managed hosting provider.)

Indeed, I've run owned bare metal, but these days I rent servers if I need them, and largely VPSes suffice. I also feel a lot more confident when something I set up myself develops a problem, since it's what you don't know that bites you at 3am.

This seems to be an unpopular opinion on HN, but you are correct. It is possible to generate millions in revenue with 1 or 2 devs. If you manage to do that, paying a higher than average price for AWS is a no brainer.

How much revenue you can generate per developer is totally irrelevant. If you generate millions in revenue but server costs eat it all up, paying a 3x+ premium to run on AWS can easily bankrupt you. By all means, if your server costs are inconsequential to your bottom line, go nuts.

I've just moved a client off EC2 because the premium they were paying would have been a massive problem. The 85% reduction in hosting cost has bought them months of extra runway. Their operational costs related to their hosting also dropped: there have simply been fewer issues to deal with.

I'm sure there are instances where AWS is fine. But there are also plenty of cases where it is a matter of survival to cut those costs.

All good points. I should have been more specific. You can generate > $1M in profit with 1 or 2 devs, and in that case, AWS is a no brainer. In my experience, it is much more difficult to manage dedicated hardware in multiple data centers for high availability with only 1 or 2 devs. The opportunity costs alone in that case can kill you.

But I don't live in a world where runway is a consideration so YMMV. At the time I commented, the parent post was getting downvoted. I've seen that knee jerk reaction on HN multiple times, and that is what prompted my comment.

I know Whatsapp is the poster child for this sort of thinking, but how many other companies generate millions with just a couple of devs?

Origin Systems and id Software did for years. Plenty of Fish had one dev. Minecraft, Stack Overflow, Instagram, Flappy Bird... there have been a lot, and it's probably getting more common in recent years.

It's kind of hard to get numbers though since most private companies don't trumpet their revenue numbers or engineering headcount.

This is a great article!

I see a lot of pessimism about AWS in this thread, but it's unfounded.

The sheer number of success stories on AWS at every scale is amazing. This guide demonstrates the diverse set of services AWS offers for customers from zero to Netflix. AWS is world-class engineering and operations that can be summoned by a single API call.

There might be ways to cut monthly costs on other providers, but many people forget to factor in the time it takes to research, design, stand up, and operate software. I'd go all in on SQS, with all its design quirks and potential costs, over rolling my own RabbitMQ cluster on DigitalOcean any day.

I'm biased, working full time on open source tools to help beginners on AWS at Convox (https://github.com/convox/rack), but frankly there's not a better time to build and scale your business on AWS. The platform is pure productivity with very little operational overhead.

> AWS is world-class engineering and operations that can be summoned by a single API call.

Are they still doing world-class ICMP filtering, breaking PMTUD?

There's actually an account on Medium, AWSActivate, which publishes a lot of useful stuff like this. Check it out: https://medium.com/@awsactivate

It would be cool if they would show the range of costs ($$$) for each step of growth. My fear is that if you do everything by the book the costs correlate with growth.

It would also be interesting to see that as a rough $$$/user. It would be very interesting to see how much you need to be making from each user to cover hosting.

I did this migration recently and we're spending about 1.75 cents per user. We could do it for cheaper, but we've recently had some issues that were absolutely trivial to resolve with AWS, that would have been very difficult with our previous hosting provider.

This hits on something in the calculation that I feel is very hard to factor in, the cost of development time. Sure, there are plenty of ways to do these things cheaper on a hardware/software cost per user basis, but more often than not I've found that we can get changes out so much faster in AWS that you're easily saving thousands in developer time, which would seem to more than cover the extra cost to me.

Per month, I take it?
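Whatever the billing period turns out to be, the arithmetic behind a per-user figure is simple enough to sketch. The 1.75 cents per user comes from the comment above; the user count and revenue per user below are made-up inputs for illustration only, not numbers from anyone in this thread.

```python
def hosting_bill(users: int, cents_per_user: float) -> float:
    """Total hosting cost in dollars for one billing period."""
    return users * cents_per_user / 100.0


def hosting_share_of_revenue(cents_per_user: float, revenue_per_user_dollars: float) -> float:
    """Fraction of per-user revenue that goes to hosting."""
    return (cents_per_user / 100.0) / revenue_per_user_dollars


# Hypothetical inputs: 1,000,000 users, $0.50 revenue per user per period.
print(hosting_bill(1_000_000, 1.75))
print(hosting_share_of_revenue(1.75, 0.50))
```

The second function is the one that answers "how much do I need to make from each user": hosting is covered as long as that fraction stays below 1, and comfortable only when it is a small slice of revenue.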


i run an infrastructure startup.

the rule of thumb is once you hit $20-99k/month, you can cut your AWS bill in half somewhere else. sites in this phase generally only use about 20% of the features of aws.

the other rule of thumb is once you hit six figures/month, you're probably spending someone else's money, are locked in to their stack, or just don't really care to begin with, so there's no point in telling/selling you otherwise.

I would argue that you need monitoring significantly sooner than 500,000 users. I guess, until then, you just use Twitter noise for monitoring? Seems like pretty bad customer experience.

If I have something in an environment that I would start to consider "production" (i.e. someone relies on my product to do something regularly), then I'd have monitoring regardless of the number of users. Even something as simple as, "Am I returning valid data from GET /"?
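A check like that can be very small. Here is a minimal sketch using only the Python standard library; it assumes "valid data" means an HTTP 200 with a JSON body, which is my assumption rather than anything stated above, and the URL in the usage note is a placeholder.

```python
import json
import urllib.request


def is_valid_response(status: int, body: bytes) -> bool:
    """Return True if the response looks healthy: HTTP 200 and a parseable JSON body."""
    if status != 200:
        return False
    try:
        json.loads(body)
    except ValueError:
        # Covers malformed JSON and undecodable bytes.
        return False
    return True


def check(url: str, timeout: float = 5.0) -> bool:
    """Fetch `url` once and report whether it returned valid data."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return is_valid_response(resp.status, resp.read())
    except OSError:
        # Covers URLError, HTTPError, connection refusals, and timeouts.
        return False
```

Run something like `check("https://example.com/")` from cron every minute and alert on a `False`, and you already have more than Twitter noise. For an HTML site, swap the JSON parse for whatever "valid" means for your page.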

A lot of comments in this thread are voicing concerns over the marketed cost/performance benefits of AWS and the reliability of their services in the case of region failure, e.g. when the API services go down.

But are there benefits to using Amazon's more high-level services such as SQS and SNS which, supposedly, replicate their configuration state and data across multiple regions, in terms of reliability?

For instance, on a per-instance basis AWS might be more expensive than a bare-metal provider, and there's nothing to stop you running your own RabbitMQ instance. But SQS messages are replicated across three regions, so if you were building an equivalent service you'd need three instances in different regions and a reliable distributed message queue.

So does that additional complexity/cost make SQS at all worthwhile? Or does it come down to the fact that, while your own hand-rolled service would require more management, your potential message throughput at a given cost would be much higher than with SQS?

There is a lot of pessimism about AWS in here. Does anyone have a link to a similar article from the roll-your-own perspective? I am comfortable writing small Python web apps (i.e. running on a single instance with SQL server on the same box), but scaling on my own is a mystery to me at this point.

Etsy's blog has some good posts [1].

[1] https://codeascraft.com/2012/03/13/making-it-virtually-easy-...

I gotta wonder why they want to start splitting things up at only 10 users. Unless your users are really active all day and you have a lot of very processor-intensive stuff going on, I wouldn't think you need that until well over 1000s of users.

As with almost everything like this, "users" is a completely undefined term and the service could be anything. If all you want to do is serve WordPress or whatever, then sure, this kind of cookie-cutter approach is no problem, but for most bespoke web services or business infrastructures you pretty much just have to analyse all this stuff yourself and figure out the most cost-effective way to do it all.

Coming from an environment that uses lots of AWS resources to handle scaling requirements across different kinds of workloads on different linked accounts, one of the challenges we faced was communicating and coordinating efforts and their impact on cost efficiency. Typically our best environment isn't the product of a single design effort by one individual; it often emerges from differing opinions and from trials that test assumptions in practice. We built a tool, https://liquidsky.singtel-labs.com, to help with this.

I've configured my web application to deploy to S3/Cloudfront for asset deployment. It's a PHP app.

In the end, I might just pay a little more for a faster server. Keep things simple, everything on the one app.

It's a "normal" app (in the grand scheme of the Internet), so 10 users at a time would be high traffic already.

10 users? You want a $5 DigitalOcean, a $10 Linode, or similar. A single server can handle a lot more than 10. There's a trend on HN obsessed with high availability and scalability that makes it sound like every website needs to be extremely resistant to any failures. The majority of websites need no such thing. If you're spending more than $50/month on a very small website, you are more than over-engineering the requirements.

Thanks. It is on a $10 Linode. But it currently uses Amazon CloudFront to serve most assets, which is overkill. It costs like $0.50, but it's the extra engineering complexity that I'd like to avoid.

I agree with you.

Absolutely. Amazon is selling magic beans[0], quite often. They have lots of tools that they convince new engineers are the best. But quite often, if not always, the existing FOSS tools (upon which most of AWS is built and from whence they came) offer superior performance at a far better price point for most scale.

[0] In tribute to the Dead Milkmen, in case you want to sue me, I'm talking about this book - http://www.amazon.com/Magic-Beans-Nutrient-Rich-Disease-Figh...

I think AWS doesn't go for "superior performance at a far better price point for most scale". They go for "our solutions are a click away and take far less time to set up than rolling your own". You know, because engineer costs are really the biggest cost component.

Once you're at a large enough scale, then yes, engineer costs become a smaller component and rolling your own becomes worth it.

Correct. But I believe they sell and market it this way. Their best value is for companies that want burstability or convenience, and/or have dysfunctional organizations with slow, expensive internal bureaucracy.

If the PHP responses are relatively static, adding a bit of caching in front of it will improve the responsiveness and decrease the load dramatically. Simply adding a 5 minute cache let us scale one PHP application from 100 concurrent to "SSL & gzip require more CPU than PHP". We figured that was sufficient.

More dynamic applications (like a commenting system) might feel better at 10-30 seconds of caching with expiration commands, but it will still help scale up significantly.
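To make the idea concrete, here is a minimal sketch of a time-based cache with explicit expiration. It is written in Python rather than PHP, and the class and method names are hypothetical; in practice this logic usually lives in a reverse proxy in front of the app rather than in application code.

```python
import time
from typing import Callable, Dict, Tuple


class TTLCache:
    """Cache rendered responses per key for a fixed number of seconds."""

    def __init__(self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock  # injectable so the cache can be tested without sleeping
        self._store: Dict[str, Tuple[float, str]] = {}

    def get(self, key: str, render: Callable[[], str]) -> str:
        now = self.clock()
        hit = self._store.get(key)
        if hit is not None and hit[0] > now:
            return hit[1]  # still fresh: the backend is not called at all
        body = render()  # missing or expired: call the backend once
        self._store[key] = (now + self.ttl, body)
        return body

    def expire(self, key: str) -> None:
        """Explicit invalidation, e.g. right after a new comment is posted."""
        self._store.pop(key, None)
```

With `TTLCache(300)` a page is rendered at most once every five minutes per key no matter how many visitors request it; a commenting system would use something like `TTLCache(20)` and call `expire()` when a comment is posted.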

By caching, you mean like script execution caching that PHP accelerators give? https://en.wikipedia.org/wiki/List_of_PHP_accelerators

Am I right in thinking that such caching comes built-in with PHP 5.5+ ?

Look at Varnish Cache https://www.varnish-cache.org/ and Google's PageSpeed module on the server. https://developers.google.com/speed/pagespeed/?hl=en

Thanks. That's quite a bit of added complexity in my scenario, which I think I would avoid.

Nginx and Apache have built-in caching which can usually be enabled easily. It is arguably not as fast as Varnish (Nginx in particular will serve cached content from disk using sendfile, as opposed to Varnish's in-memory caching), but it is still faster than calling back into PHP.

PageSpeed is a one-line installation on your instance. It will compress assets etc. automatically as Apache serves them. Worth adding to your deployment script.

Nice. I like how it could be installed in Apache, and then left to its own devices.

"users" is a bad metric. How many requests are you getting?

You can run WordPress (a fairly unoptimized app) on a tiny Linux VM and easily serve 50 requests per second. That's over 4M requests in 24 hours.

If you need more than that, just upscale your server. 1 midsize server these days can handle 100M requests per day without a problem if it's just running a basic site.
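For anyone who wants to check these conversions, a quick sketch of the arithmetic. It only converts between a steady rate and a daily total; real traffic is peaky, so the average rate understates what the server has to survive at peak.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def requests_per_day(req_per_sec: float) -> float:
    """Daily total for a constant request rate."""
    return req_per_sec * SECONDS_PER_DAY


def avg_req_per_sec(req_per_day: float) -> float:
    """Average rate implied by a daily total."""
    return req_per_day / SECONDS_PER_DAY


print(requests_per_day(50))           # 4320000
print(round(avg_req_per_sec(100e6)))  # 1157
```

So 50 requests per second is about 4.3M per day, and 100M per day averages out to roughly 1,160 requests per second.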

Would you bother with CDN delivery for WordPress at 50 requests per second? It would speed up delivery for users, but I suspect it passes my level of "too complex for current situation".

I believe it's still worth it. Using a CDN will definitely help speed up the delivery of your static assets especially to those who are further away from your origin server. They're also quite simple to set up as there are many Wordpress plugins out there that allow you to simply enter your CDN url which will rewrite your current static asset URLs (e.g. CDN Enabler).

Using a pay-as-you-go CDN service would likely be the way you would want to go just so that you aren't tied down to any monthly commitment that you may not end up fully using.

I would suggest taking a look at KeyCDN (https://www.keycdn.com/) which is quite affordable.

Depends on how much you care about your users but yes, I would.

CDNs are very cheap and easy to set up. No big contracts or commitments these days. You can use them just for the static assets or for your entire site to make it faster for everyone while also reducing requests to your origin server.

MaxCDN is cheap and effective or you can use CloudFlare and get their security features too and not worry about bandwidth.

IMHO many companies save time, money or both using AWS. Others fail miserably trying to do so.

I like Amazon's AWS very much, and I use it extensively. But apparently some folks go a little crazy and adopt cloud services as the final solution for every use case. They have no idea how much traffic a real high-end server, fully loaded with memory and SSDs, can handle these days.

video of this material here https://www.youtube.com/watch?v=vg5onp8TU6Q

> Users > 1,000,000+


> Put caching in front of the DB

Isn't that a little late?

Not really. SQL DBs can handle a crapload of traffic. Maybe not a million all at once by default, but generally with a million users you're looking at << 50k on site at any given time, and if you split reads off to replicas you can handle a lot of scale. In my experience, 50-100k qps (writes) is where SQL starts to get especially hard.
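"Splitting reads off to replicas" can be pictured with a toy router like the one below. This is a sketch under loud assumptions: the class name and connection labels are hypothetical, and it decides purely on the first keyword of the statement, which real setups cannot rely on alone.

```python
import itertools
from typing import Sequence


class ReplicaRouter:
    """Send SELECTs round-robin to read replicas and everything else to the primary."""

    def __init__(self, primary: str, replicas: Sequence[str]):
        self.primary = primary
        # Fall back to the primary if no replicas are configured.
        self._reads = itertools.cycle(list(replicas) or [primary])

    def target_for(self, sql: str) -> str:
        stripped = sql.lstrip()
        first_word = stripped.split(None, 1)[0].upper() if stripped else ""
        if first_word == "SELECT":
            return next(self._reads)
        return self.primary


router = ReplicaRouter("primary", ["replica-1", "replica-2"])
print(router.target_for("SELECT * FROM users"))    # replica-1
print(router.target_for("UPDATE users SET x = 1"))  # primary
```

Things this ignores on purpose: replication lag (a read right after a write may need to go to the primary), `SELECT ... FOR UPDATE`, and statements inside a transaction, all of which should stay on the primary.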

11m+ isn't scale. 111m+ is scale.

  Start with SQL and only move to NoSQL when necessary.

  Users > 10,000,000+:
    Moving some functionality to other types of DBs (NoSQL,
    graph, etc)

Interesting insights from Amazon. While not everyone will agree, there is apparently some truth in it.

There isn't usually a good reason to start with a NoSQL solution, except for buzzwords on your CV.

Or the fact that there are data sets that fit NoSQL databases and seem to work far better there.

Patient records is one I can think of.

These data sets can be easily handled by PostgreSQL's JSONB data type.

Or a normal table.

Normal tables don't elegantly handle certain types of data. I'm not saying you can't make it work, but there's a valid reason why people choose to use document stores over traditional tables in certain cases.

It can, but it's not the best. This is obvious, since most patient record systems these days do not use SQL. They use things like MUMPS.

How much would it cost Amazon to run Amazon.com on AWS?

(The Amazon.com retail website has run on EC2 and AWS since 2010.)

I'd be surprised if Amazon didn't run on AWS.
