
A Beginner's Guide to Scaling to 11M+ Users on Amazon's AWS - dsr12
http://highscalability.com/blog/2016/1/11/a-beginners-guide-to-scaling-to-11-million-users-on-amazons.html
======
napkindrawing
I work in the entertainment / ticketing industry and we've been burned badly
before by relying on AWS' Elastic Load Balancer due to sudden & unexpected
traffic spikes.

From the article: "Elastic Load Balancer (ELB): [...] It scales without your
doing anything. If it sees additional traffic it scales behind the scenes both
horizontally and vertically. You don’t have to manage it. As your applications
scales so is the ELB."

From Amazon's ELB documentation: "Pre-Warming the Load Balancer: [...] In
certain scenarios, such as when flash traffic is expected [...] we recommend
that you contact us to have your load balancer "pre-warmed". We will then
configure the load balancer to have the appropriate level of capacity based on
the traffic that you expect. We will need to know the start and end dates of
your tests or expected flash traffic, the expected request rate per second and
the total size of the typical request/response that you will be testing."

~~~
no1youknowz
You'd be surprised how many people don't know this. I expected to scale past
1B users, and while trialling AWS I realised through testing that this is how
it behaves: it could not deal with sudden spikes of traffic.

Suffice it to say, I went elsewhere.

~~~
pjc50
A _billion users_? Are you Facebook or the Olympics?

~~~
no1youknowz
Neither. But once you start doing something like serving ads, the paradigm
shifts. Of course, what I do is a lot more intensive/complex, but I'll say
this to get the basics across.

------
falcolas
Hoo boy. Here we go. The problem with AWS reps is that they only ever see
everything as working perfectly, with no possibility of downtime in their
services.

RDS is great, but only up to a certain level. You'll still need to pull your
database off RDS once you reach that service's capacity (much sooner than
their 10M-user mark). They also keep pushing Aurora without telling us what
the tradeoffs are for the high availability. Based on the responses so far
(MySQL backed by InnoDB), it appears to be based on a technology similar to
Galera, which has a lot of caveats for its use, especially with multiple
writers.

Don't depend on Elastic Scaling for high availability: when an AZ is having
issues, the AWS API will either be down or swamped, so if you want high
availability you should carry at least 50% extra capacity at all times.
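
The 50% figure is just failover arithmetic: if load is spread over k zones and
one fails, the survivors must absorb everything. A quick sketch of that
calculation:

```python
def headroom_fraction(zones: int) -> float:
    """Extra capacity (as a fraction of normal load) needed so the
    surviving zones can absorb full load after one zone fails."""
    if zones < 2:
        raise ValueError("need at least 2 zones to survive a zone failure")
    # Provisioned capacity must be load * zones / (zones - 1);
    # everything beyond 1.0 is headroom.
    return zones / (zones - 1) - 1.0

print(headroom_fraction(3))  # 0.5 -- 50% extra across three AZs
print(headroom_fraction(2))  # 1.0 -- a full standby if you only use two AZs
```

With only two AZs the "extra" is a full duplicate of your capacity, which is
why spreading across three zones is the usual sweet spot.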

Using their scaling numbers, your costs start spiking at 10 users.
Realistically, with intelligent caching (even something as simple as Nginx
caching), you can easily support several thousand users on a t2-style
instance, either a small or a micro. Splitting services onto different hosts
not only increases your hosting costs, it increases the workload on your
developers/admins and the likelihood of failure.
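
To illustrate how far even naive caching goes, here is a minimal in-process
TTL cache in Python. It is only a sketch of the idea behind something like
Nginx's proxy cache, not a replacement for it:

```python
import time

class TTLCache:
    """Tiny time-based cache: serve a stored result until it expires,
    and only then recompute. Same idea as Nginx's proxy_cache, in-process."""
    def __init__(self, ttl_seconds: float):
        self.ttl = ttl_seconds
        self._store = {}  # key -> (expires_at, value)

    def get_or_compute(self, key, compute):
        now = time.monotonic()
        hit = self._store.get(key)
        if hit is not None and hit[0] > now:
            return hit[1]            # cache hit: no backend work at all
        value = compute()            # cache miss: do the expensive work once
        self._store[key] = (now + self.ttl, value)
        return value

renders = 0
def expensive_page():
    global renders
    renders += 1
    return "<html>rendered</html>"

cache = TTLCache(ttl_seconds=60)
for _ in range(1000):
    cache.get_or_compute("/", expensive_page)
print(renders)  # 1 -- one render served a thousand requests
```

Applied at the HTTP layer with a sensible TTL, the same pattern is often all
it takes to keep a small instance comfortable under a few thousand users.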

DR: Don't wait until you have over a thousand users to run multiple instances
in different AZs. The cost of duplicating a t2.small into a second AZ is small
compared to lost users or sales.

Automation: Be prepared for vendor lockin if you use Amazon's solutions. Also
be prepared for their APIs being unavailable during times of high load or
during AZ failures.

> Lambda [...] We’ve done away with EC2. It scales out for you and there’s no
> OS to manage.

The biggest problem with Lambda right now is the huge latency cost of cold
Lambda instances. You'll get pretty good 95th-percentile response times, but
the other 5% will be off-the-chart bad.
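
The effect is easy to see with toy numbers. Assuming (hypothetically) that 95%
of invocations hit a warm instance at 30 ms and 5% pay a 1.5 s cold start:

```python
import math

# 95 warm invocations at 30 ms, 5 cold starts at 1500 ms (made-up numbers).
latencies_ms = sorted([30] * 95 + [1500] * 5)

def percentile(sorted_values, p):
    """Nearest-rank percentile: the value at rank ceil(p * n / 100)."""
    rank = math.ceil(p * len(sorted_values) / 100)
    return sorted_values[rank - 1]

p95 = percentile(latencies_ms, 95)            # 30 ms -- looks great
p99 = percentile(latencies_ms, 99)            # 1500 ms -- off the chart
mean = sum(latencies_ms) / len(latencies_ms)  # dragged way up by cold starts
print(p95, p99, mean)
```

The p95 looks perfectly healthy while the mean and p99 are dominated by cold
starts, which is exactly why a percentile dashboard can hide this problem.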

In summary, AWS has a lot of great toys and can absolutely be used for scaling
up to silly levels. However, most who have done this degree of scaling did not
do so using AWS's tools.

~~~
LoSboccacc
> Realistically, with intelligent caching (even something as simple as Nginx
> caching), you can easily support several thousand users just fine with a t2
> style instance

Agreed. The article's approach to scalability is to throw silly amounts of
money at the problem instead of designing an architecture that first squeezes
every bit of performance out of the app. True, this approach is pretty simple
and works for any kind of application, but RDS will hit its connection cap
quite fast if one just throws instances at the problem.

edit: yep, just noticed this comes from an Amazon Web Services Solutions
Architect; of course the solution is to throw money at them

~~~
falcolas
> of course the solution is to throw money at them

Yup. They put out a white paper at one point on surviving DDoS attacks on AWS
which amounted to "out-scale the attack". AKA the wallet-based DDoS.

------
novaleaf
I went with Google Cloud, and my 1-to-10-user infrastructure is the same as my
1-million-plus-user infrastructure:

1) Use Load Balancer + Autoscaler for all service layers. This effectively
makes each layer a cloud of on-demand microservices.

2) Use Cloud Datastore (NoSQL): maybe I lucked out in not having complex
relational data to store, but Cloud Datastore abstracts out the entire DB
layer, so I never have to worry about scaling/reliability.

... aside from random devops stuff, that's pretty much it. The key point is to
"cloudify" each layer of the infrastructure.

~~~
vgt
This story doesn't get told enough.

Most of Google Cloud is built to operate the same way with 1 user or 1M users.
And in many cases Google doesn't charge you for the "scaling vector", whereas
AWS will, and will sometimes even require a separate product (see Firehose).

Think of the Load Balancer not requiring pre-warming, and PubSub, Datastore,
and App Engine all scaling seamlessly.

This is especially obvious on the product I work on, BigQuery:

- We had a customer who did not do anything special, did not configure
anything, didn't tell us, and ingested 4.5 million rows per second using our
Streaming API for a few hours.

- We frequently find customers who scale up to 1PB-size without ever talking
to us. I can be their first point of contact at Google... after they're at
that scale.

- Unlike traditional databases, BigQuery lets you use thousands of cores for
the few seconds your query needs them, and you only pay for the job. If I were
to translate this to VM pricing, BigQuery gives you the ability to
near-instantly fire up thousands of VMs, shut them down 10 seconds later, and
pay only per second. Customers like that kind of thing :)
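
A back-of-the-envelope version of that VM comparison, with all prices made up
for illustration:

```python
# Per-second vs per-hour billing for a burst of query workers.
# The VM price here is a made-up placeholder, not a real rate card.
vms = 1_000               # workers the query fans out to
burst_seconds = 10        # how long the query actually needs them
price_per_vm_hour = 0.05  # assumed $/hour for a small VM

per_second_cost = vms * burst_seconds * price_per_vm_hour / 3600
per_hour_cost = vms * price_per_vm_hour  # each VM billed a full hour minimum

print(f"billed per second: ${per_second_cost:.2f}")
print(f"billed per hour:   ${per_hour_cost:.2f}")
```

Under these assumptions the same 10-second burst is hundreds of times cheaper
when you only pay for the seconds you use.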

Disclosure: Shamelessly biased

------
zdw
AWS is great and all (especially if you need a lot of CPU cycles), but this
should come with the caveat that if you're under 1K users, AWS probably isn't
the best solution: conventional VPS hosting is usually more cost-effective.

~~~
otterley
You might not be plugging all the inputs into your cost calculus -- namely,
the amount of labor you spend reconfiguring your datacenter to accommodate
change.

~~~
griffordson
This seems to be an unpopular opinion on HN, but you are correct. It is
possible to generate millions in revenue with 1 or 2 devs. If you manage to do
that, paying a higher than average price for AWS is a no brainer.

~~~
vidarh
How much revenue you can generate per developer is totally irrelevant. If you
generate millions in revenue but server costs eat it all up, paying a 3x+
premium to run on AWS can easily bankrupt you. By all means, if your server
costs are inconsequential to your bottom line, go nuts.

I've just moved a client off EC2 because the premium they were paying would
have been a massive problem. The 85% reduction in hosting cost has bought them
months of extra runway. Their operational costs related to their hosting also
dropped - there's simply been fewer issues to deal with.

I'm sure there are instances where AWS is fine. But there are also plenty of
cases where it is a matter of survival to cut those costs.

~~~
griffordson
All good points. I should have been more specific. You can generate > $1M in
_profit_ with 1 or 2 devs, and in that case, AWS is a no brainer. In my
experience, it is much more difficult to manage dedicated hardware in multiple
data centers for high availability with only 1 or 2 devs. The opportunity
costs alone in that case can kill you.

But I don't live in a world where runway is a consideration so YMMV. At the
time I commented, the parent post was getting downvoted. I've seen that knee
jerk reaction on HN multiple times, and that is what prompted my comment.

------
nzoschke
This is a great article!

I see a lot of pessimism about AWS in this thread, but it's unfounded.

The sheer number of success stories on AWS at every scale is amazing. This
guide demonstrates the diverse set of services AWS offers for customers from
zero to Netflix. AWS is world-class engineering and operations that can be
summoned by a single API call.

There might be ways to cut monthly costs on other providers, but many people
forget to factor in the time to research, design, stand up, and operate
software. I'd go all in on SQS, with all its design quirks and potential
costs, over rolling my own RabbitMQ cluster on DigitalOcean any day.

I'm biased, working full time on open source tools to help beginners on AWS at
Convox ([https://github.com/convox/rack](https://github.com/convox/rack)), but
frankly there's not a better time to build and scale your business on AWS. The
platform is pure productivity with very little operational overhead.

~~~
zAy0LfpBZLC8mAC
> AWS is world-class engineering and operations that can be summoned by a
> single API call.

Are they still doing world-class ICMP filtering, breaking PMTUD?

------
krat0sprakhar
There's actually an account on Medium - AWSActivate which publishes a lot of
useful stuff like this. Check it out -
[https://medium.com/@awsactivate](https://medium.com/@awsactivate)

------
edvinasbartkus
It would be cool if they showed the range of costs ($$$) for each step of
growth. My fear is that if you do everything by the book, the costs correlate
with growth.

~~~
boothead
It would also be interesting to see that as a rough $$$/user figure, i.e. how
much you need to be making from each user to cover hosting.

~~~
chucky_z
I did this migration recently and we're spending about 1.75 cents per user. We
could do it for cheaper, but we've recently had some issues that were
absolutely trivial to resolve with AWS, that would have been very difficult
with our previous hosting provider.

~~~
snaily
Per month, I take it?

~~~
chucky_z
Correct.
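
At that rate (roughly $0.0175 per user per month, per the figure above) the
bill scales linearly:

```python
# Rough linear projection from the ~1.75 cents/user/month figure.
cost_per_user_month = 0.0175  # $ per user per month

monthly_bill = {n: n * cost_per_user_month for n in (1_000, 100_000, 1_000_000)}
for users, cost in monthly_bill.items():
    print(f"{users:>9,} users -> ${cost:,.2f}/month")
```

So around $17.50/month at a thousand users and about $17,500/month at a
million, assuming the per-user cost really stays flat as you grow.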

------
grepory
I would argue that you need monitoring significantly sooner than 500,000
users. I guess until then you just use Twitter noise for monitoring? That
seems like a pretty bad customer experience.

If I have something in an environment that I would start to consider
"production" (i.e. someone relies on my product to do something regularly),
then I'd have monitoring regardless of the number of users. Even something as
simple as, "Am I returning valid data from GET /"?
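
Even that minimal check is only a few lines. A self-contained sketch using
just the Python standard library; the in-process server here is a stand-in for
whatever actually serves GET /:

```python
import http.server
import threading
import urllib.request

# Stand-in app: a trivial local server so the probe has something to hit.
class App(http.server.BaseHTTPRequestHandler):
    def do_GET(self):
        body = b'{"status": "ok"}'
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):  # keep request logging quiet
        pass

server = http.server.HTTPServer(("127.0.0.1", 0), App)
threading.Thread(target=server.serve_forever, daemon=True).start()

def healthy(url: str) -> bool:
    """The simplest useful monitor: does GET / answer 200 with the
    payload we expect?"""
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            return resp.status == 200 and b'"ok"' in resp.read()
    except OSError:
        return False

url = f"http://127.0.0.1:{server.server_address[1]}/"
ok = healthy(url)
print(ok)  # True
server.shutdown()
```

Run `healthy()` from cron or any scheduler and alert when it returns False;
that alone beats finding out from Twitter.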

------
clentaminator
A lot of comments in this thread are voicing concerns over the marketed
cost/performance benefits of AWS and the reliability of its services in the
case of region failure, e.g. the API service going down.

But are there reliability benefits to using Amazon's higher-level services,
such as SQS and SNS, which replicate their configuration state and data across
multiple availability zones?

For instance, on a per-instance basis AWS might be more expensive than a bare-
metal provider, and there's nothing to stop you running your own RabbitMQ
instance. But SQS messages are replicated across multiple availability zones,
so if you were building an equivalent service you'd need several instances in
different zones and a reliable distributed message queue.

So does that additional complexity/cost make SQS at all worthwhile? Or does it
come down to the fact that, while your own hand-rolled service would require
more management, your potential message throughput at a given cost would be
much higher than with SQS?

------
aganders3
There is a lot of pessimism about AWS in here. Does anyone have a link to a
similar article from the roll-your-own perspective? I am comfortable writing
small Python web apps (i.e. running on a single instance with SQL server on
the same box), but scaling on my own is a mystery to me at this point.

~~~
greenleafjacob
Etsy's blog has some good posts [1].

[1] [https://codeascraft.com/2012/03/13/making-it-virtually-easy-to-deploy-on-day-one/](https://codeascraft.com/2012/03/13/making-it-virtually-easy-to-deploy-on-day-one/)

------
ufmace
I gotta wonder why they want to start splitting things up at only 10 users.
Unless your users are really active all day and you have a lot of very
processor-intensive stuff going on, I wouldn't think you'd need that until
well into the thousands of users.

~~~
chillydawg
As with almost everything like this, "users" is a completely undefined term
and the service could be anything. If all you want to do is serve WordPress or
whatever, then sure, this kind of cookie-cutter approach is no problem. But
for most bespoke web services or business infrastructures you pretty much have
to analyse all this stuff yourself and figure out the most cost-effective way
to do it.

------
vitoc
Coming from an environment that uses lots of AWS resources to handle scaling
requirements across different kinds of workloads on different linked accounts,
one of the challenges we faced was communicating and collaborating on our
efforts and their impact on cost efficiency. Typically our best environment
isn't the product of a singular design effort at the individual level, but
often emerges from differing opinions and from trials that test assumptions in
practice. We built a tool,
[https://liquidsky.singtel-labs.com](https://liquidsky.singtel-labs.com), to
help with this.

------
miseg
I've configured my web application to deploy its assets to S3/CloudFront. It's
a PHP app.

In the end, I might just pay a little more for a faster server. Keep things
simple: everything in the one app.

It's a "normal" app (in the grand scheme of the Internet), so 10 users at a
time would be high traffic already.

~~~
developer2
10 users? You want a $5 DigitalOcean droplet, a $10 Linode, or similar. A
single server can handle a _lot_ more than 10. There's a trend on HN, obsessed
with high availability and scalability, that makes it sound like every website
needs to be extremely resistant to any failure. The majority of websites need
no such thing. If you're spending more than $50/month on a very small website,
you are over-engineering the requirements.

~~~
miseg
Thanks. It is on a $10 Linode, but it currently uses Amazon CloudFront to
serve most assets, which is overkill. That costs like $0.50, but it's the
extra engineering complexity that I'd like to avoid.

I agree with you.

------
meirelles
IMHO many companies save time, money, or both using AWS. Others fail miserably
trying to do so.

I like AWS very much and use it extensively. But apparently some folks go a
little crazy adopting cloud services as the final solution for every use case.
They have no idea how much traffic a real high-end server, fully loaded with
memory and SSDs, can handle these days.

------
morenoh149
video of this material here
[https://www.youtube.com/watch?v=vg5onp8TU6Q](https://www.youtube.com/watch?v=vg5onp8TU6Q)

------
chinathrow
> Users > 1,000,000+

[...]

> Put caching in front of the DB

Isn't that a little late?

~~~
bpicolo
Not really. SQL DBs can handle a crapload of traffic. Maybe not a million
users all at once by default, but with a million users you're generally
looking at well under 50k on site at any given time, and if you split reads
off to replicas you can handle a lot of scale. In my experience, 50-100k
writes per second is where SQL starts to get especially hard.
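
Splitting reads off to replicas usually starts with routing in the application
layer. A minimal sketch of the routing idea only; the connection objects here
are placeholder strings standing in for real driver connections:

```python
import itertools

class ReadWriteRouter:
    """Send writes to the primary and spread reads round-robin over
    replicas. A sketch of the dispatch logic, not a production router."""
    def __init__(self, primary, replicas):
        self.primary = primary
        self._replicas = itertools.cycle(replicas)

    def route(self, sql: str):
        verb = sql.lstrip().split(None, 1)[0].upper()
        if verb == "SELECT":
            return next(self._replicas)   # reads scale out across replicas
        return self.primary               # writes stay on the primary

router = ReadWriteRouter("primary", ["replica-1", "replica-2"])
print(router.route("SELECT * FROM users"))    # replica-1
print(router.route("SELECT * FROM orders"))   # replica-2
print(router.route("UPDATE users SET x = 1")) # primary
```

A real router also has to pin reads issued inside a write transaction to the
primary and account for replication lag, which is where the hard part lives.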

------
VOYD
11m+ isn't scale. 111m+ is scale.

------
frik

      Start with SQL and only move to NoSQL when necessary.
    
      Users > 10.000.000+:
        Moving some functionality to other types of DBs (NoSQL, 
        graph, etc)
    

Interesting insights from Amazon. While not everyone will agree, there is
apparently some truth in it.

~~~
collyw
There isn't usually a good reason to start with a NoSQL solution, except to
get buzzwords onto your CV.

~~~
UK-AL
Or the fact that there are data sets that fit NoSQL databases far better.

Patient records are one I can think of.

~~~
billmalarky
These data sets can be handled easily by PostgreSQL's JSONB data type.

~~~
collyw
Or a normal table.

~~~
billmalarky
Normal tables don't elegantly handle certain types of data. I'm not saying you
can't make it work, but there's a valid reason why people choose to use
document stores over traditional tables in certain cases.
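
The trade-off is easy to see in miniature. This sketch uses SQLite's built-in
json_extract() as a stand-in for PostgreSQL's JSONB operators, since the
pattern is the same: heterogeneous records in one column, queried by path (the
schema and records are invented for illustration):

```python
import json
import sqlite3

# Documents-in-a-relational-DB pattern: records with differing shapes
# would be awkward as fixed columns, so store each as a JSON document.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE patients (id INTEGER PRIMARY KEY, record TEXT)")

db.execute("INSERT INTO patients (record) VALUES (?)",
           (json.dumps({"name": "Ada", "allergies": ["penicillin"]}),))
db.execute("INSERT INTO patients (record) VALUES (?)",
           (json.dumps({"name": "Grace", "implants": [{"type": "stent"}]}),))

# Query into the documents by path, no rigid schema required.
rows = db.execute(
    "SELECT json_extract(record, '$.name') FROM patients "
    "WHERE json_extract(record, '$.allergies') IS NOT NULL"
).fetchall()
print(rows)  # [('Ada',)]
```

In Postgres the same query would use `record->>'name'` on a JSONB column, with
the bonus of GIN indexes over the document contents.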

------
frik
How much would it cost Amazon to run Amazon.com on AWS?

(The Amazon.com retail website has run on EC2 and AWS since 2010.)

~~~
blahshaw
I'd be surprised if Amazon didn't run on AWS.

