
A Beginner's Guide to Scaling to 11M Users on Amazon's AWS (2016) - febin
http://highscalability.com/blog/2016/1/11/a-beginners-guide-to-scaling-to-11-million-users-on-amazons.html
======
chmod775
Here's some anecdata from someone serving 1.5 million monthly users with a
realtime (chat etc.) web app. Disregarding the file serving functionality,
this is the entire infrastructure:

- The frontend is running on a 30 bucks/mo VPS with cycles and RAM to spare.
- There are two DB servers at 5 bucks/mo, for redundancy.

This thing would scale pretty linearly in theory, so for 10 million users just
add like 4 servers.

So about 200 bucks / mo to run your average startup's web app at 10 mil users.

Alternatively write bad software and throw 50k/mo at AWS I guess.

~~~
stygiansonic
What’s the typical QPS like?

~~~
chmod775
Since it's essentially a one-page app, I can do a lot via persistent websocket
connections (of which there are usually ~25k concurrently).
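
Roughly, the pattern looks like this (a minimal sketch in Go using
gorilla/websocket; purely illustrative, since the actual stack isn't stated):

    package main

    import (
        "log"
        "net/http"

        "github.com/gorilla/websocket"
    )

    var upgrader = websocket.Upgrader{} // default options

    // Each client holds one persistent connection; idle sockets are cheap,
    // so tens of thousands of them fit comfortably on a single box.
    func handleWS(w http.ResponseWriter, r *http.Request) {
        conn, err := upgrader.Upgrade(w, r, nil)
        if err != nil {
            log.Println("upgrade:", err)
            return
        }
        defer conn.Close()
        for {
            msgType, msg, err := conn.ReadMessage()
            if err != nil {
                return // client went away
            }
            // Echo back; a real app would route chat messages here.
            if err := conn.WriteMessage(msgType, msg); err != nil {
                return
            }
        }
    }

    func main() {
        http.HandleFunc("/ws", handleWS)
        log.Fatal(http.ListenAndServe(":8080", nil))
    }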

------
altmind
> Amazon S3 is an object base store.
> Highly durable, 11 9's of reliability.

My eye twitched when I read this, considering that I remember more than one
AWS outage that affected S3 availability/reliability.

AWS's own pages mention 11 nines for /durability/. However, I doubt that
durability number; this level is almost unachievable by any end-customer
product. Could it be some clever durability accounting?

I'm also quite disappointed in AWS's service status reporting: a service can
be disrupted so badly that it can't serve a single request, yet the status
pages don't mention any problems at all, or just report "increased error
rates".

~~~
antonvs
The quoted 11 nines durability number is saying that you won't lose your data.
Availability has nothing to do with it. It seems the OP used the wrong word
and should have said "durability" instead of "reliability".

> this level is almost unachievable by any end customer product.

Not sure what your point is here, unless you're highlighting the benefits of
using a cloud provider. AWS is not using an "end customer product" to
implement S3. They're providing a managed service, and their implementation
and management is what allows it to achieve those levels of durability.
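
For a sense of scale, here's the back-of-envelope arithmetic behind 11 nines
of annual durability (assuming independent failures, which is the part AWS's
design has to earn):

    P(an object survives one year) = 0.99999999999          (11 nines)
    E[objects lost per year]       = N * (1 - 0.99999999999)
                                   = 10^7 * 10^-11 = 10^-4   (N = 10 million)

So with ten million objects stored, the expected loss is roughly one object
every 10,000 years.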

------
bufferoverflow
One important thing it doesn't discuss is the insane AWS costs compared to
other clouds, and even to regular self-managed solutions running on cheap
VPSes and dedicated servers.

~~~
InGodsName
When you get a network-level DDoS on DigitalOcean, they can't save you.

This rules out small clouds for us.

So we use GCP, Azure, and AWS exclusively, because of their ability to
withstand network-level DDoS attacks.

~~~
altmind
How do you manage to mitigate a DDoS on a public cloud?

When you say GCP, Azure, and AWS have the capacity to defy a DDoS, what
capacity are you referring to?

Are you talking about actually scaling out and serving the bogus requests? Or
having enough bandwidth and firewall power to fend them off?

~~~
aynsof
I can only speak for AWS, not GCP or Azure, but there are services that can
help mitigate DDoS attacks:

Shield ([https://aws.amazon.com/shield/](https://aws.amazon.com/shield/)) is
AWS's DDoS protection service. It's free and provides basic protection
against L3/L4 attacks.

Shield Advanced (same URL as above) is a big step up in price, but gives you
access to 'improved protection' and a global response team.

Cloudfront
([https://aws.amazon.com/cloudfront](https://aws.amazon.com/cloudfront)) is a
CDN with global edge locations.

WAF ([https://aws.amazon.com/waf](https://aws.amazon.com/waf)) is AWS's web
application firewall service. It's less about DDoS and more about specific
application attacks, but it's part of the whole solution.

For more detail, you can have a look at AWS's DDOS whitepaper:
[https://d0.awsstatic.com/whitepapers/Security/DDoS_White_Pap...](https://d0.awsstatic.com/whitepapers/Security/DDoS_White_Paper.pdf)
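
For what it's worth, a WAF rate-based rule is conceptually just managed rate
limiting at the edge. A minimal application-level sketch of the same idea in
Go (using golang.org/x/time/rate; this only helps against L7 floods, not the
volumetric attacks Shield absorbs, and the limits here are made up):

    package main

    import (
        "log"
        "net/http"
        "sync"

        "golang.org/x/time/rate"
    )

    var (
        mu       sync.Mutex
        limiters = map[string]*rate.Limiter{}
    )

    // limiterFor returns a per-client limiter: 10 req/s steady, bursts of 30.
    // (Unbounded map: fine for a sketch; evict stale entries in real code.)
    func limiterFor(ip string) *rate.Limiter {
        mu.Lock()
        defer mu.Unlock()
        l, ok := limiters[ip]
        if !ok {
            l = rate.NewLimiter(10, 30)
            limiters[ip] = l
        }
        return l
    }

    func limit(next http.Handler) http.Handler {
        return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
            // NB: RemoteAddr includes the port; real code would extract
            // the client IP (and trust X-Forwarded-For only behind a LB).
            if !limiterFor(r.RemoteAddr).Allow() {
                http.Error(w, "too many requests", http.StatusTooManyRequests)
                return
            }
            next.ServeHTTP(w, r)
        })
    }

    func main() {
        ok := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
            w.Write([]byte("ok"))
        })
        log.Fatal(http.ListenAndServe(":8080", limit(ok)))
    }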

------
NightlyDev
And just remember: public clouds are expensive. If you need to handle millions
of requests per second, there is a very high probability that you can afford a
dedicated infrastructure team, save a ton of money, and have something that
works just as well.

Just look at the insane(!!!) bandwidth costs...

~~~
fma
I'm a bit naive when it comes to infrastructure, since I've never managed it
myself. I've always been in companies with dedicated infrastructure :).

Why does Netflix outsource to AWS?

The gripe I've had with dedicated infrastructure teams is that their
knowledge/experience can be hit or miss. My current company is all about
saving money, so sometimes there are junior guys on the team. Or there are
guys who need to handle 20 different products; they're good at 10, but suck at
the other 10. I personally don't believe they follow best practices (but I'm
not an infrastructure expert), because downtime happens.

~~~
sokoloff
AFAIK, Netflix does not _stream_ out of AWS, but instead "just" runs their
application services on AWS (which is still a significant workload).

I seem to recall they offered a rack full of their own caching equipment to
medium and larger ISPs (to save both sides streaming costs while providing
better service to Netflix customers). Basically a private app-specific CDN.

~~~
fredsted
It's true. If you open the Web Inspector while streaming a Netflix movie, the
URL it's streaming from probably contains your ISP's name and region!

------
sethammons
They say "you need to add monitoring, metrics and logging" at the 500k user
mark. That should be happening with user 0 with few exceptions that I can
think of.

~~~
Raidion
I feel like this is the reason stuff doesn't get shipped. If you couldn't
release v0.01 without adding in a robust analytics suite, nothing would ever
get released.

There's nothing wrong with releasing a v0.01 and not putting priority on
anything but basic functionality until you start to grow. Part of this means
you need to build out functionality that doesn't require heavy monitoring,
metrics, and logging to get the job done. All that infrastructure stuff is
more fun to code once you've blown through the free tier of S3 storage anyway.

~~~
sethammons
Nothing heavy needed. You should know that your thing is working: a basic
health check. You should know what errors are happening, with basic logging.
As soon as you have a paying user, you should have some form of reporting or
alerting on this. You should have some basic metric(s) in place very early on,
like requests per second or sign-ups. This should take you on the order of
hours, not days, to set up. Nothing heavy. But without basic visibility into
your system, you are asking for problems.
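
A minimal sketch of that "hours, not days" baseline, using only Go's standard
library (endpoint names are illustrative):

    package main

    import (
        "expvar"
        "log"
        "net/http"
    )

    // A single counter is already a useful metric on day 0.
    var requests = expvar.NewInt("requests_total")

    func main() {
        // Basic health check: a load balancer or uptime pinger hits this.
        http.HandleFunc("/healthz", func(w http.ResponseWriter, r *http.Request) {
            w.WriteHeader(http.StatusOK)
            w.Write([]byte("ok"))
        })
        http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
            requests.Add(1)
            log.Printf("%s %s", r.Method, r.URL.Path) // basic request logging
            w.Write([]byte("hello"))
        })
        // expvar exposes registered counters as JSON at /debug/vars for free.
        log.Fatal(http.ListenAndServe(":8080", nil))
    }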

~~~
komali2
What tools would you use to set this up?

~~~
alexdias
Depends on your stack and requirements (do you want to know about errors ASAP,
or is a 2-5 minute delay OK?), but I personally love New Relic because of how
easy it is to set up (and the number of features it has).

If you want tools that you can manage yourself, then a combination of StatsD +
Grafana for metrics, and Sentry for errors. For logs, Graylog if you want to
set up something complicated but powerful, and syslog-ng if you just want to
dump the logs somewhere (so you can simply grep them).
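
If you go the StatsD + Grafana route, the client side is tiny: StatsD speaks a
one-line text protocol over UDP ("name:value|type", where "|c" is a counter,
"|ms" a timing, "|g" a gauge). A minimal Go sketch, no client library needed
(8125 is StatsD's default port):

    package main

    import (
        "fmt"
        "net"
    )

    // statsdCount emits a counter increment in StatsD's wire format.
    func statsdCount(conn net.Conn, name string, n int) {
        fmt.Fprintf(conn, "%s:%d|c", name, n)
    }

    func main() {
        // Fire-and-forget UDP; a lost metric packet is no big deal.
        conn, err := net.Dial("udp", "127.0.0.1:8125")
        if err != nil {
            return
        }
        defer conn.Close()
        statsdCount(conn, "app.signups", 1)
    }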

~~~
late2part
Most of these tools cost too much to scale past 1M users.

~~~
throwaway98121
You could write your own service... some thin agent that runs on your boxes
and dumps the files every hour to some storage-optimized boxes (your data
lake)... where another process picks up those files periodically (or on a
notification that something new is present) and loads them into a Postgres
instance (you probably actually want a column-oriented store).

Running every hour, you won't get up-to-the-second or up-to-the-minute data
points. For more critical sections of code, maybe have your code log directly
to an agent on the machine that periodically flushes those calls to some
logging service.
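
A minimal sketch of such an agent's flush loop in Go (the paths and hourly
cadence are hypothetical; shipping to the data lake would really be rsync/scp
or an S3 upload rather than a local move):

    package main

    import (
        "fmt"
        "os"
        "path/filepath"
        "time"
    )

    // flush moves all buffered log files from the spool directory to the
    // destination, where the loader process will pick them up.
    func flush(spool, dest string) error {
        files, err := filepath.Glob(filepath.Join(spool, "*.log"))
        if err != nil {
            return err
        }
        for _, f := range files {
            target := filepath.Join(dest, filepath.Base(f))
            if err := os.Rename(f, target); err != nil { // same-volume move
                return err
            }
        }
        return nil
    }

    func main() {
        for range time.Tick(time.Hour) {
            if err := flush("/var/spool/applogs", "/mnt/datalake/incoming"); err != nil {
                fmt.Fprintln(os.Stderr, "flush failed:", err)
            }
        }
    }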

~~~
late2part
1. Collectd and Graphite serve this well or can be modified

2. Nine of these commercial services gives you per second granularity that I
have seen
have seen

~~~
late2part
s/Nine/None/

#if there are 9 i'd like to know of them :-)

------
packetslave
This is a written summary (like all High Scalability articles) of this video:
[https://www.youtube.com/watch?v=vg5onp8TU6Q](https://www.youtube.com/watch?v=vg5onp8TU6Q)

------
ac29
Some of this advice seems like wildly premature optimization. Splitting into
multiple hosts and/or using a database service at 10 users?

At 10 users, you should be focusing on growth, not fussing with your tech
stack.

------
munns
I wrote the first version of this deck back in 2013:
[https://www.slideshare.net/AmazonWebServices/scaling-on-aws-for-the-first-10-million-users](https://www.slideshare.net/AmazonWebServices/scaling-on-aws-for-the-first-10-million-users)
and since then it has gone on to become one of the most-delivered talks by AWS
across the world. I thought I'd pop into this thread and answer some
questions/help provide guidance as to why it exists. Note that the first deck
was mostly tossed together on an early-morning Amtrak from NYC to Philly, so
excuse the typos and graphics glitches.

Generally speaking, we've/I've seen a number of customers struggle with
premature optimization of their infrastructure, and/or just not knowing what
they should think about, and when, when it comes to scale and architecture
patterns. Think of this deck as an 80/20 rule for the most general and basic
cases. Also, 2013 was a very, very different world in the cloud and in
infrastructure than today.

This deck was all pre-Docker/k8s, pre-Lambda/FaaS, pre-a-lot-of-cool-DB-tech.
However, as a general starting point, a lot of it still holds just fine. You
_should_ probably start with a relational DB if you have a more traditional
background using them. You _probably_ don't need autoscaling Day 1 (but it's
nice via things like Elastic Beanstalk).

Someone commented that metrics/logging/monitoring is a Day 0 thing, and it
absolutely is, but it is the kind of thing most companies skimp on until very
late in the game. Not pushing for excellence in this area will block you from
success down the line. The same goes for hiring someone with dedicated Ops/SRE
responsibilities, and/or making sure someone on the team has the right
experience in that space.

DB growth/scaling is now a lot more transparent than it used to be, but I
still see people doing less than ideal things with them. This is an area the
industry needs to be sharing more best practices on (while still respecting
some of the best practices from decades of SQL).

On costs: today some of the biggest sites in the world run on a public cloud.
They do that for many reasons, but cost effectiveness at scale is pretty well
proven when measured the right way. Most math ignores people and opportunity
cost, and that's what gets you more than metal and power. There is also the
wealth of managed services today that effectively replace people, with a
returned value greater than the people cost (essentially, the ROI on managed
services greatly outpaces the value of doing it yourself). The SaaS ecosystem
is also vastly larger than it was in 2013. I hope to never run a logging
cluster, monitoring systems, backup tools, etc., myself again.

Anyway, glad to see this deck still kicks around and that people are
discussing it. Happy to try and answer more questions on this. - munns@

------
karmakaze
A glaring omission in AWS's offerings is a NewSQL database that lets you start
with SQL, have redundancy, and scale horizontally while keeping consistent
reads the whole time. Scaling to 10M+ users and then switching datastores
makes for good story writeups because it's hard. It's also avoidable.

~~~
antonvs
Doesn't Aurora fit that bill?

~~~
karmakaze
Aurora has good replication but it's still just master/slaves rather than
sharded. DynamoDB is sharded but nosql.

~~~
antonvs
Thanks, wow, I guess I just never dug deep enough past the Aurora marketing,
which I see now actually talks about its "distributed storage system".

------
squigs25
I don't think it's necessarily a good idea to use SQL for everything. If you
are constantly inserting records and don't need relational structures,
DynamoDB might be a better choice. DynamoDB scales quickly and cheaply, so
it's potentially the preferable option if it works in your architecture.
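
For illustration, an insert like that is a single PutItem call: no schema
migration, no joins, and throughput scales with the table. A minimal sketch
with the AWS SDK for Go v1 (the table and attribute names are made up):

    package main

    import (
        "log"

        "github.com/aws/aws-sdk-go/aws"
        "github.com/aws/aws-sdk-go/aws/session"
        "github.com/aws/aws-sdk-go/service/dynamodb"
    )

    func main() {
        svc := dynamodb.New(session.Must(session.NewSession()))

        // Write-heavy, non-relational records map naturally to PutItem.
        _, err := svc.PutItem(&dynamodb.PutItemInput{
            TableName: aws.String("events"), // hypothetical table
            Item: map[string]*dynamodb.AttributeValue{
                "pk": {S: aws.String("user#123")},
                "sk": {S: aws.String("click#2023-01-01T00:00:00Z")},
                "n":  {N: aws.String("1")}, // numbers travel as strings
            },
        })
        if err != nil {
            log.Fatal(err)
        }
    }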

~~~
zwkrt
It's a lot easier to iterate on your data storage with relational tables,
though. DynamoDB is great if you are sure a component just needs a k/v store
and you are sure you will never want to do a join.

------
debarshri
I think the guide does not work when you have fewer users and more data, for
instance if you are a B2B SaaS provider. I think the way you architect an
infrastructure should be driven by the value of the user.

------
bradknowles
(2016)

~~~
ThrowMeDown01
Does anybody have an equally short summary of any differences today?

By the way, there are links below the text, here's the one to the HN
discussion of 2016:
[https://news.ycombinator.com/item?id=10885727](https://news.ycombinator.com/item?id=10885727)

~~~
InGodsName
Today we use:

For the database, RDS/DynamoDB.

Redis for cache.

DynamoDB is better in cases where we want to localize the latency of our
regional Lambdas.

RDS for everything else, like dashboard entity storage, etc.

CloudWatch collects the logs, Kinesis ships them to S3 where they're
transformed in batches with Lambda, and then the data is moved to Redshift.
Redshift handles stats/reports.
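
A transform step like that is commonly a Firehose-invoked Lambda that receives
a batch of records and returns them transformed. A minimal Go sketch using
aws-lambda-go's event types (the actual pipeline above isn't specified, so
this is just the general shape):

    package main

    import (
        "github.com/aws/aws-lambda-go/events"
        "github.com/aws/aws-lambda-go/lambda"
    )

    // Firehose hands the Lambda a batch; each record comes back marked Ok
    // (with transformed data), Dropped, or ProcessingFailed.
    func handler(ev events.KinesisFirehoseEvent) (events.KinesisFirehoseResponse, error) {
        var resp events.KinesisFirehoseResponse
        for _, rec := range ev.Records {
            resp.Records = append(resp.Records, events.KinesisFirehoseResponseRecord{
                RecordID: rec.RecordID,
                Result:   events.KinesisFirehoseTransformedStateOk,
                Data:     rec.Data, // a real transform would reshape for Redshift COPY here
            })
        }
        return resp, nil
    }

    func main() {
        lambda.Start(handler)
    }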

We converted the whole ad network to serverless, and used the Rust Lambda
runtime for CPU-intensive tasks.

We're using Go for the rest of the Lambdas.

I love Go and Rust, and optimizing just one Lambda at a time brought the joy
of programming back into my life.

We used Apex + Terraform to manage the whole infra for the ad network.

We managed to deploy Lambdas in all AWS regions, ensuring minimum latency for
the ad viewers.

The thing that used to take a team of over 50 people (tech side only) now
takes 10 people to run.

The network is doing 20M profit per year / 9 billion clicks per day.

It's my greatest achievement so far, because we made it from scratch without
any investor money.

But on the other side, we'll have to shrink our team next year, as the growth
opportunity is limited and we want to optimize efficiency further.

~~~
k__
Pretty awesome!

I'm currently planning to write a book about AWS. It should teach people how
to build their MVPs without restricting themselves in the future.

Are you available for an interview in January?

~~~
schnevets
I've been looking for a book on this topic for a while now!

~~~
k__
Some people have told me they were searching for this.

I think I'll put up a small splash page for email gathering in the next few
days to keep people up to date :)

------
koehler
Is this still valid?

~~~
freedomben
Yeah absolutely. Things change fast but not _that_ fast :-)

That said, it's not a perfect guide. You need to adjust based on your area of
expertise and comfort, and also on any business requirements. For example, in
most businesses protecting the database is important, so running only one
instance with no backups/snapshotting would be bad, reckless even. If it's
just a hobby project tho, sure. I've done it several times just to deploy a
simple POC that can actually be used.

Is there anything specific you had a question about?

~~~
n_ary
Offtopic: I've never worked on anything that has >100 (200 on good days)
users. When you have >10k or 100k users, what really changes? Do you have to
rewrite everything? Do you spin up more machines? Profile & optimise any/all
bottlenecks to eke out more perf? I'm really curious, as I'm not sure what
something of that scale looks like. I always imagine those systems are
rewritten in C/C++/Java or Rust (in current years), or have a very special
architecture, etc. :)

~~~
freedomben
haha, nope. I've scaled Ruby on Rails apps (Ruby is one of my favorite
languages, so I say this with love: it is fat and resource hungry) to huge
levels. Basically:

> _Do you spin up more machines?_

Is correct. We make sure that the app itself is stateless (very important for
horizontal scalability), and then I set up auto-scaling groups in AWS (or,
more recently, pods in Kubernetes that auto-scale). For the database I used
sharding, tho lately I've gotten away from it because it complicates the
database queries, and you would be amazed at how far you can vertically scale
an RDS Aurora instance.

You do have to profile expensive database queries tho. Basically if your app
is stateless and your database queries are performant, you can scale really
far by just adding instances (more machines).
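
To make "stateless" concrete: every request looks up its state in a shared
store instead of process memory, so any instance behind the load balancer can
serve any user. A minimal Go sketch (the Store interface and demo values are
hypothetical; the apps discussed above were Rails, and the store would really
be Redis or a database):

    package main

    import (
        "fmt"
        "log"
        "net/http"
    )

    // Store is whatever shared backend holds session/user state (Redis,
    // DynamoDB, Postgres...). The key point: it is NOT process memory, so
    // instances are interchangeable and an autoscaler can add or remove
    // them freely.
    type Store interface {
        Get(key string) (string, error)
    }

    func handler(store Store) http.HandlerFunc {
        return func(w http.ResponseWriter, r *http.Request) {
            // All per-user state is looked up per request; nothing lives
            // in local maps or globals.
            cookie, err := r.Cookie("session")
            if err != nil {
                http.Error(w, "no session", http.StatusUnauthorized)
                return
            }
            user, err := store.Get("session:" + cookie.Value)
            if err != nil {
                http.Error(w, "unknown session", http.StatusUnauthorized)
                return
            }
            fmt.Fprintf(w, "hello, %s", user)
        }
    }

    // demoStore is a stand-in for the shared store, just for this sketch.
    type demoStore map[string]string

    func (d demoStore) Get(k string) (string, error) {
        v, ok := d[k]
        if !ok {
            return "", fmt.Errorf("not found: %s", k)
        }
        return v, nil
    }

    func main() {
        store := demoStore{"session:abc": "alice"}
        http.Handle("/", handler(store))
        log.Fatal(http.ListenAndServe(":8080", nil))
    }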

That said, there is a point where rewriting becomes attractive: cost. Adding
these instances is expensive. There was a great blog post that I'll try to
find, where a service written in RoR needed an app server for every 400 to
500 concurrent users, and after a rewrite in Go was able to use one app
server for every 10,000 concurrent users.

~~~
n_ary
wow! Thanks for taking the time to explain. Your idea of stateless apps gave
me a new perspective, something to try on next project. The article you
mention should also be an interesting read. :)

