A Beginner's Guide to Scaling to 11M Users on Amazon's AWS (2016) (highscalability.com)
253 points by febin on Dec 26, 2018 | 91 comments

Here's some anecdata from someone serving 1.5 million monthly users with a realtime (chat etc.) web app. Disregarding the file serving functionality, this is the entire infrastructure:

- The frontend is running on a 30 bucks/mo VPS with cycles and RAM to spare.
- There are 2 DB servers at 5 bucks/mo each, for redundancy.

This thing would scale pretty linearly in theory, so for 10 million users just add like 4 servers.

So about 200 bucks / mo to run your average startup's web app at 10 mil users.

Alternatively write bad software and throw 50k/mo at AWS I guess.

It always amazes me how much load servers can take these days.

I guess the only thing you couldn't handle well would be large spikes in demand.

There were some spikes that brought the site down. A notable one was when I submitted the site to HN back when that kind of traffic was still a lot. You learn something each time...

Do you have more details? How did you write your app in such a way that isn't "bad?"

It's a node app that tries to keep things simple, i.e. no big frameworks. I put thought into performance as well and it has seen a significant number of re-writes. I didn't just get there from nothing. The project is about 6 years old.

Edit: I should add that I ran into performance hurdles now and then, causing me to upgrade to more expensive hardware until I had resolved them. It was mostly memory leaks or inefficient IO/websockets. Nothing took longer than a week to fix so far though.

What’s the typical QPS like?

Since it's essentially a one-page app I can do a lot via persistent websocket connections (of which there are usually ~25k concurrently).

It was probably pretty low.

Curious about the $2.5/mo database deal. What is that service?

It's $5 per VPS, each just running a redis database. One is with OVH, the other is with ip-projects.

> Amazon S3 is an object base store.
> Highly durable, 11 9’s of reliability.

My eye twitched when I read this, considering that I remember more than one aws outage that affected s3 availability/reliability.

AWS's own pages mention 11 nines for /durability/. However, I doubt this durability number: this level is almost unachievable by any end customer product. Could it be some clever durability accounting?

I'm quite disappointed in AWS's service status reporting: the service may be disrupted so badly that it cannot serve a single request, yet the status pages either don't mention problems at all or just report "increased error rates".

The quoted 11 nines durability number is saying that you won't lose your data. Availability has nothing to do with it. It seems the OP used the wrong word and should have said "durability" instead of "reliability".

> this level is almost unachievable by any end customer product.

Not sure what your point is here, unless you're highlighting the benefits of using a cloud provider. AWS is not using an "end customer product" to implement S3. They're providing a managed service, and their implementation and management is what allows it to achieve those levels of durability.

Availability and durability are different.

S3 has 11 9's of durability. This means you can expect to lose one out of 100,000,000,000 objects stored in S3, during an unspecified period of time. S3 achieves this by using "erasure codes", distributing chunks across hundreds of hard drives, and quickly replacing lost chunks. Hadoop is a popular storage system that can do erasure coding on your own servers. To make it more durable, you will need many reliable hard drives. But I didn't find any indication that Hadoop detects and replaces lost chunks automatically, so its durability is severely limited compared to S3.

S3 has 4 9's of availability. This means you must expect S3 to be inaccessible 0.0001 of the time, or about an hour per year.
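To make the arithmetic behind the "nines" concrete, here is a minimal Python sketch. The function names are made up for illustration, and the model is deliberately naive: it assumes a 365-day year and treats the nines as a flat probability. Four nines of availability works out to 0.876 hours, or roughly 53 minutes, of expected downtime per year, which matches the "about an hour" estimate above.

```python
# Convert "N nines" into expected downtime or expected object loss.
# Naive model: a flat probability, 365-day year (assumption).
HOURS_PER_YEAR = 365 * 24  # 8760


def downtime_hours_per_year(nines: int) -> float:
    """Expected unavailable hours per year for N nines of availability."""
    unavailable_fraction = 10 ** -nines  # 4 nines -> 0.0001
    return HOURS_PER_YEAR * unavailable_fraction


def expected_objects_lost(stored_objects: int, nines: int) -> float:
    """Expected number of objects lost for N nines of durability."""
    return stored_objects * 10 ** -nines


if __name__ == "__main__":
    hours = downtime_hours_per_year(4)
    print(f"4 nines of availability: {hours:.3f} h/year ({hours * 60:.0f} min)")
    lost = expected_objects_lost(10 ** 11, 11)
    print(f"11 nines of durability, 1e11 objects: {lost:.1f} expected losses")
```

The same function shows why each extra nine matters: three nines allows about 8.76 hours of downtime per year, ten times as much as four nines.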

One important thing it doesn't discuss is the insane AWS costs compared to the other clouds and even regular self-managed solutions running on cheap VPSes and dedicated servers.

Until very recently, cheap VPS providers did not offer private networking between customer VMs, making them a completely different animal.

When you get network level ddos on digital ocean, they can't save you.

This rules out small clouds for us.

So, we use GCP, Azure, and AWS exclusively because of their ability to defy network level ddos.

This is an interesting point. Did you originally use a CSP that couldn't defy network-level DDoS and then switch? I am curious when this became something to explicitly pay more for (e.g. going from AWS Shield Standard to Advanced, or picking a big-3 CSP at a higher price because of DDoS protection), and when the inflection point occurred at which DDoS protection became a serious consideration due to financial or user impact (if it's possible to point to one).

Yes, we used baremetal and digital ocean before that.

Why? Mostly because of cost, and it seemed simpler than building out stuff on cloud services.

Every Friday night, our services got ddosed by our competitors.

The provider would null-route our IPs and our service would go down.

We struggled with it a lot since we were not big enough to afford a premium ddos solution.

Once we realized that big clouds like AWS and GCP do not suffer from this, we had to make the switch.

How do you manage to mitigate a DDOS on a public cloud?

When you say GCP, Azure and AWS have the capacity to defy the ddos, what capacity are you referring to?

Are you talking about actually scaling and serving the bogus requests? Or the capacity to have enough bandwidth and firewall power to fend it off?

I can only talk for AWS, not GCP or Azure, but there are services that can help mitigate DDOS attacks:

Shield (https://aws.amazon.com/shield/) is AWS's DDOS protection service. It's free and is basic protection against L3/4 attacks.

Shield Advanced (same URL as above) is a big step up in price, but gives you access to 'improved protection' and a global response team.

Cloudfront (https://aws.amazon.com/cloudfront) is a CDN with global edge locations.

WAF (https://aws.amazon.com/waf) is AWS's web application firewall service. It's less about DDOS and more about specific application attacks, but is part of the whole solution.

For more detail, you can have a look at AWS's DDOS whitepaper: https://d0.awsstatic.com/whitepapers/Security/DDoS_White_Pap...

The clouds don't do any filtering or mitigation for free. They just have enough bandwidth to pass the attack through to your servers and services. You're just moving the bottleneck here, as now you need to use a cloud DDoS mitigator like Silverline, Cloudflare, or Prolexic.

You probably would have been better off using a cloud mitigator from the start. Their pricing is competitive when you factor in all of the costs.

They do filter the malicious traffic if you use their load balancer. The load balancer is shared across user accounts, so Amazon has to stop the ddos.

They have very effective network level ddos detection/filtering.

Put cloudflare in front then, you can use it with Digital Ocean.

So you're saved then.

Or Akamai’s mitigation services if you can afford it. I’m curious how big that value proposition is these days.

I wonder how much DDoS it takes before a cloud provider starts dropping a customer’s packets now. Do they even bother anymore?

Other VPS providers like DO are toys in the enterprise: they don't offer anything useful. You get a cheap VPS but that's it. DO is good for a home side project.

Unless your product is similar to Dropbox or a side project for fun and giggles, anyone complaining about "insane" AWS costs is building a product with no viable business opportunity, or just doesn't understand all the direct and indirect costs associated with running and maintaining infrastructure in-house.

Clouds are cheap for CPU. They're really expensive for storage and network. Most services still need to store bits and send bits, viable or not.

I disagree that they are cheap for CPU - compared to AWS, GCP and Azure, you get much more for your money with a VPS.

What are you going on about man? Plenty of companies find AWS at various stages of growth and profitability. When a company chooses to trade velocity for margin varies, and with the looming recession, will likely have different patterns. So many startup wannabes have never lived in a funding scarce environment where costs really matter.

And just remember, public clouds are expensive. If you need to handle millions of requests per second then there is a very high probability that you can afford a dedicated infrastructure team and save a ton of money while working just as well.

Just look at the insane(!!!) bandwidth costs...

I'm a bit naive when it comes to infrastructure, since I've never managed it myself. I've always been in companies with dedicated infrastructure :).

Why does Netflix outsource to AWS?

The gripe I've had with dedicated infrastructure teams is that their knowledge/experience can be hit or miss. My current company is about saving money, so sometimes there are junior guys on there. Or there are guys who need to handle 20 different products: they're good at 10 but suck at the other 10. I personally don't believe they have best practices (but I'm not an infrastructure expert), because downtime happens.

Netflix couldn’t predict its load in the early days, so it saved a ton of money using pay-as-you-go rather than building out datacenters.

Nowadays they are such a good advert for AWS they probably negotiated a deal which is way cheaper than the published price.

Also Netflix’ major cost is content; servers are kind of a rounding error.

AFAIK, Netflix does not stream out of AWS, but instead "just" runs their application services on AWS (which is still a significant workload).

I seem to recall they offered a rack full of their own caching equipment to medium and larger ISPs (to save both sides streaming costs while providing better service to Netflix customers). Basically a private app-specific CDN.

It's true. If you open Web Inspector while streaming a Netflix movie, the URL it's getting the stream from probably contains your ISP name and region!

Netflix maintains their own CDN for video content. They place a rack (multiple racks?) in many different datacenters close to the ISP last mile (possibly within some ISP's own datacenters?).


Who knows, but I bet some sweet deal was involved. When Netflix started using AWS it wasn’t a thing; it was a pretty bad datacenter. Even amazon.com wasn’t using AWS (contrary to their marketing material).

Explain how you replicate the features of AWS yourself. There is a reason why OpenStack never really took off.

A Storage team to manage MySQL/Postgres. A Compute team to run Kubernetes/Mesos. A Networking team to manage service discovery, load balancing, structured RPC, etc.

Nothing precludes using AWS/GCP in limited ways, where the value proposition is especially good. Just not for the base capacity in your core infrastructure.

I’m sure this is a great idea at some San Francisco start up full of unicorns who can do everything backed by endless venture capitalist cash.

I have spent a good chunk of my career at a Fortune 10 energy company.

We hired people who specialized in working with SAP, Oracle databases, MSSQL, messaging (think MSFT Lync), email infrastructure, physical servers for web apps, Cisco, desktop provisioning, application development, mobile device management, ticketing software, etc. etc.

It’s very hard and very expensive to acquire the right talent to manage a particular technology, administer it, support it, etc. Every vended software we work with has some company that looks at our name and comes up with some outrageous cost for a few licenses, dev seats, or whatever their business model is.

Acquiring the right quality talent is even harder. You bring them in. You try to teach them about your business so they feel valued and part of something big. They don’t stick around. They want to be at the Googles and Amazons of the world. They leave.

In this scenario, I’d rather use SQS than try to interview, hire, train, and make the hardware/software investment in running an SQS equivalent service in house. I don’t want to hire DBAs and build that profession in my org. It’s not our core business. My customers in the business don’t give a shit about it. They want something fast, reliable, and dependable.

Fundamentally, this argument of "just do it yourself to save money" is short-sighted. There are a lot of factors. If you believe hiring some internal teams is always cheaper than using managed services, you’re in for a world of hurt.

The Googles of the world became such because they decided to invest in homegrown talent. You are pouring millions of dollars into purchasing and maintaining software, and in your own words your customers want something fast, reliable, dependable. Are you so sure software isn't a core part of your business?

Google's core products required them to hire people that could interview, hire, design, build, and deploy software. Along the way, they had many failures and learned a lot of lessons.

If my company’s core products or services are centered on, let’s say, acquiring surface and mineral rights, discovering hydrocarbons, and using the right technology for drilling, it makes practically zero sense for me to now also try to become an expert at maintaining servers, hiring top tier software engineers, etc etc. I don’t have the kinds of problems to keep them engaged nor do I have the budget.

Software is a part of the business insofar that it enables business, but it’s not ultimately what drives profit.

Many of the top tech companies outsource HR software or recruiting entirely, use travel and expense software as a service, Oracle financials for payroll instead of writing it themselves, etc etc. Why? Because they want to invest their money and talent into the core areas that make them a profit.

In your example they're called service teams, but where are those services running? OP said AWS is too expensive, so do it yourself. Your example doesn't describe what OP is asking for: DC infra and managed/rented servers, and the same for networking and storage.

I guess public clouds are mostly useful when you have a sudden peak in visits, or if you need to scale up very quickly.

They say "you need to add monitoring, metrics and logging" at the 500k user mark. That should be happening with user 0 with few exceptions that I can think of.

I feel like this is the reason stuff doesn't get shipped. If you couldn't release v.01 without adding in a robust analytics suite, nothing would ever get released.

Nothing wrong with releasing a .01v and not putting priority on anything but basic functionality until you start to grow. Half of this means you need to build out functionality that doesn't require heavy monitoring, metrics, and logging to get the job done. All that infrastructure stuff is more fun to code when you've blown through the free tier of S3 storage anyway.

Nothing heavy needed. You should know that your thing is working. A basic health check. You should know what errors are happening with basic logging. As soon as you have a paying user, you should have some form of reporting or alerting on this. You should have some basic metric(s) in place very early on like requests per second or sign ups. This should take you in the order of hours, not days, to set up. Nothing heavy. But without basic visibility into your system, you are asking for problems.

What tools would you use to set this up?

Personally, if I were just starting something (assuming non-aws), I would use a free pinging service or write a quick cron to do it. In the event that it did not come back healthy, I would either send myself an email or an sms - the health check can also report error counts or use storage to track number of restarts per unit of time. Logging would just be some basic logging rotation and stdout piped to that file. Basic metrics could be as simple as a metrics table that you can query with SQL. When you eventually need fancy graphs (maybe a bit later), there are lots of options.
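A cron-driven health check along those lines could look roughly like this in Python. This is a minimal sketch, not a recommendation of a specific setup: the `/health` URL, the `check_health` and `fetch_status` names, and the use of `print` as the default alert are all assumptions for illustration. In practice the `alert` callable would send the email or SMS mentioned above.

```python
import urllib.error
import urllib.request
from typing import Callable

HEALTH_URL = "http://localhost:8080/health"  # hypothetical endpoint


def fetch_status(url: str, timeout: float = 5.0) -> int:
    """Return the HTTP status code for url, or 0 if the request fails."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as exc:
        # The server answered, but with an error status (4xx/5xx).
        return exc.code
    except (urllib.error.URLError, OSError):
        # DNS failure, refused connection, timeout, and similar.
        return 0


def check_health(
    url: str,
    fetch: Callable[[str], int] = fetch_status,
    alert: Callable[[str], None] = print,
) -> bool:
    """Probe url once; call alert with a message if it is not healthy."""
    status = fetch(url)
    healthy = status == 200
    if not healthy:
        alert(f"health check failed for {url}: status {status}")
    return healthy


# Example crontab entry (assumed file name), running the check every minute:
#   * * * * * python3 healthcheck.py
# where healthcheck.py ends with: check_health(HEALTH_URL)
```

Passing `fetch` and `alert` in as parameters keeps the check testable without a network, and lets you swap the alert channel later without touching the probing logic.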

I'm not overly familiar with aws services, but I'd be very surprised they don't give a basic good-enough solution to this with cloudwatch, which is even less work than outlined above.

Depends on your stack and requirements (do you want to know about errors ASAP, or is a 2-5 minute delay ok?), but I personally love NewRelic because of how easy it is to set up (and the number of features that it has).

If you want tools that you can manage yourself, then a combination of StatsD + Grafana for metrics, and Sentry for errors. For logs, Graylog if you want to set up something complicated but powerful, and syslog-ng if you just want to dump the logs somewhere (so you can simply grep them).

Most of these tools cost too much to scale past 1M users.

You could write your own service... some thin agent that runs on your boxes and dumps the files every hour to some storage optimized boxes (your data lake)... where another process picks up those files periodically (or by getting a notification that something new is present) and loads them into a Postgres instance (you actually probably want column oriented).

Running every hour, you won’t get up to the second or minute data points. For more critical sections of code, maybe have your code directly log to an agent on the machine that periodically flushes those calls to some logging service.
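The "log to an agent that periodically flushes" idea can be sketched in a few lines of Python. Everything here is hypothetical: the `BufferedLogAgent` name, the batch size, and the age threshold are illustrative, and the `sink` callable stands in for whatever logging service or storage box receives the batches.

```python
import time
from typing import Callable, List


class BufferedLogAgent:
    """Collects log lines in memory and hands them to a sink in batches."""

    def __init__(
        self,
        sink: Callable[[List[str]], None],
        max_batch: int = 100,
        max_age_seconds: float = 5.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._sink = sink
        self._max_batch = max_batch
        self._max_age = max_age_seconds
        self._clock = clock
        self._buffer: List[str] = []
        self._last_flush = clock()

    def log(self, line: str) -> None:
        """Buffer a line; flush if the batch is full or old enough."""
        self._buffer.append(line)
        too_big = len(self._buffer) >= self._max_batch
        too_old = self._clock() - self._last_flush >= self._max_age
        if too_big or too_old:
            self.flush()

    def flush(self) -> None:
        """Send buffered lines to the sink (if any) and reset the timer."""
        if self._buffer:
            self._sink(list(self._buffer))
            self._buffer.clear()
        self._last_flush = self._clock()
```

Note that in this sketch the age check only runs when `log` is called, so a quiet process would also need a timer or a final `flush()` on shutdown to avoid holding lines indefinitely.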

1. Collectd and Graphite serve this well or can be modified

2. Nine of these commercial services gives you per second granularity that I have seen


#if there are 9 i'd like to know of them :-)

Recently I tried logz.io. On the free plan you get 3 days of retention, 2 GB of logs daily, and automatic notifications (insights: your syslog is parsed and common errors are recognized). They also have one-click setup of visualisations in Kibana.

This is a written summary (like all High Scalability articles) of this video: https://www.youtube.com/watch?v=vg5onp8TU6Q

Some of this advice seems like wildly premature optimization. Splitting into multiple hosts and/or using a database service at 10 users?

At 10 users, you should be focusing on growth, not fussing with your tech stack.

I wrote the first version of this deck back in 2013: https://www.slideshare.net/AmazonWebServices/scaling-on-aws-... and it has since gone on to become one of the most delivered talks by AWS across the world. I thought I'd pop into this thread and answer some questions/help provide guidance as to why it exists. Note that the first deck was mostly tossed together while on an Amtrak from NYC to Philly early one morning, so excuse the typos and graphics glitches.

Generally speaking we've/I've seen a number of customers struggle with pre-optimization of their infrastructure and/or just not knowing what they should think about and when, when it comes to scale and architecture patterns. Think of this deck as an 80/20 rule for the most general and basic of people. Also, 2013 was a very very different world in the cloud and in infrastructure than today.

This deck was all pre-Docker/k8s, pre-Lambda/FaaS, pre-alot-of-cool-db-tech. However as a general starting point a lot of it still holds just fine. You should probably start with a relational DB if you have a more traditional background using them. You probably don't need autoscaling Day 1 (but it's nice via things like ElasticBeanstalk).

Someone commented that metrics/logging/monitoring is a Day 0 thing, and it absolutely is, but it is the kind of thing most companies skimp on until very late in the game. Not pushing for excellence in this area will block you from success down the line. The same goes for hiring someone with dedicated Ops/SRE responsibilities, and/or making sure someone on the team has the right experience in that space.

DB growth/scaling is now a lot more transparent than it used to be, but I still see people doing less than ideal things with them. This is an area the industry needs to be sharing more best practices on (while still respecting some of the best practices from decades of SQL).

On costs: today some of the biggest sites in the world run on a public cloud. They do that for many reasons, but cost effectiveness at scale is pretty well proven when measured the right way. Most math ignores people and opportunity cost, and that's what gets you more than metal and power. There is also now the wealth of managed services that equate to replacement of people with a returned value greater than the people cost (essentially the ROI on managed services greatly outpaces the value of doing it yourself). The SaaS ecosystem is also vastly larger than it was in 2013. I hope to never run a logging cluster, monitoring systems, backup tools, etc., myself again.

Anyway, glad to see this deck still kicks around and that people are discussing it. Happy to try and answer more questions on this. - munns@

A glaring omission in AWS's offerings is a NewSQL database that lets you start with SQL, have redundancy, and scale horizontally while having consistent reads the whole time. Scaling to 10M+ users and then switching datastores makes for good story writeups because it's hard. It's also avoidable.

Doesn't Aurora fit that bill?

Aurora has good replication but it's still just master/slaves rather than sharded. DynamoDB is sharded but nosql.

Thanks, wow, I guess I just never dug deep enough past the Aurora marketing, which I see now actually talks about its "distributed storage system".

I don't think it's necessarily a good idea to use SQL for everything. If you are constantly inserting records and don't need relational structures, DynamoDB might be a better choice. DynamoDB scales quickly and cheaply, so it's potentially the preferable option if it works in your architecture.

It's relatively easier to iterate on your data storage with relational tables though. DynamoDB is great if you are sure a component just needs a k/v store, and you are sure you will never want to do a join.

I think the guide does not work when you have fewer users and more data, for instance if you are a B2B SaaS provider. I think the way you architect an infrastructure should be driven by the value of the user.


Does anybody have an equally short summary of any differences today?

By the way, there are links below the text, here's the one to the HN discussion of 2016: https://news.ycombinator.com/item?id=10885727

Today we use

For database, we use RDS/Dynamodb

Redis for cache.

Dynamodb is better in cases where we want to localize the latency of our regional lambdas.

RDS for everything else like dashboard entity storage etc..

CloudWatch prints logs, Kinesis takes the logs to S3 where they're transformed in batches with Lambda, then the data is moved to Redshift. Redshift for stats/reports.

Converted whole ad network to Serverless. Used Rust Lambda runtime for CPU intensive tasks.

Using Go for the rest of the Lambdas.

I love Go and Rust, and optimizing just one Lambda at a time brought the joy of programming back into my life.

Used Apex+Terraform to manage the whole infra for the ad network.

We managed to deploy Lambda in all AWS regions, ensuring minimum latency for the ad viewers.

What used to take a team of over 50 people (tech side only) now takes 10 people to run the whole ad network.

Network is doing 20M profit per year/9 billion clicks per day.

It's my greatest achievement so far because we made it from scratch without any investor money.

But on the other side, we'll have to shrink our team size next year as growth opportunity is limited and we want to optimize efficiency further.

Pretty awesome!

I'm currently planning to write a book about AWS. It should teach people how to build their MVPs without restricting themselves in the future.

Are you available for an interview in January?

I've been looking for a book on this topic for a while now!

Some people told me they were searching for this.

I think I'll put up a small splash page for email gathering in the next few days to keep people up to date :)

I'll start gathering information in January, feel free to share this form.

Did you mean 9 billion clicks or impressions daily?

A 50-person team to run an ad network on the tech side only? I am really curious why it took that many people before going to Lambdas. We are in the adtech space also, and there is a 5-person team (on-call ops+2 devs) to run our own data collection, third-party data providers, RTB doing half a million QPS, and our own ad servers doing hundreds of millions of impressions daily.

Sounds really interesting, kudos for building a profitable business from scratch. I have no experience with Redshift; we mostly use the ELK stack, so Kibana to do all the log analysis. Is Redshift significantly better?

Using redshift for metrics, mostly OLAP.

Think about drilldown to 3 level, based on device, os, placement, country, ISP etc... along with click stats per variable.

I've never used elastic search for this.

Before that we used BigQuery, but every query takes at least 2 seconds.

So we had to move to a dedicated redshift cluster.

Make your next project getting off of AWS and you'll save enough money to keep people on your team. :)

So what is the alternative? Maintaining your own infrastructure like we did before "cloud" providers, i.e. your own dedicated servers in managed locations unless you were huge enough to have your own locations? Or just a different cloud provider? It is hard to check if your suggestion is any better since you only say "don't do that", but not what else to do instead...

What's your definition of huge? Just curious as it's still really cheap to rent racks even in top tier datacenters.

> What's your definition of huge?

Quoting myself:

> unless you were huge enough to have your own locations

Those locations are millions, sometimes hundreds of millions of dollars investments with backup power generators large enough to provide power to a comfortably sized village. So, "large enough to a) need and b) able to afford owning such a location just for your own needs", e.g. Google, Amazon. Even companies like large banks have their servers co-hosted in a separate section but in the location owned by a 3rd party co-hosting provider. To own one you either are one of those providers or you are in the "Google tier". For the purposes of the current context, the linked article, one would even need to have multiple such locations all over the world. I think that qualifies as "huge" (the company owning such infrastructure just to run their own servers, co-hosting firms do it for others).

You don't need to build and run your own datacenter to self-host. That's just ridiculous to think that's a requirement. Colo is more than fine.

We did go that route as well in the past, but costs were insane, and experienced talent is hard to find and doesn't come cheap.

Cloud has talent working in the background on AWS's payroll; they have a better ability to hire at scale than we do.

So we decided to use them, and no, we don't regret it. It's a more reliable cost than hiring and managing a team, which might prove to be less reliable.

They do filter the malicious traffic if you use their load balancer.

The load balancer is shared across user accounts, so Amazon has to stop the ddos.

They have very effective network level ddos detection/filtering.

I forgot to mention, lambda model is very easy to reason about and costs can be forecasted with more accuracy than running a VM.

Once you have setup Lambda in one region

You just need to loop through the list of regions and deploy your lambda in ALL AVAILABLE REGIONS. Yes, it's that simple!
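The "loop through the list of regions" step might be sketched as below in Python. This is an illustration only: `deploy_everywhere` and the `deploy_to_region` callable are hypothetical names, and the actual deployment call (via whatever tooling you use) is deliberately left to the caller rather than guessed at here.

```python
from typing import Callable, Dict, Iterable


def deploy_everywhere(
    regions: Iterable[str],
    deploy_to_region: Callable[[str], None],
) -> Dict[str, str]:
    """Run the same deployment in every region and record the outcome.

    deploy_to_region is a caller-supplied function (hypothetical) that
    performs the real deployment for one region and raises on failure.
    """
    results: Dict[str, str] = {}
    for region in regions:
        try:
            deploy_to_region(region)
            results[region] = "ok"
        except Exception as exc:
            # Keep going so one bad region does not block the rest.
            results[region] = f"failed: {exc}"
    return results
```

Collecting per-region results instead of stopping at the first error makes it easier to retry only the regions that failed.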

API Gateway doesn't charge for 4xx responses, so it's very good for defying level 7 ddos too.

Add Cognito and use a Lambda authorizer; it generates API keys and emails them to your users.

Add latency-based DNS routing using Route 53 on top and you ensure minimum latency in all regions!

What happens if someone sends a layer 7 DDoS that you do respond to with a 200? Or 300-level?

Then you're screwed I suppose. To do that they'd likely be performing some sort of replay attack, in which case you should be mitigating against this. There's no magic bullet anywhere.

Is this still valid?

Yeah absolutely. Things change fast but not that fast :-)

That said it's not a perfect guide. You need to adjust based on area of expertise and comfort, and also any business requirements. For example in most businesses protecting the database is important, so running only one instance with no backups/snapshotting would be bad, reckless even. If it's just a hobby project tho, sure. I've done it several times just to deploy a simple POC that can be actually used.

Is there anything specific you had a question about?

Offtopic: I've never worked on anything that has > 100 (200 on good days) users. When you have > 10k or 100k users, what really changes? Do you have to rewrite everything? Do you spin up more machines? Profile & optimise any/all bottlenecks to eke out more perf? I'm really curious, as I'm not sure what something of that scale looks like. I always imagine those are always rewritten in C/C++/Java or Rust (in recent years) or have a very special architecture etc. :)

haha, nope I've scaled ruby on rails apps (ruby is one of my favorite languages so I say this with love: it is fat and resource hungry) to huge levels. Basically:

> Do you spin up more machines?

Is correct. We make sure that the app itself is stateless (very important for horizontal scalability), and then I set up auto-scaling groups in AWS (or more recently actually, pods in kubernetes that auto-scale). For database I use sharding, tho lately I've gotten away from sharding because it complicates the database queries and you would be amazed at how far you can vertically scale an RDS Aurora instance.
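As a rough illustration of what "stateless" means here, the Python sketch below keeps all per-user state in a shared store rather than in the app process, so any instance can serve any request. The names are hypothetical, and `InMemoryStore` is only a stand-in for an external store such as Redis or a database.

```python
from typing import Dict, Protocol


class SessionStore(Protocol):
    """Minimal interface the handler needs from a shared store."""

    def get(self, key: str) -> int: ...

    def set(self, key: str, value: int) -> None: ...


class InMemoryStore:
    """Stand-in for an external store (assumption: Redis or similar)."""

    def __init__(self) -> None:
        self._data: Dict[str, int] = {}

    def get(self, key: str) -> int:
        return self._data.get(key, 0)

    def set(self, key: str, value: int) -> None:
        self._data[key] = value


def handle_request(store: SessionStore, session_id: str) -> int:
    """Count requests per session without keeping state in the process.

    Because nothing is remembered between calls except what is in the
    store, the same function can run on any number of app instances.
    The get-then-set pair is not atomic; a real store would use an
    atomic increment instead.
    """
    count = store.get(session_id) + 1
    store.set(session_id, count)
    return count
```

With this shape, an auto-scaling group or a set of pods can add or remove instances freely, since no request depends on landing on the instance that served the previous one.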

You do have to profile expensive database queries tho. Basically if your app is stateless and your database queries are performant, you can scale really far by just adding instances (more machines).

That said, there is a point where re-writing is attractive: cost. Adding these instances is expensive. There was a great blog post that I'll try to find where a service in RoR had an app server for every 400 to 500 concurrent users, and rewrote in Go and was able to use one app server for every 10,000 concurrent users.
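To put rough numbers on that cost argument, the Python sketch below computes how many app servers each per-server capacity implies. The 100,000 concurrent users figure is an assumption chosen for illustration; the per-server capacities are the ones recalled above. Under those assumptions the count drops from 200 to 250 servers down to 10.

```python
def app_servers_needed(concurrent_users: int, users_per_server: int) -> int:
    """Number of app servers needed, rounding up to whole servers."""
    if users_per_server <= 0:
        raise ValueError("users_per_server must be positive")
    # Integer ceiling division avoids floating-point rounding.
    return (concurrent_users + users_per_server - 1) // users_per_server


if __name__ == "__main__":
    users = 100_000  # assumed load, for illustration only
    for label, capacity in [("RoR (low)", 400), ("RoR (high)", 500), ("Go", 10_000)]:
        print(f"{label}: {app_servers_needed(users, capacity)} servers")
```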

wow! Thanks for taking the time to explain. Your idea of stateless apps gave me a new perspective, something to try on next project. The article you mention should also be an interesting read. :)

> Things change fast but not that fast

Things barely ever change - engineers just love solving the exact same problems over and over again ;)

I'm planning to use it as a roadmap of how many things I should learn :)
