Here's some anecdata from someone serving 1.5 million monthly users with a realtime (chat etc.) web app. Disregarding the file serving functionality, this is the entire infrastructure:
- The frontend is running on a 30 bucks/mo VPS with cycles and RAM to spare.
- There are two DB servers at 5 bucks/mo for redundancy.
This thing would scale pretty linearly in theory, so for 10 million users just add like 4 servers.
So about 200 bucks / mo to run your average startup's web app at 10 mil users.
Alternatively write bad software and throw 50k/mo at AWS I guess.
There were some spikes that brought the site down. A notable one was when I submitted the site to HN back when that kind of traffic was still a lot. You learn something each time...
It's a node app that tries to keep things simple, i.e. no big frameworks. I put thought into performance as well and it has seen a significant number of re-writes. I didn't just get there from nothing. The project is about 6 years old.
Edit: I should add that I ran into performance hurdles now and then, causing me to upgrade to more expensive hardware until I had resolved them. It was mostly memory leaks or inefficient IO/websockets. Nothing took longer than a week to fix so far though.
> Amazon S3 is an object base store.
> Highly durable, 11 9’s of reliability.
My eye twitched when I read this, considering that I remember more than one aws outage that affected s3 availability/reliability.
AWS's own pages mention 11 nines for /durability/. However, I doubt this durability number; this level is almost unachievable by any end customer product. Could it be some clever durability accounting?
I'm quite disappointed in AWS's service status reporting: the service may be disrupted so badly that it can't serve a single request, yet the status pages don't mention problems at all, or just report "increased error rates".
The quoted 11 nines durability number is saying that you won't lose your data. Availability has nothing to do with it. It seems the OP used the wrong word and should have said "durability" instead of "reliability".
> this level is almost unachievable by any end customer product.
Not sure what your point is here, unless you're highlighting the benefits of using a cloud provider. AWS is not using an "end customer product" to implement S3. They're providing a managed service, and their implementation and management is what allows it to achieve those levels of durability.
S3 has 11 9's of durability. This means you can expect to lose one out of 100,000,000,000 objects stored in S3, during an unspecified period of time. S3 achieves this by using "erasure codes", distributing chunks across hundreds of hard drives, and quickly replacing lost chunks. Hadoop is a popular storage system that can do erasure coding on your own servers. To make it more durable, you will need many reliable hard drives. But I didn't find any indication that Hadoop detects and replaces lost chunks automatically, so its durability is severely limited compared to S3.
S3 has 4 9's of availability. This means you must expect S3 to be inaccessible 0.0001 of the time, or about an hour per year.
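If it helps, the arithmetic behind both numbers is quick to check. A back-of-the-envelope sketch in Python, reading the 11 nines as an annual per-object loss probability (which is how AWS frames the design target):

    objects_stored = 100_000_000             # say you store 100 million objects
    annual_loss_probability = 1e-11          # 99.999999999% durability
    print(objects_stored * annual_loss_probability)   # 0.001 -> ~one lost object per 1,000 years

    seconds_per_year = 365 * 24 * 3600
    downtime_seconds = seconds_per_year * (1 - 0.9999)   # 4 nines of availability
    print(downtime_seconds / 60)             # ~52.6 minutes of expected downtime per year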
One important thing it doesn't discuss is the insane AWS costs compared to the other clouds and even regular self-managed solutions running on cheap VPSes and dedicated servers.
This is an interesting point. Did you originally use a CSP without network-level DDoS protection and then switch? I'm curious when this became a variable to consider as something to explicitly pay more for (e.g. going from AWS Shield Standard to Advanced, or picking a big-3 CSP at a higher price because of its DDoS protection), and when the business hit the inflection point where DDoS protection became a serious consideration due to financial impact/user impact (if it's possible to point to one).
WAF (https://aws.amazon.com/waf) is AWS's web application firewall service. It's less about DDoS and more about specific application attacks, but is part of the whole solution.
The clouds don't do any filtering or mitigation for free. They just have enough bandwidth to pass the attack through to your servers and services. You're just moving the bottleneck here, as now you need to use a cloud DDoS mitigator like Silverline, Cloudflare, or Prolexic.
You probably would have been better off using a cloud mitigator from the start. Their pricing is competitive when you factor in all of the costs.
Unless your product is similar to Dropbox or a side project for fun and giggles, anyone complaining about "insane" AWS costs is building a product with no viable business opportunity, or just doesn't understand all the direct and indirect costs associated with running and maintaining infrastructure in-house.
What are you going on about, man? Plenty of companies find AWS too expensive at various stages of growth and profitability. When a company chooses to trade velocity for margin varies, and with the looming recession, those patterns will likely change. So many startup wannabes have never lived in a funding-scarce environment where costs really matter.
And just remember, public clouds are expensive. If you need to handle millions of requests per second then there is a very high probability that you can afford a dedicated infrastructure team and save a ton of money while working just as well.
I'm a bit naive when it comes to infrastructure, since I've never managed it myself. I've always been in companies with dedicated infrastructure :).
Why does Netflix outsource to AWS?
The gripe I've had with dedicated infrastructure teams is that their knowledge/experience can be hit or miss. My current company is about saving money, so sometimes there are junior guys on there. Or there are guys who need to handle 20 different products; they're good at 10 but suck at the other 10. I personally don't believe they have best practices (but I'm not an infrastructure expert), because downtime happens.
AFAIK, Netflix does not stream out of AWS, but instead "just" runs their application services on AWS (which is still a significant workload).
I seem to recall they offered a rack full of their own caching equipment to medium and larger ISPs (to save both sides streaming costs while providing better service to Netflix customers). Basically a private app-specific CDN.
Netflix maintains their own CDN for video content. They place a rack (multiple racks?) in many different datacenters close to the ISP last mile (possibly within some ISP's own datacenters?).
Who knows, but I bet some sweet deal was involved. When Netflix started using AWS it wasn't a thing; it was a pretty bad datacenter. Even amazon.com wasn't using AWS (contrary to their marketing material).
A Storage team to manage MySQL/Postgres. A Compute team to run Kubernetes/Mesos. A Networking team to manage service discovery, load balancing, structured RPC, etc.
Nothing precludes using AWS/GCP in limited ways, where the value proposition is especially good. Just not for the base capacity in your core infrastructure.
I’m sure this is a great idea at some San Francisco start up full of unicorns who can do everything backed by endless venture capitalist cash.
I have spent a good chunk of my career at a Fortune 10 energy company.
We hired people who specialized in working with SAP, Oracle databases, MSSQL, messaging (think MSFT Lync), email infrastructure, physical servers for web apps, Cisco, desktop provisioning, application development, mobile device management, ticketing software, etc. etc.
It’s very hard and very expensive to acquire the right talent to manage a particular technology, administer it, support it, etc. Every piece of vended software we work with has some company that looks at our name and comes up with an outrageous cost for a few licenses, dev seats, or whatever their business model is.
Acquiring the right quality talent is even harder. You bring them in. You try to teach them about your business so they feel valued and part of something big. They don’t stick around. They want to be at the Googles and Amazons of the world. They leave.
In this scenario, I’d rather use SQS than try to interview, hire, train, and make the hardware/software investment in running an SQS equivalent service in house. I don’t want to hire DBAs and build that profession in my org. It’s not our core business. My customers in the business don’t give a shit about it. They want something fast, reliable, and dependable.
Fundamentally, this argument of just doing it yourself to save money is short-sighted. There are a lot of factors. If you believe hiring some internal teams is always cheaper than using managed services, you’re in for a world of hurt.
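For what it's worth, the build-vs-buy gap is visible right in the code. The in-house option means operating and monitoring a queueing service; the managed option is roughly this much application code. A sketch with boto3, where the queue name and message body are made up:

    import boto3

    sqs = boto3.client("sqs")

    # "orders" is a placeholder queue name; create_queue is idempotent if the attributes match.
    queue_url = sqs.create_queue(QueueName="orders")["QueueUrl"]

    # Producer side
    sqs.send_message(QueueUrl=queue_url, MessageBody='{"order_id": 42}')

    # Consumer side: long-poll for up to 10 seconds, then delete what was processed
    resp = sqs.receive_message(QueueUrl=queue_url, MaxNumberOfMessages=1, WaitTimeSeconds=10)
    for msg in resp.get("Messages", []):
        print(msg["Body"])
        sqs.delete_message(QueueUrl=queue_url, ReceiptHandle=msg["ReceiptHandle"])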
The Googles of the world became such because they decided to invest in homegrown talent. You are pouring millions of dollars into purchasing and maintaining software, and in your own words your customers want something fast, reliable, and dependable. Are you so sure software isn't a core part of your business?
Google's core products required them to hire people who could interview, hire, design, build, and deploy software. Along the way, they had many failures and learned a lot of lessons.
If my company’s core products or services are centered on, let’s say, acquiring surface and mineral rights, discovering hydrocarbons, and using the right technology for drilling, it makes practically zero sense for me to now also try to become an expert at maintaining servers, hiring top-tier software engineers, etc. I don’t have the kinds of problems to keep them engaged, nor do I have the budget.
Software is a part of the business insofar that it enables business, but it’s not ultimately what drives profit.
Many of the top tech companies outsource HR software or recruiting entirely, use travel and expense software as a service, Oracle Financials for payroll instead of writing it themselves, etc. Why? Because they want to invest their money and talent into the core areas that make them a profit.
In your example it's called service teams, but where are those services running? OP said AWS is too expensive, so do it yourself; your example doesn't describe what OP is asking about: DC infra, managed/rented servers, and the same for networking/storage.
They say "you need to add monitoring, metrics and logging" at the 500k user mark. That should be happening with user 0 with few exceptions that I can think of.
I feel like this is the reason stuff doesn't get shipped. If you couldn't release v.01 without adding in a robust analytics suite, nothing would ever get released.
Nothing wrong with releasing a .01v and not putting priority on anything but basic functionality until you start to grow. Half of this means you need to build out functionality that doesn't require heavy monitoring, metrics, and logging to get the job done. All that infrastructure stuff is more fun to code when you've blown through the free tier of S3 storage anyway.
Nothing heavy needed. You should know that your thing is working. A basic health check. You should know what errors are happening with basic logging. As soon as you have a paying user, you should have some form of reporting or alerting on this. You should have some basic metric(s) in place very early on like requests per second or sign ups. This should take you in the order of hours, not days, to set up. Nothing heavy. But without basic visibility into your system, you are asking for problems.
Personally, if I were just starting something (assuming non-AWS), I would use a free pinging service or write a quick cron to do it. In the event that it did not come back healthy, I would either send myself an email or an SMS; the health check can also report error counts or use storage to track the number of restarts per unit of time. Logging would just be basic log rotation with stdout piped to a file. Basic metrics could be as simple as a metrics table that you can query with SQL. When you eventually need fancy graphs (maybe a bit later), there are lots of options.
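To make "a quick cron" concrete, here is roughly the whole thing sketched in Python (the URL and address are placeholders, and it assumes an MTA on localhost; otherwise point it at your SMTP host or an SMS gateway):

    #!/usr/bin/env python3
    # Minimal health check meant to run from cron, e.g. */5 * * * *.
    import smtplib
    import urllib.request
    from email.message import EmailMessage

    HEALTH_URL = "https://example.com/healthz"   # placeholder endpoint
    ALERT_TO = "me@example.com"                  # placeholder address

    def check():
        try:
            with urllib.request.urlopen(HEALTH_URL, timeout=10) as resp:
                if resp.status != 200:
                    return "health check returned HTTP %d" % resp.status
        except Exception as exc:  # timeout, DNS failure, connection refused, ...
            return "health check failed: %s" % exc
        return None

    def alert(problem):
        msg = EmailMessage()
        msg["Subject"] = "[ALERT] %s: %s" % (HEALTH_URL, problem)
        msg["From"] = ALERT_TO
        msg["To"] = ALERT_TO
        msg.set_content(problem)
        with smtplib.SMTP("localhost") as smtp:  # assumes a local MTA
            smtp.send_message(msg)

    if __name__ == "__main__":
        problem = check()
        if problem:
            alert(problem)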
I'm not overly familiar with AWS services, but I'd be very surprised if they don't give a basic good-enough solution to this with CloudWatch, which is even less work than outlined above.
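It does. Standard EC2/RDS metrics show up in CloudWatch automatically, and pushing a custom data point is a single call; alarms are configured on top of whatever you publish. A minimal sketch with boto3 (the namespace and metric name are placeholders):

    import boto3

    cloudwatch = boto3.client("cloudwatch")

    # Publish one custom data point, e.g. on every successful sign-up.
    cloudwatch.put_metric_data(
        Namespace="MyApp",
        MetricData=[{"MetricName": "SignUps", "Value": 1, "Unit": "Count"}],
    )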
Depends on your stack and requirements (do you want to know about errors ASAP, or is a 2-5 minute delay ok?), but I personally love NewRelic because of how easy it is to set up (and the number of features that it has).
If you want tools that you can manage yourself, then a combination of StatsD + Grafana for metrics, and Sentry for errors. For logs, Graylog if you want to set up something complicated but powerful, and syslog-ng if you just want to dump the logs somewhere (so you can simply grep them).
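Part of why StatsD works so well here is that the client side is trivial: fire-and-forget UDP, so it can't slow down or break the app even if the metrics box disappears. A sketch without any client library (metric names are made up; assumes statsd listening on the default localhost:8125):

    import socket
    import time

    # StatsD accepts plain-text UDP datagrams, by default on port 8125.
    STATSD_ADDR = ("127.0.0.1", 8125)
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

    def incr(name, value=1):
        sock.sendto(("%s:%d|c" % (name, value)).encode(), STATSD_ADDR)         # counter

    def timing(name, millis):
        sock.sendto(("%s:%d|ms" % (name, int(millis))).encode(), STATSD_ADDR)  # timer

    start = time.monotonic()
    # ... handle a request ...
    incr("myapp.requests")
    timing("myapp.request_time", (time.monotonic() - start) * 1000)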
You could write your own service... some thin agent that runs on your boxes and dumps the files every hour to some storage-optimized boxes (your data lake)... where another process picks up those files periodically (or by getting a notification that something new is present) and loads them into a Postgres instance (you actually probably want something column-oriented).
Running every hour, you won’t get up to the second or minute data points. For more critical sections of code, maybe have your code directly log to an agent on the machine that periodically flushes those calls to some logging service.
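A rough sketch of the loading half of that, assuming the agent just drops plain-text files into a spool directory (the table name, DSN, and paths are all placeholders, and a column-oriented store would eventually replace the vanilla Postgres here):

    import glob
    import os

    import psycopg2  # assumes Postgres is reachable and a raw_logs(line text) table exists

    def load_hourly_dumps(dump_dir="/var/spool/myagent"):
        # Pick up whatever the agent has dumped since the last run and bulk-load it.
        conn = psycopg2.connect("dbname=analytics")  # placeholder DSN
        try:
            for path in glob.glob(os.path.join(dump_dir, "*.log")):
                with conn, conn.cursor() as cur, open(path) as f:
                    # COPY is far faster than row-by-row INSERTs for bulk loads.
                    cur.copy_expert("COPY raw_logs (line) FROM STDIN", f)
                os.remove(path)  # or move it to the archive / "data lake" instead
        finally:
            conn.close()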
Recently I tried logz.io
In the free plan you have 3 days of retention, 2 GB of logs daily, and automatic notifications (insights: your syslog is parsed and common errors are recognized). And they have one-click setup of visualisations in Kibana.
I wrote the first version of this deck back in 2013: https://www.slideshare.net/AmazonWebServices/scaling-on-aws-... and it has since gone on to become one of the most delivered talks by AWS across the world. I thought I'd pop into this thread and answer some questions/help provide guidance as to why it exists. Note that the first deck was mostly tossed together on an Amtrak from NYC to Philly early one morning, so excuse the typos and graphics glitches.
Generally speaking we've/I've seen a number of customers struggle with pre-optimization of their infrastructure and/or just not knowing what they should think about and when, when it comes to scale and architecture patterns. Think of this deck as an 80/20 rule for the most general and basic of people. Also, 2013 was a very very different world in the cloud and in infrastructure than today.
This deck was all pre-Docker/k8s, pre-Lambda/FaaS, pre-alot-of-cool-db-tech. However as a general starting point a lot of it still holds just fine. You should probably start with a relational DB if you have a more traditional background using them. You probably don't need autoscaling Day 1 (but it's nice via things like ElasticBeanstalk).
Someone commented that metrics/logging/monitoring is a Day 0 thing, and it absolutely is, but it is the kind of thing most companies skimp on until very late in the game. Not pushing for excellence in this area will block you from success down the line. The same goes for hiring someone with dedicated Ops/SRE responsibilities, and/or making sure someone on the team has the right experience in that space.
DB growth/scaling is now a lot more transparent than it used to be, but I still see people doing less than ideal things with them. This is an area the industry needs to be sharing more best practices on (while still respecting some of the best practices from decades of SQL).
On costs: today some of the biggest sites in the world run on a public cloud. They do that for many reasons, but cost effectiveness at scale is pretty well proven when measured the right way. Most math ignores people and opportunity cost, and that's what gets you more than metal and power. There is also today a wealth of managed services that amount to replacing people, with a returned value greater than the people cost (essentially the ROI on managed services greatly outpaces the value of doing it yourself). The SaaS ecosystem is also vastly larger than it was in 2013. I hope to never run a logging cluster, monitoring systems, backup tools, etc., myself again.
Anyway, glad to see this deck still kicks around and that people are discussing it. Happy to try and answer more questions on this. - munns@
A glaring omission in AWS offerings is a NewSQL database that lets you start with SQL, have redundancy, and scale horizontally while keeping consistent reads the whole time. Scaling to 10M+ users and then switching datastores makes for good story writeups because it's hard. It's also avoidable.
I don't think it's necessarily a good idea to use SQL for everything. If you are constantly inserting records and don't need relational structures, DynamoDB might be a better choice. DynamoDB scales quickly and cheaply, so it's potentially the preferable option if it works in your architecture.
It's relatively easier to iterate your data storage with relational tables though. DynamoDB is great if you are sure a component just needs a k/v store, and you are sure you will never want to do a join.
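The trade-off shows up right in the API. A sketch with boto3, where the table and key names are made up:

    import boto3

    # Placeholder table with a composite key (pk, sk).
    table = boto3.resource("dynamodb").Table("events")

    # Writes scale out with no schema migration in sight...
    table.put_item(Item={"pk": "user#123", "sk": "click#2019-06-01T12:00:00Z", "path": "/home"})

    # ...but reads are by key (or key condition via Query). Anything join-like means
    # denormalising up front, which is exactly the "will you ever want a join" question.
    resp = table.get_item(Key={"pk": "user#123", "sk": "click#2019-06-01T12:00:00Z"})
    print(resp.get("Item"))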
I think the guide does not work when you have fewer users and more data, for instance if you are a B2B SaaS provider. I think the way you architect an infrastructure should be driven by the value of the user.
DynamoDB is better in cases where we want to localize the latency of our regional Lambdas.
RDS for everything else, like dashboard entity storage, etc.
CloudWatch prints logs, Kinesis takes the logs to S3 where they're transformed in batches with Lambda, then the data is moved to Redshift (see the sketch further down).
Redshift for stats/reports.
Converted whole ad network to Serverless.
Used Rust Lambda runtime for CPU intensive tasks.
Using Go for the rest of the Lambdas.
I love Go and Rust, and optimizing just one Lambda at a time brought back the programming joy in my life.
Used Apex + Terraform to manage the whole infra for the ad network.
We managed to deploy Lambda in all AWS regions, ensuring minimum latency for the ad viewers.
The thing that used to take a team of over 50 people (tech side only) now takes 10 people to run the whole ad network.
Network is doing 20M profit per year/9 billion clicks per day.
It's my greatest achievement so far because we made it from scratch without any investor money.
But on the other side, we'll have to shrink our team size next year as the growth opportunity is limited and we want to optimize efficiency further.
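Re the CloudWatch -> Kinesis -> S3 -> Lambda -> Redshift pipeline above: the Lambda transform step can stay small. A rough sketch in Python (the bucket name and key layout are made up, and a real transform would do more than concatenate records before the load into Redshift picks them up):

    import base64
    import gzip
    import os

    import boto3

    s3 = boto3.client("s3")
    BUCKET = os.environ.get("TARGET_BUCKET", "my-log-bucket")  # placeholder bucket

    def handler(event, context):
        # Kinesis hands records to Lambda base64-encoded; decode and batch them
        # into one gzipped S3 object for the downstream load into Redshift.
        lines = []
        for record in event["Records"]:
            lines.append(base64.b64decode(record["kinesis"]["data"]).decode("utf-8"))
        key = "logs/%s.gz" % context.aws_request_id
        s3.put_object(Bucket=BUCKET, Key=key, Body=gzip.compress("\n".join(lines).encode("utf-8")))
        return {"records_written": len(lines)}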
Did you mean 9 billion clicks or impressions daily?
A 50-person team to run an ad network on the tech side only? I am really curious why it took that many people before going to Lambdas. We are in the adtech space also, and there is a 5-person team (on-call ops + 2 devs) to run our own data collection, third-party data providers, RTB doing half a million QPS, and our own ad servers doing hundreds of millions of impressions daily.
Sounds really interesting, kudos for building a profitable business from scratch. I have no experience with Redshift; we mostly use the ELK stack, so Kibana does all the log analysis. Is Redshift significantly better?
So what is the alternative? Maintaining your own infrastructure like we did before "cloud" providers, i.e. your own dedicated servers in managed locations unless you were huge enough to have your own locations? Or just a different cloud provider? It is hard to check if your suggestion is any better since you only say "don't do that", but not what else to do instead...
> unless you were huge enough to have your own locations
Those locations are millions, sometimes hundreds of millions of dollars in investment, with backup power generators large enough to provide power to a comfortably sized village. So, "large enough to a) need and b) be able to afford owning such a location just for your own needs", e.g. Google, Amazon. Even companies like large banks have their servers co-hosted in a separate section of a location owned by a 3rd-party co-hosting provider. To own one you either are one of those providers or you are in the "Google tier". For the purposes of the current context, the linked article, one would even need multiple such locations all over the world. I think that qualifies as "huge" (the company owning such infrastructure just to run its own servers; co-hosting firms do it for others).
Then you're screwed I suppose. To do that they'd likely be performing some sort of replay attack, in which case you should be mitigating against this. There's no magic bullet anywhere.
Yeah absolutely. Things change fast but not that fast :-)
That said it's not a perfect guide. You need to adjust based on area of expertise and comfort, and also any business requirements. For example in most businesses protecting the database is important, so running only one instance with no backups/snapshotting would be bad, reckless even. If it's just a hobby project tho, sure. I've done it several times just to deploy a simple POC that can be actually used.
Is there anything specific you had a question about?
Offtopic: I've never worked on anything that has > 100 (200 on good days) users. When you have >10k or 100k users, what really changes? Do you have to rewrite everything? Do you spin up more machines? Profile & optimise any/all bottlenecks to eke out more perf? Am really curious, as I'm not sure what something of that scale looks like. I always imagine those get rewritten in C/C++/Java or Rust (the current year's pick) or have a very special architecture, etc. :)
haha, nope I've scaled ruby on rails apps (ruby is one of my favorite languages so I say this with love: it is fat and resource hungry) to huge levels. Basically:
> Do you spin up more machines?
Is correct. We make sure that the app itself is stateless (very important for horizontal scalability), and then I set up auto-scaling groups in AWS (or more recently actually, pods in kubernetes that auto-scale). For database I use sharding, tho lately I've gotten away from sharding because it complicates the database queries and you would be amazed at how far you can vertically scale an RDS Aurora instance.
You do have to profile expensive database queries tho. Basically if your app is stateless and your database queries are performant, you can scale really far by just adding instances (more machines).
That said, there is a point where rewriting is attractive: cost. Adding these instances is expensive. There was a great blog post (that I'll try to find) where a service in RoR needed an app server for every 400 to 500 concurrent users, rewrote in Go, and was able to use one app server for every 10,000 concurrent users.
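To make the sharding point above concrete: the routing layer itself is tiny, which is why it's tempting; the pain is in everything around it (cross-shard queries, rebalancing, backups). A sketch with made-up connection strings:

    import hashlib

    # Placeholder DSNs; each points at an independent Postgres/Aurora primary.
    SHARDS = [
        "postgres://db-shard-0.internal/app",
        "postgres://db-shard-1.internal/app",
        "postgres://db-shard-2.internal/app",
        "postgres://db-shard-3.internal/app",
    ]

    def shard_for(user_id):
        # A stable hash (not Python's builtin hash(), which is salted per process)
        # keeps a given user on the same shard across every app instance.
        digest = hashlib.sha1(str(user_id).encode("utf-8")).digest()
        return SHARDS[int.from_bytes(digest[:4], "big") % len(SHARDS)]

    # Every query for a user then goes through shard_for(user_id) -- and every
    # cross-user report suddenly has to touch all four databases.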
wow! Thanks for taking the time to explain. Your idea of stateless apps gave me a new perspective, something to try on next project. The article you mention should also be an interesting read. :)