- The frontend is running on a 30 bucks/mo VPS with cycles and RAM to spare.
- There are 2 DB servers at 5 bucks/mo for redundancy.
This thing would scale pretty linearly in theory, so for 10 million users just add like 4 servers.
So about 200 bucks / mo to run your average startup's web app at 10 mil users.
Alternatively write bad software and throw 50k/mo at AWS I guess.
I guess the only thing you couldn't handle well would be large spikes in demand.
Edit: I should add that I ran into performance hurdles now and then, causing me to upgrade to more expensive hardware until I had resolved them. It was mostly memory leaks or inefficient IO/websockets. Nothing has taken longer than a week to fix so far, though.
My eye twitched when I read this, considering that I remember more than one AWS outage that affected S3 availability/reliability.
AWS's own pages mention 11 nines for /durability/; however, I doubt this durability number. This level is almost unachievable by any end customer product. Could it be some clever durability accounting?
I'm quite disappointed in AWS's service status reporting: the service may be disrupted so badly that it can't serve a single request, yet the status pages don't mention problems at all, or just report "increased error rates".
> this level is almost unachievable by any end customer product.
Not sure what your point is here, unless you're highlighting the benefits of using a cloud provider. AWS is not using an "end customer product" to implement S3. They're providing a managed service, and their implementation and management is what allows it to achieve those levels of durability.
S3 has 11 9's of durability. This means you can expect to lose one out of 100,000,000,000 objects stored in S3 over a given year. S3 achieves this by using "erasure codes", distributing chunks across hundreds of hard drives, and quickly replacing lost chunks. Hadoop's HDFS is a popular storage system that can do erasure coding on your own servers. To make it more durable, you will need many reliable hard drives. But I didn't find any indication that Hadoop detects and replaces lost chunks automatically, so its durability is severely limited compared to S3.
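For intuition, here's a minimal sketch of the erasure-coding idea in Go, using the open-source klauspost/reedsolomon library. The 10+4 shard layout is illustrative only; S3's real parameters aren't public:

    package main

    import (
        "bytes"
        "fmt"
        "log"

        "github.com/klauspost/reedsolomon"
    )

    func main() {
        // 10 data shards + 4 parity shards: any 4 shards (drives) can be
        // lost and the object is still fully recoverable.
        enc, err := reedsolomon.New(10, 4)
        if err != nil {
            log.Fatal(err)
        }

        object := bytes.Repeat([]byte("hello s3 "), 1000)

        // Split the object into 14 shards (parity starts empty), then
        // compute the parity shards.
        shards, err := enc.Split(object)
        if err != nil {
            log.Fatal(err)
        }
        if err := enc.Encode(shards); err != nil {
            log.Fatal(err)
        }

        // Simulate losing 4 drives by dropping their shards.
        shards[0], shards[3], shards[7], shards[12] = nil, nil, nil, nil

        // "Quickly replacing lost chunks" is essentially this, run
        // continuously against a fleet of drives.
        if err := enc.Reconstruct(shards); err != nil {
            log.Fatal(err)
        }
        ok, err := enc.Verify(shards)
        fmt.Println("fully recovered:", ok, err)
    }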
S3 has 4 9's of availability. This means you must expect S3 to be inaccessible 0.01% of the time, or about an hour per year.
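(For the arithmetic: 0.0001 × 365 × 24 h ≈ 0.88 h, i.e. roughly 53 minutes of expected unavailability per year.)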
This rules out small clouds for us.
So we use GCP, Azure, and AWS exclusively, because of their ability to defy network-level DDoS.
Why? Mostly because of cost, and it seemed simpler than building things out on cloud services.
Every Friday night, our services got DDoSed by our competitors.
The provider would null-route our IPs and our service would go down.
We struggled with it a lot, since we weren't big enough to afford a premium DDoS solution.
Once we realized that the big clouds like AWS and GCP don't suffer from this, we had to make the switch.
When you say GCP, Azure, and AWS have the capacity to defy the DDoS, what capacity are you referring to?
Are you talking about actually scaling and serving the bogus requests?
Or the capacity to have enough bandwidth and firewall power to fend it off?
Shield (https://aws.amazon.com/shield/) is AWS's DDoS protection service. It's free and provides basic protection against L3/L4 attacks.
Shield Advanced (same URL as above) is a big step up in price, but gives you access to 'improved protection' and a global response team.
Cloudfront (https://aws.amazon.com/cloudfront) is a CDN with global edge locations.
WAF (https://aws.amazon.com/waf) is AWS's web application firewall service. It's less about DDoS and more about specific application attacks, but it's part of the whole solution.
For more detail, you can have a look at AWS's DDOS whitepaper: https://d0.awsstatic.com/whitepapers/Security/DDoS_White_Pap...
You probably would have been better off using a cloud mitigator from the start. Their pricing is competitive when you factor in all of the costs.
They have very effective network-level DDoS detection/filtering.
So you're saved then.
I wonder how much DDoS it takes before a cloud provider starts dropping a customer’s packets now. Do they even bother anymore?
Just look at the insane(!!!) bandwidth costs...
Why does Netflix outsource to AWS?
The gripes I've had with dedicated infrastructure teams are that their knowledge/experience can be hit or miss. My current company is about saving money, so sometimes there are junior guys on there. Or there are guys who need to handle 20 different products; they're good at 10, but suck at the other 10. I personally don't believe they follow best practices (but I'm not an infrastructure expert), because downtime happens.
Nowadays they are such a good advert for AWS that they probably negotiated a deal which is way cheaper than the published price.
Also Netflix’ major cost is content; servers are kind of a rounding error.
I seem to recall they offered a rack full of their own caching equipment to medium and larger ISPs (to save both sides streaming costs while providing better service to Netflix customers). Basically a private, app-specific CDN (Netflix's Open Connect).
Nothing precludes using AWS/GCP in limited ways, where the value proposition is especially good. Just not for the base capacity in your core infrastructure.
I have spent a good chunk of my career at a Fortune 10 energy company.
We hired people who specialized in working with SAP, Oracle databases, MSSQL, messaging (think MSFT Lync), email infrastructure, physical servers for web apps, Cisco, desktop provisioning, application development, mobile device management, ticketing software, etc. etc.
It’s very hard and very expensive to acquire the right talent to manage a particular technology, administer it, support it, etc. Every piece of vended software we work with has some company that looks at our name and comes up with some outrageous cost for a few licenses, dev seats, or whatever their business model is.
Acquiring the right quality talent is even harder. You bring them in. You try to teach them about your business so they feel valued and part of something big. They don’t stick around. They want to be at the Googles and Amazons of the world. They leave.
In this scenario, I’d rather use SQS than try to interview, hire, train, and make the hardware/software investment in running an SQS equivalent service in house. I don’t want to hire DBAs and build that profession in my org. It’s not our core business. My customers in the business don’t give a shit about it. They want something fast, reliable, and dependable.
Fundamentally, this argument of just doing it yourself to save money is short-sighted. There are a lot of factors. If you believe hiring some internal teams is always cheaper than using managed services, you’re in for a world of hurt.
If my company’s core products or services are centered on, let’s say, acquiring surface and mineral rights, discovering hydrocarbons, and using the right technology for drilling, it makes practically zero sense for me to also try to become an expert at maintaining servers, hiring top-tier software engineers, etc. etc. I don’t have the kinds of problems to keep them engaged, nor do I have the budget.
Software is a part of the business insofar as it enables the business, but it’s not ultimately what drives profit.
Many of the top tech companies outsource HR software or recruiting entirely, use travel-and-expense software as a service, Oracle financials for payroll instead of writing it themselves, etc. etc. Why? Because they want to invest their money and talent into the core areas that make them a profit.
Nothing wrong with releasing a v0.01 and not putting priority on anything but basic functionality until you start to grow. Half of this means you need to build out functionality that doesn't require heavy monitoring, metrics, and logging to get the job done. All that infrastructure stuff is more fun to code when you've blown through the free tier of S3 storage anyway.
I'm not overly familiar with AWS services, but I'd be very surprised if they didn't give a basic, good-enough solution to this with CloudWatch, which is even less work than outlined above.
If you want tools that you can manage yourself, then use a combination of StatsD + Grafana for metrics, and Sentry for errors. For logs, Graylog if you want to set up something complicated but powerful, and syslog-ng if you just want to dump the logs somewhere (so you can simply grep them).
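For what it's worth, the StatsD side of that is tiny. A minimal Go sketch, assuming an agent (StatsD/Telegraf) listening on the default port 8125; the metric names are made up:

    package main

    import (
        "fmt"
        "net"
        "time"
    )

    func main() {
        // StatsD speaks a trivial text protocol over UDP: "name:value|type",
        // where type is c (counter), ms (timing) or g (gauge).
        conn, err := net.Dial("udp", "127.0.0.1:8125")
        if err != nil {
            panic(err)
        }
        defer conn.Close()

        // Counter: one request served.
        fmt.Fprint(conn, "myapp.requests:1|c")

        // Timing: how long the request took, in milliseconds.
        start := time.Now()
        // ... handle the request ...
        fmt.Fprintf(conn, "myapp.request_time:%d|ms", time.Since(start).Milliseconds())

        // Gauge: current queue depth.
        fmt.Fprintf(conn, "myapp.queue_depth:%d|g", 42)
    }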
Running every hour, you won't get per-second or per-minute data points. For more critical sections of code, maybe have your code log directly to an agent on the machine that periodically flushes those calls to some logging service (a sketch of that pattern follows below).
2. None of these commercial services gives you per-second granularity, that I have seen.
If there are any that do, I'd like to know of them :-)
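That flush-to-an-agent pattern is simple to sketch in Go: the hot path does a cheap channel send, and a background goroutine ships batches on a ticker. Names here are invented; the flush target would be whatever logging service you use:

    package main

    import (
        "fmt"
        "time"
    )

    type Event struct {
        Name string
        At   time.Time
    }

    // startAgent returns a channel the hot path writes to; a background
    // goroutine batches events and flushes them periodically.
    func startAgent(flushEvery time.Duration) chan<- Event {
        ch := make(chan Event, 10000) // buffered so the hot path rarely blocks
        go func() {
            var batch []Event
            tick := time.NewTicker(flushEvery)
            for {
                select {
                case ev := <-ch:
                    batch = append(batch, ev)
                case <-tick.C:
                    if len(batch) > 0 {
                        // Stand-in for an HTTP POST to your logging service.
                        fmt.Printf("flushing %d events\n", len(batch))
                        batch = batch[:0]
                    }
                }
            }
        }()
        return ch
    }

    func main() {
        events := startAgent(5 * time.Second)
        events <- Event{Name: "critical_section_hit", At: time.Now()} // ~a channel send
        time.Sleep(6 * time.Second) // give the agent a chance to flush
    }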
At 10 users, you should be focusing on growth, not fussing with your tech stack.
Generally speaking, we've/I've seen a number of customers struggle with premature optimization of their infrastructure, and/or just not knowing what they should think about, and when, when it comes to scale and architecture patterns. Think of this deck as an 80/20 rule for the most general and basic of situations. Also, 2013 was a very, very different world in the cloud and in infrastructure than today.
This deck was all pre-Docker/k8s, pre-Lambda/FaaS, pre-a-lot-of-cool-DB-tech. However, as a general starting point, a lot of it still holds just fine. You should probably start with a relational DB if you have a more traditional background using them. You probably don't need autoscaling on Day 1 (but it's nice via things like Elastic Beanstalk).
Someone commented that metrics/logging/monitoring is a Day 0 thing, and it absolutely is, but it's the kind of thing most companies skimp on until very late in the game. Not pushing for excellence in this area will block you from success down the line. The same goes for hiring someone with dedicated Ops/SRE responsibilities, and/or making sure someone on the team has the right experience in that space.
DB growth/scaling is now a lot more transparent than it used to be, but I still see people doing less than ideal things with them. This is an area the industry needs to be sharing more best practices on (while still respecting some of the best practices from decades of SQL).
On costs: today some of the biggest sites in the world run on a public cloud. They do that for many reasons, but cost-effectiveness at scale is pretty well proven when measured the right way. Most math ignores people and opportunity cost, and that's what gets you, more than metal and power. There is also now the wealth of managed services that amount to a replacement of people, with a returned value greater than the people cost (essentially the ROI on managed services greatly outpaces the value of doing it yourself). The SaaS ecosystem is also vastly larger than it was in 2013. I hope to never run a logging cluster, monitoring systems, backup tools, etc., myself again.
Anyway, glad to see this deck still kicks around and that people are discussing it. Happy to try and answer more questions on this. - munns@
By the way, there are links below the text, here's the one to the HN discussion of 2016: https://news.ycombinator.com/item?id=10885727
For the database, we use RDS/DynamoDB.
Redis for cache.
DynamoDB is better in cases where we want to localize the latency of our regional Lambdas.
RDS for everything else, like dashboard entity storage etc.
CloudWatch prints the logs; Kinesis takes them to S3, where they're transformed in batches with Lambda; then the data is moved into Redshift.
Redshift for stats/reports.
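The transform hop in that pipeline is roughly the following shape: a Go Lambda fired by S3 object-created notifications. This is a sketch with placeholder logic, not their actual code:

    package main

    import (
        "context"
        "fmt"

        "github.com/aws/aws-lambda-go/events"
        "github.com/aws/aws-lambda-go/lambda"
    )

    // handler runs once per batch of S3 events. For each new log object you
    // would GetObject the raw Kinesis batch, reshape rows into something
    // Redshift-friendly (e.g. gzipped CSV), PutObject to a staging prefix,
    // and let a scheduled COPY load it into Redshift.
    func handler(ctx context.Context, evt events.S3Event) error {
        for _, rec := range evt.Records {
            fmt.Printf("transforming s3://%s/%s\n",
                rec.S3.Bucket.Name, rec.S3.Object.Key)
        }
        return nil
    }

    func main() {
        lambda.Start(handler)
    }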
Converted the whole ad network to serverless.
Used the Rust Lambda runtime for CPU-intensive tasks.
Using Go for the rest of the Lambdas.
I love Go and Rust, and optimizing just one Lambda at a time brought the programming joy back into my life.
Used Apex + Terraform to manage the whole infra for the ad network.
We managed to deploy Lambda in all AWS regions, ensuring minimum latency for the ad viewers.
The thing which took a team of over 50 people (tech side only) now takes 10 people to run the whole ad network.
The network is doing 20M profit per year / 9 billion clicks per day.
It's my greatest achievement so far, because we made it from scratch without any investor money.
But on the other side, we'll have to shrink our team size next year, as the growth opportunity is limited and we want to optimize efficiency further.
I'm currently planning to write a book about AWS. It should teach people how to build their MVPs without restricting themselves in the future.
Are you available for an interview in January?
I think I'll put up a small splash page for email gathering in the next few days to keep people up to date :)
I'll start gathering information in January; feel free to share this form.
A 50-person team to run an ad network, on the tech side only? I'm really curious why it took that many people before moving to Lambdas. We are also in the adtech space, and there is a five-person team (on-call ops + 2 devs) to run our own data collection, third-party data providers, RTB doing half a million QPS, and our own ad servers doing hundreds of millions of impressions daily.
Think about drill-downs to 3 levels, based on device, OS, placement, country, ISP, etc., along with click stats per variable.
I've never used Elasticsearch for this.
Before that we used BigQuery, but every query took at least 2 seconds.
So we had to move to a dedicated Redshift cluster.
> unless you were huge enough to have your own locations
Those locations are millions, sometimes hundreds of millions, of dollars in investment, with backup power generators large enough to provide power to a comfortably sized village. So, "large enough to a) need and b) be able to afford owning such a location just for your own needs", e.g. Google, Amazon. Even companies like large banks have their servers co-hosted in a separate section, but in a location owned by a 3rd-party co-hosting provider. To own one, you either are one of those providers or you are in the "Google tier". For the purposes of the current context, the linked article, one would even need multiple such locations all over the world. I think that qualifies as "huge" (the company owning such infrastructure just to run its own servers; co-hosting firms do it for others).
The cloud has talent working in the background on AWS's payroll; they have a better ability to hire at scale than we do.
So we decided to use them, and no, we don't regret it. It's a more reliable cost than hiring and managing a team which might prove to be less reliable.
The load balancer is shared across user accounts, so Amazon has to stop the DDoS.
Once you have set up Lambda in one region, you just need to loop through the list of regions and deploy your Lambda in ALL AVAILABLE REGIONS. Yes, it's that simple!
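In SDK terms, the loop is something like this sketch (AWS SDK for Go v1; the function name, role ARN, and zip path are placeholders, and the poster actually drove this with Apex+Terraform):

    package main

    import (
        "fmt"
        "log"
        "os"

        "github.com/aws/aws-sdk-go/aws"
        "github.com/aws/aws-sdk-go/aws/session"
        "github.com/aws/aws-sdk-go/service/ec2"
        "github.com/aws/aws-sdk-go/service/lambda"
    )

    func main() {
        sess := session.Must(session.NewSession(aws.NewConfig().WithRegion("us-east-1")))

        // Ask EC2 for every region enabled on the account.
        regions, err := ec2.New(sess).DescribeRegions(&ec2.DescribeRegionsInput{})
        if err != nil {
            log.Fatal(err)
        }

        zip, err := os.ReadFile("function.zip") // pre-built deployment package
        if err != nil {
            log.Fatal(err)
        }

        for _, r := range regions.Regions {
            svc := lambda.New(sess, aws.NewConfig().WithRegion(*r.RegionName))
            _, err := svc.CreateFunction(&lambda.CreateFunctionInput{
                FunctionName: aws.String("ad-server"), // placeholder
                Role:         aws.String("arn:aws:iam::123456789012:role/lambda-exec"),
                Runtime:      aws.String("go1.x"),
                Handler:      aws.String("main"),
                Code:         &lambda.FunctionCode{ZipFile: zip},
            })
            if err != nil {
                log.Printf("%s: %v", *r.RegionName, err)
                continue
            }
            fmt.Println("deployed to", *r.RegionName)
        }
    }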
API Gateway doesn't charge for 4xx responses, so it's very good for defying layer-7 DDoS too.
Add Cognito and use a Lambda authorizer; it generates API keys and emails them to your users.
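A Lambda authorizer has roughly this shape in Go (a sketch: real validation would verify the Cognito JWT, and the key lookup here is a placeholder). Returning the caller's API key as the usage identifier is what lets API Gateway throttle per key:

    package main

    import (
        "context"
        "errors"

        "github.com/aws/aws-lambda-go/events"
        "github.com/aws/aws-lambda-go/lambda"
    )

    func handler(ctx context.Context, req events.APIGatewayCustomAuthorizerRequest) (events.APIGatewayCustomAuthorizerResponse, error) {
        apiKey, ok := lookupKey(req.AuthorizationToken) // stand-in for Cognito JWT verification
        if !ok {
            // This specific error makes API Gateway answer 401 for us.
            return events.APIGatewayCustomAuthorizerResponse{}, errors.New("Unauthorized")
        }
        return events.APIGatewayCustomAuthorizerResponse{
            PrincipalID:        "user",
            UsageIdentifierKey: apiKey, // ties the request to a usage plan / API key
            PolicyDocument: events.APIGatewayCustomAuthorizerPolicy{
                Version: "2012-10-17",
                Statement: []events.IAMPolicyStatement{{
                    Action:   []string{"execute-api:Invoke"},
                    Effect:   "Allow",
                    Resource: []string{req.MethodArn},
                }},
            },
        }, nil
    }

    // lookupKey is a placeholder for mapping a validated caller to their API key.
    func lookupKey(token string) (string, bool) {
        if token == "" {
            return "", false
        }
        return "example-api-key", true
    }

    func main() {
        lambda.Start(handler)
    }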
Add latency-based DNS routing using Route 53 on top, and you ensure minimum latency in all regions!
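The Route 53 side, sketched with the same SDK (zone ID, domain, and per-region endpoints are placeholders). The Region + SetIdentifier pair on otherwise identical records is what makes Route 53 answer each query with the lowest-latency region:

    package main

    import (
        "log"

        "github.com/aws/aws-sdk-go/aws"
        "github.com/aws/aws-sdk-go/aws/session"
        "github.com/aws/aws-sdk-go/service/route53"
    )

    func main() {
        sess := session.Must(session.NewSession())
        svc := route53.New(sess)

        // One record per region, all sharing the same name.
        endpoints := map[string]string{
            "us-east-1":      "abc123.execute-api.us-east-1.amazonaws.com",
            "eu-west-1":      "def456.execute-api.eu-west-1.amazonaws.com",
            "ap-southeast-1": "ghi789.execute-api.ap-southeast-1.amazonaws.com",
        }

        var changes []*route53.Change
        for region, target := range endpoints {
            changes = append(changes, &route53.Change{
                Action: aws.String("UPSERT"),
                ResourceRecordSet: &route53.ResourceRecordSet{
                    Name:            aws.String("api.example.com"),
                    Type:            aws.String("CNAME"),
                    TTL:             aws.Int64(60),
                    SetIdentifier:   aws.String(region),
                    Region:          aws.String(region), // latency-based routing
                    ResourceRecords: []*route53.ResourceRecord{{Value: aws.String(target)}},
                },
            })
        }

        _, err := svc.ChangeResourceRecordSets(&route53.ChangeResourceRecordSetsInput{
            HostedZoneId: aws.String("Z0000000EXAMPLE"), // placeholder zone
            ChangeBatch:  &route53.ChangeBatch{Changes: changes},
        })
        if err != nil {
            log.Fatal(err)
        }
    }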
That said, it's not a perfect guide. You need to adjust based on your area of expertise and comfort, and also any business requirements. For example, in most businesses protecting the database is important, so running only one instance with no backups/snapshotting would be bad, reckless even. If it's just a hobby project tho, sure. I've done it several times just to deploy a simple POC that can actually be used.
Is there anything specific you had a question about?
> Do you spin up more machines?
That's correct. We make sure that the app itself is stateless (very important for horizontal scalability), and then I set up auto-scaling groups in AWS (or, more recently, pods in Kubernetes that auto-scale). For the database I use sharding, tho lately I've gotten away from sharding because it complicates the database queries, and you would be amazed at how far you can vertically scale an RDS Aurora instance.
You do have to profile expensive database queries tho. Basically if your app is stateless and your database queries are performant, you can scale really far by just adding instances (more machines).
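On the sharding point: the routing itself can be as small as a hash over the sharding key, as in this sketch (the shard DSNs are made up). The caveat above is real, though: cross-shard queries need app-level fan-out, and naive modulo hashing makes adding shards painful, which is part of why consistent hashing exists:

    package main

    import (
        "fmt"
        "hash/fnv"
    )

    // Placeholder DSNs, one per shard.
    var shards = []string{
        "postgres://db-shard-0/app",
        "postgres://db-shard-1/app",
        "postgres://db-shard-2/app",
        "postgres://db-shard-3/app",
    }

    // shardFor picks a shard deterministically from the sharding key
    // (e.g. the user ID), so a given user's rows always live together.
    func shardFor(key string) string {
        h := fnv.New32a()
        h.Write([]byte(key))
        return shards[h.Sum32()%uint32(len(shards))]
    }

    func main() {
        fmt.Println(shardFor("user:42"))   // always the same shard for this user
        fmt.Println(shardFor("user:1337")) // possibly a different one
    }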
That said, there is a point where rewriting is attractive: cost. Adding these instances is expensive. There was a great blog post that I'll try to find, where a service in RoR needed an app server for every 400 to 500 concurrent users, and after a rewrite in Go was able to use one app server for every 10,000 concurrent users.
Things barely ever change - engineers just love solving the exact same problems over and over again ;)