Docker operations slowing down on AWS (jeremyeder.com)
429 points by pradeepchhetri 10 months ago | 170 comments

And then people consider me a dinosaur when I say, no cloud, just rent a server or two (not colo! just dedicated servers). Your average web service does not need to scale near infinitely; for the same amount of money you pay to Amazon you can overprovision 3-5-10 times and that'll handle your spikes. No surprises. Same amount of work: EC2 and bare metal both give you a root prompt, go from there. These days you can get things provisioned within 24 hours -- some providers will do it in minutes. Of course, Amazon provides a lot of services beyond basic EC2 instances but if you use them you have a very ugly vendor lock and heaven forbid you want to do something else in the future...

This is like a double trap many try to sell to startups: a) you need to scale across many machines and b) the way to scale is the cloud. My take: a single machine (or two for HA) will be enough, if you really want to go big separate the web server from the database but that's it. And yes, I am in the website performance business, I worked on the video purchase platform of one of the largest British television stations and even that didn't require more than a single database server and a single Redis server for caching layer. Harken to https://stackoverflow.com/questions/5131266/increase-postgre... this question from 2011 discussing speeding up from like a thousand inserts per second at the cost of data loss -- today you will write on an SSD and don't need to risk data loss. Does your web site / app really get a thousand writes every second? I thought not. Does it even get a thousand reads? If not, then why are you building a complex database cluster...?

The other day I saw quad E7-4870 (yeah won't win any single thread contest but has 40 cores and 80 threads) 512GB RAM servers for $299 a month, with 1TB RAM for $499. Had a low end 2TB SSD for boot and you could add 8x1TB HDD w/ HW RAID for $40...

I've been running a web application for the past 6 months and it just crossed 150,000 page views/month mark. Sure, for others it's not that great, but for me, this is the project that is showing the biggest potential.

Anyway, the funny thing is, I'm running it on $2.50/month Vultr VPS. I got so worried when it crossed 30,000 that my site will crash. But it didn't. Then when its views got higher, I optimized further. I learned more things about how to make things faster. I refactored my code to squeeze out that much more juice out of $2.50 - not because I'm cheap (well, perhaps), but because I wanted to see how far I can push it.

I'll probably cross 200,000 soon and I think by then I'll need to increase it to (yikes heaven forbid!) $5/month plan.

Recently, just because of all the hype about AWS (in general and also at my work) I wanted to get started with AWS. Then I looked at their ridiculous pricing calculation page and I just closed the browser. I thought my time was better spent working on my project than learning about how to deploy a simple PHP application to AWS.

150,000 views a month is like a hit every two seconds. That's nothing. Talk to me when you're managing 150000 hits a second.

Edit: sorry didn't mean to imply that your site isn't successful, only that in terms of traffic, it doesn't make aws worth it.

I don't understand the point of your comment. The parent was simply explaining why they don't need extreme scalability with their scenario. No one is under the impression that multiplying their traffic by 75000 (so you work at Google then?) will not require a big server upgrade.

He's making a judgement on the value of aws using a use case that makes no sense to use aws, anyway. He could be doing something similar with lightsail on aws at similar cost if he really wanted to, though.

> He's making a judgement on the value of aws using a use case that makes no sense to use aws,

I think the problem here is that a lot of people use AWS when there's no value in using AWS. It's commonly reached for as a first choice when it's often not a good option if you're looking to run lean.

AWS is extremely popular, but probably only efficient for a small percent of companies who have wildly variable traffic patterns.

Your idea of "running lean" optimizes for the operational expenses of servers, which are across-the-board cheap, rather than cost of developer time.

I bill between $200 and $250 an hour. (If I had a full-time job I'd be salaried around $90-$100/hour.) Being able to pay me less because AWS's tooling makes life a hell of a lot easier once set up makes a lot of sense even for fairly small companies.

AWS tooling doesn't save developer time in my experience. In fact, I'd argue that optimizing for it more often than not wastes it.

I've watched someone spend days learning to configure a performant DB server on AWS due to their poor disk IO performance and spend an enormous amount to get a high ram instance when a simple SSD based server where more IOPS were trivially available would have had it working out of the box.

Everything has a learning cost though, so perhaps that's a somewhat unfair example as having a deep knowledge of data centers and backbone providers doesn't come free, but I can crank an ansible and docker script out and have pretty much all the advantages of AWS but deployed on my own hardware at a fraction of the cost. So I'm not sure it's fair to say that AWS tooling offers anything particularly unique to merit its market position.

So I'm not trying to big-time you, but I have experience across a wide range of environments and shop sizes (both in clouds and, unfortunately, people who bought the "VPSes are fine too" idea in like 2015) and, after being in these trenches for a while, "it doesn't improve developer productivity" reads more to me as "we don't know how to leverage AWS for developer productivity." Elasticity is nice; pervasive automation and deep introspective monitoring (which you don't have to create and manage and scale) is not merely nice but required, especially for small teams, because your small team cannot afford to be disrupted by systems that can't take care of themselves without kneecapping your velocity. I believe you're arguing in good faith but you're telling me that's this pebbly wall when it's actually the whole elephant.

And AWS reduces business risk, too, which needs to be understood and respected, too. Your VPS universe is not actually repeatable. It takes one line to roll out a full environment for more than one of my clients. Dev environment? Here ya go, self-serve bootstrapping and management. Prod environment? Infra-as-code guaranteed to be what you pushed to test at the infrastructural level as well as at the deployable level.

This is why we are replacing system administrators with developers and why we are replacing hands-on system creation with cloud stacks: because it's a difference of kind and the fears of higher operational expenses are trivialized by being able to use that endless inventory to replace the expensive part of your operation--the people.

That reads more like AWS marketing copy :) At scale you have to architect around AWS's crappy network.

0 network transparency

abysmal IOPS performance

very limited config. options

Horrible uptime (US East has worse uptime as a region than the vast majority of quality DCs)

0 Access to people who can really help you (unless you are at several mil. per month spend)

You forgot to mention the choice between bad peering vs a high jitter direct connect. Two vlans on the same cable doesn't equal redundancy, two cables along the same geographical site neither.

I think you're either focusing on a very small part of the market - very large websites - or you're overestimating the requirements most websites have. Either way, it seems like you're talking about a very different scenario than I am.

Endless inventory and minimized administration but higher upfront and ongoing server investment is not a benefit when you need to write a website used by under 10k people a day. Nor is it a benefit for internal services moved to the cloud for nonsense reasons. Both of which I've seen done far too frequently. Simple tooling and small virtual machines or single dedicated servers are easily enough for these applications.

I can definitely understand the need for a small team working on a very large, high hit count application though. I'm not saying there are no uses for it, there totally are and you make valid points, just that the uses are more limited than one might expect from the hype.

So...I run one service (gratis, as it's a friend) that gets about 8K uniques a day. It costs $26 a month in AWS (it was $35 but that's apparently gone down, cool!) and makes about $400/month. That price tag is despite me not being particularly cost-conscious when rolling it out. It does use free tier AWS resources, but EC2 is not in free tier right now; we're talking about keeping DynamoDB at free tier read levels and stuff like that.

And, unlike the still-really-weird-and-ad-hoc VPS world, stuff like DR is a solved problem when-not-if you need it. The vig for AWS is consistently between 20% and 25% and you are able to leverage implicitly all the incredibly useful tooling and systems around you. If 20% to 25% of $26 is going to materially damage your business, you do not have a business.

What's wrong with VPSes? Care to elaborate?

- Automation/APIs are generally lousy. Some try (Digital Ocean is trying to shed its past and become a real boy--err, cloud), but stuff like Linode is infuriating to work with when you expect to be able to just do something like "declaratively describe your infrastructure and go". Or when you expect to have monitoring and alerting on hand without having to reinvent every wheel yourself.

- Networking is usually real bad. SDNs are your friend. Yeah, learn you an iptables and all, but this is the future, we can do better. (DigitalOcean is almost to the point AWS was at, like, eight years ago with EC2 Classic? Something like that.)

- Geographically-centralized but independent systems are hard to come by and so fault tolerance is a Big Problem. AWS loses an availability zone, my stuff keeps rolling. Can't say the same elsewhere.

- Value-add services. The sibling comment's complaining about RDS, but RDS configuration hits probably the 98% case and You Don't Have To Learn It. I'm a little more hesitant about lock-in services like SQS, SNS, etc., but moral equivalents exist elsewhere for the most part, you can use them pretty effectively.

I thought that was the point chx was making, with sideproject giving an example. How many people really do 150,000 hits a second versus how many are using AWS?

How many projects actually get 150,000 hits a second?

(And are they not better served by buying and operating their own datacenters?)

Well, I've worked on two now... one was self-hosted by a large services company, which has its own data centers and redundant connections; it also handles a significant portion of DNS for the internet (for good or bad). It worked for them, but was painful... Getting new servers up was months of work.

I'm working at another service company expecting to clear 5M requests/day with bursts close to what you're talking about several times a year. We've had so much pain managing colocated servers, we can't justify the cost of 3 full data centers for just our load, that doesn't make sense. We're currently moving to a cloud provider to be able to scale out better when peaks really spike.

We had a customer that wanted to do several million requests in a 15 minute window, and can't currently handle that... We're restructuring/refactoring so that we can.

It may not be sustained, but getting 330k requests/second in bursts is a different way to think about problems than anything less than 1k/second, which many servers can hit without a sweat, and why I'd be more inclined to push for mid-level VPS like DO or Linode in those cases. Depends on need and expected growth.

How many hits/second does a project have to serve when they're on the front page of HN?

Shouldn't be anything incredibly high. I haven't seen any recent traffic info for HN, but a couple of years ago dang posted some numbers here: https://news.ycombinator.com/item?id=9219581

So at that point it was 2.6M views per day, which means HN itself was getting about 30 views per second. If you look at the 200k uniques instead (which might make more sense since any individual person will probably only click through once), that's about 2 unique visitors per second. So even if HN has grown a ton in the time since that post, I'd be surprised if it sent more than about 5 hits/second to anything.
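The per-second figures above follow from dividing the daily totals by the number of seconds in a day. A minimal sketch of that arithmetic, using the 2.6M views and 200k uniques quoted above (the function name is just for illustration):

```python
SECONDS_PER_DAY = 86_400


def per_second(daily_count: float) -> float:
    """Average events per second for a given daily total."""
    return daily_count / SECONDS_PER_DAY


views = per_second(2_600_000)  # daily page views quoted above
uniques = per_second(200_000)  # daily unique visitors quoted above
print(f"{views:.1f} views/s, {uniques:.1f} uniques/s")
# prints "30.1 views/s, 2.3 uniques/s"
```

These are averages over a whole day, so they say nothing about how bursty the click-through traffic to a linked site is.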

Hah, the fact that HN itself is a single server (behind Cloudflare, these days) should be enough proof that anything linked to by HN is unlikely to need more than a single server behind a caching CDN.

Well, let's also differentiate computationally-intensive hits (e.g., users adding things to a shopping cart, requests to list Pokemon in the area, etc.) from hits to a home page, which should be static or at least cached aggressively. 150,000 index.htmls per second and 150,000 database writes per second are very different.

I used to run a university web hosting platform that currently has, I think, five web server VMs on about as many physical machines, two physical machines for load balancing, and two physical MySQL servers in active/passive replication (i.e., only one gets either reads or writes). We hit the front page of HN fairly frequently—for instance, we host mosh.org—and it hasn't really been a problem. I remember getting paged in ... 2009 or so? ... when a particular website in WordPress got to the front page of Reddit, but we had fewer machines then, and also I think we had not deployed FastCGI for PHP at that point (for complicated shared-hosting reasons), so each WordPress page load was its own PHP process via CGI. If you're optimizing for performance, even if you want to stay on WordPress, step one is to not use plain CGI and step two is to do one of the myriad things you're supposed to do for WordPress caching.

In any case, a handful of physical machines will handle being on the front page of HN just fine. If you're doing something where you have an extremely computationally-intensive process on the first page load and you're worried you might hit HN but you might not, put it on cloud and set up autoscaling, but other than that it probably doesn't make sense. If you know you won't scale too much—and a static site on the front page of HN isn't too much—chances are that your usage is so low that you're paying a premium for the unused ability to scale and you should just pay for two cheap VPSes, and if you know you will scale (e.g., you have a large fixed workload), again you're paying a premium for the unused ability to scale down, and you should just invest in a datacenter and save in the long term.

All that said, if you've got a static site, by all means stick it on a CDN, which I think is a perfectly defensible use of cloud for sites of all sizes.

I don't remember exactly, but not that many, maybe a couple thousand concurrent users. My brother's webapp has hit the front page of Reddit a few times, but a single dedicated machine was more than enough to handle that.

A hit every two seconds is 30 * (86400 / 2) = 1296000 hits per month.

150000 hits per month is an average of a hit every 86400 * 30 / 150000 = 17.28 seconds.
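Both conversions can be checked with a few lines, assuming the same 30-day month used in the comment; the function names are made up for this sketch.

```python
SECONDS_PER_DAY = 86_400
DAYS_PER_MONTH = 30  # assumption: same 30-day month as in the comment


def hits_per_month(gap_seconds: float) -> float:
    """Monthly hits when one hit arrives every `gap_seconds` seconds."""
    return DAYS_PER_MONTH * SECONDS_PER_DAY / gap_seconds


def seconds_between_hits(monthly_hits: float) -> float:
    """Average gap in seconds between hits for a monthly total."""
    return DAYS_PER_MONTH * SECONDS_PER_DAY / monthly_hits


print(hits_per_month(2))               # 1296000.0
print(seconds_between_hits(150_000))   # 17.28
```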

Yes, that's nothing. With a hit every 17 seconds, there is no performance optimization to be done. Therefore I don't understand the concern about performance in the parent comment by 'sideproject'.

Sure, there could be peak times when there is a hit every 100 microseconds and that's what forced the parent commenter to focus on performance optimization but nothing about this was mentioned in the comment.

Details about traffic in such peak times would have made the parent comment by 'sideproject' interesting. But with the current details in the comment right now, it is going to leave readers confused why one needs to discuss performance optimization for a hit every 17 seconds on average.

What are these "hits"?

Why not have a CDN serve static pages? Or better yet, an app on a mobile phone?

Then the hits are just APIs.

Guess most of us should never talk to you. Oh well, nothing of value was lost.

150,000 views a month is 150000/30/24/60/60 = 0.0578 qps. If there's no heavy processing you shouldn't need to be doing any optimization for the machine's sake at that rate. Slow queries / frontend code is a different story :)

The traffic likely peaks at certain hours

Not sure what your website is doing, but here's a quick thought exercise:

If you serve a page in 10s, then you can serve 259,200 pages/month.

    render_time page_views/month
    10s         259,200
    9s          288,000
    8s          324,000
    7s          370,285
    6s          432,000
    5s          518,400
    4s          648,000
    3s          864,000
    2s          1,296,000
    1s          2,592,000

So like, think 1 million views per month with 2s render time.

This is obviously skipping over a ton of details, but it's a good rule of thumb.
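The rule of thumb can be reproduced with a short sketch. It assumes a 30-day month and pages rendered strictly one after another; the `workers` parameter is a hypothetical extension for servers that render several pages in parallel, not something the comment claims.

```python
SECONDS_PER_MONTH = 30 * 86_400  # assumption: 30-day month


def pages_per_month(render_seconds: int, workers: int = 1) -> int:
    """Pages served per month if each worker renders pages back to back."""
    return SECONDS_PER_MONTH * workers // render_seconds


# Print one row per render time, from 10s down to 1s.
for render_seconds in range(10, 0, -1):
    print(f"{render_seconds:>2}s  {pages_per_month(render_seconds):>9,}")
```

With `workers=1` this gives the same numbers as the thought exercise, e.g. 259,200 pages for a 10s render and 2,592,000 for a 1s render.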

Website traffic can vary heavily during some parts of the day depending on demographics. It's good practice, unless you know that traffic profile, to inflate by 2-3x your average page views per second over any time period larger than an hour.

In addition to this, one pageview may produce many requests. Both of these need to be profiled before you can estimate reasonably how much traffic a webserver can handle given its current resources.
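A rough sketch of how those two factors combine, assuming a 3x peak factor (the top of the 2-3x range) and a purely hypothetical 20 requests per pageview. Real values would have to come from profiling the actual site.

```python
def peak_requests_per_second(
    page_views: float,
    period_seconds: float,
    peak_factor: float = 3.0,        # assumption: upper end of the 2-3x range
    requests_per_view: float = 1.0,  # assumption: measure this per site
) -> float:
    """Estimate peak request rate from an average pageview count."""
    average_views_per_second = page_views / period_seconds
    return average_views_per_second * peak_factor * requests_per_view


# 150,000 views in a 30-day month, 3x peak, hypothetical 20 requests per view.
estimate = peak_requests_per_second(150_000, 30 * 86_400, 3.0, 20)
print(f"{estimate:.2f} requests/s")  # prints "3.47 requests/s"
```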

There are some good benchmarking tools that will load the entire page, including all its resources, and produce a more accurate load measure in terms of r/s.

As a side note, those $5 vultr instances can handle a surprising amount of static requests per second using nginx.

This is all fine for a personal project. The amount that the time you spent optimizing would cost a company is very likely to outweigh the cost of scaling hardware.

If I followed this advice, it would hurt my startup badly.

We would be investing far more time into system operations than I do now thanks to AWS's automation of standard stuff. My hosting bill would increase - partly because we can run on t2.micro instances, but that awfully glib advice about overprovisioning is most definitely asking for trouble; and we'd lose curated services like RDS and OpsWorks - which, by the way, are PostgreSQL and Chef, hardly an "ugly vendor lock" but simply well-designed infrastructure services based on standard parts. Oh, and I'd have to spend more money and time on auditors, because bare-metal providers don't have compliance programmes like AWS/Azure/GCP do. And I haven't even started to think about securing the resulting systems to the same level I get for minimal effort from a major public cloud.

You're not a dinosaur, no, but these claims don't match my reality, nor that of many other projects besides. This is compounded in my view because you've claimed specific expertise to promote your opinion. Any manager bringing me such a poorly considered and bombastically argued business case would be sent away with an ear bashing about TCO and opportunity cost.

I think people need to make a conscious tradeoff between ~5x AWS cost compared to rented server vs. OPS costs.

Some startups (A) I work with have basically no OPS costs beyond setup, integrating Docker deployments and getting automatic backup working. Most simple technology just works and devs easily can do operations. The largest pain point still is VPN. Machines today are very very fast and load of many startups is very low. These are mostly simple marketplace startups etc. without any rocket science aka "web ui frontend to database". They often have <10 servers and are still over provisioned mostly due to HA requirements.

Some startups (B) I work with have high OPS costs due to demanding technology needs for throughput, load peaks, amount of data with innovative technology at their core, aka "rocket science".

I have not seen any correlation between AWS usage and A/B types.

From my experience with startups the only way to successfully use AWS is deep integration and using lots of services. If you use AWS and do everything on your own, you are doing it wrong.

I managed a metal to AWS transition, and 5x definitely doesn't match the costs I experienced (in this specific case, it was around the lines of 1.2x).

I don't mean this case to be universal, in particular, I think cloud services force applications to have a particularly good/modular design (which is a cost in itself) - where, with metal, as you wrote, you can relatively cheaply overprovision.

I think the analysis you're making overlooks some important characteristics of the infrastructure engineering aspect.

Some typical network/infrastructure elements, in particular firewalling, load balancing, and network management don't necessarily belong to the "rocket science" type of application; they are easy to overlook in "type A" services, ending with a "kind-of-HA-but-not-really" infrastructure, which is ok, but it makes the comparison cloud <> metal not really meaningful, as in the cloud, those features are baked in ("almost" for free).

I'm very skeptical for example, that the 5x figure includes hardware for the above network equipment and management.

To summarize, it's perfectly fine not to have an "advanced" infrastructure but it must be highlighted that such conditions make a direct comparison incorrect.

Disclaimer: My experience is with growing startups younger than 5 years. So everything I say needs to be seen in that context.

From what did you transition to what? 20% sounds like a very small markup, from my experience and from other people's experiences documented on the web it's much much larger (at 10x?) than your experience.

So I would be rather interested in more details as many people ask me about Amazon transitioning, and 20% markup would be killer.

"Some typical network/infrastructure elements, in particular firewalling, load balancing, and network management don't necessarily belong to the "rocket science" type of application;"

I surely do not know your demands, but firewalling, load balancing etc. looks rather easy to me today for everyone except Google, Amazon, LinkedIn, AirBnB and 99% of startups are not one of these.

"I'm very skeptical for example, that the 5x figure includes hardware for the above network equipment and management."

Not sure what you are using, the hardware would be around $40 per month for that kind of network architecture (FW,HAProxy,Nginx,...).

In my last job we had large NetScalers which were much more powerful than HAProxy/Nginx on a rented server, and I assume AWS is as powerful, but for most of my clients this would be huge overkill.

> Not sure what you are using, the hardware would be around $40 per month for that kind of network architecture (FW,HAProxy,Nginx,...).

The LBs you're talking about are software. I was referring to hardware solutions; a soft load balancer is still a good solution, but it brings us back to the point I made before - unit granularity.

Do you refer to two (two is the minimum required for HA purposes) dedicated load balancing machines, or mixed services?

In the former case, metal is not very convenient, as the minimal unit even for a pure LB machine, is still expensive.

In the latter case, it's hard to say, but I think there is plenty of middle ground between a small startup and Amazon, where the cloud granularity is helpful and cost-effective (I'm not implying that it's generally cheaper than metal, though).

I'll write two answers. This is a description of the infrastructure.

Our base environment (we have several) has 4 servers, 2 of which are used for the app servers and load balancers, and 2 for data stores and queue processors.

Each server is the typical (as someone in this thread named) "8k" server (a bit more costly, actually). Generally speaking, the servers are significantly overprovisioned.

There are a couple of factors that made the conversion to AWS cheap (~+20%).

The first is that the base unit of a metal server is very large (1 server). Although it's cheap to scale vertically, scaling horizontally, for HA purposes, is expensive, because it costs at least 2 units.

For example, the total power of our app servers is overprovisioned in the 20x range (CPU and memory are cheap, right?). Even with minimal speccing, a metal server still costs around 3/4k, you need 2, that's 6/8k.

In AWS, we can work with very small units. So maybe we end up paying the same amount, but we don't need that excessive power, and, crucially, we have networking for free (or almost).

The second factor is networking and hosting costs, which are not trivial. We rent managed firewalls and load balancers, which in AWS are for free or almost.

Also, it's important to spread the servers cost over the time - any metal server won't last forever. If you buy one for 9.6k, it's 100$ a month for 8 years. When it breaks, HA goes temporarily out of the window until it gets fixed (or you buy a new one, or you move services around).
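The amortization above is straight-line: purchase price divided by the number of months of service. A minimal sketch, where the 4-year lifetime in the second call is only an illustrative assumption to show how sensitive the figure is to the lifetime you pick:

```python
def monthly_cost(purchase_price: float, lifetime_years: float) -> float:
    """Straight-line monthly cost of hardware over its service life."""
    return purchase_price / (lifetime_years * 12)


print(monthly_cost(9_600, 8))  # 100.0, the figure from the comment
print(monthly_cost(9_600, 4))  # 200.0, hypothetical shorter lifetime
```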

The big pain point of AWS [for us] is RDS, which is madly expensive. It accounts for something like 50% of our AWS costs.

A very gross estimation of the monthly costs of each metal server could be:

- 80$: server
- 80$: hosting
- 80$: managed networking

For 4 servers, that's almost 1000$. Adding 20%, with a budget of 1200$, with AWS, we have a less powerful but also correspondingly less wasteful infrastructure, with lots of baked-in functionality (including, more flexible HA).
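The estimate can be written out as a small sketch using the per-server figures from the comment. The dictionary and variable names are illustrative only.

```python
# Rough monthly cost per metal server, in dollars, as listed in the comment.
PER_SERVER_MONTHLY = {
    "server": 80,
    "hosting": 80,
    "managed networking": 80,
}
SERVERS = 4
AWS_MARKUP = 0.20  # the ~20% increase reported for this migration

metal_total = SERVERS * sum(PER_SERVER_MONTHLY.values())
aws_budget = metal_total * (1 + AWS_MARKUP)

print(f"metal: ${metal_total}/month, AWS budget: ${aws_budget:.0f}/month")
# prints "metal: $960/month, AWS budget: $1152/month"
```

The exact results, 960 and 1152, are what the comment rounds to "almost 1000$" and "1200$".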

I think the "5x" (though my experience is closer to 2x-3x) tends to come in if you already do the good/modular design and are set up to scale into hybrid setups as needed to handle traffic spikes. In those instances you don't overprovision, or overprovision maybe 20%, knowing that extending your virtualized setup into a part on premises or managed hosting setup, part public cloud is fast and seamless.

E.g. I had a setup that spanned on-demand instances, rented managed servers, racks in two separate colos and racks on premises. We expanded resources whenever it was cost effective at the time. Generally the colos won out, with on-demand instances handling traffic spikes, and managed servers primarily used for locations we did not have staff.

When your infrastructure is designed so that adding a new one of any of those is just a matter of assigning IP space to the new satellite network and deploy the first instances - whatever they're physically on - your utilisation of all the resources can be far higher.

E.g. in this setup we have instances where we move containers seamlessly between the UK, New Zealand and Germany currently depending on load, available resources, and which instances need low latency. Germany vs. UK makes a roughly 8ms latency difference despite going over an encrypted VPN connection, so we've even had times where client traffic hit load-balancers in the UK while the web servers were temporarily in Germany because it happened to be cheaper to expand there for a while (and unlike with AWS, our bandwidth costs in both locations are trivial).

If you're comparing to "let's throw a bunch of servers somewhere", then, yes, AWS probably won't be that much more expensive, and presumably that is a big part of why so many people get caught out by AWS costs once they start scaling up.

I used to work at aol, and I think 5x was on the low end of what we experienced, if you just forklifted an app over. You can bring that cost down eventually by rearchitecting and laying off/reassigning some ops guys, but that takes time.

I really struggle with this color of money nonsense that western finance has invented. Tell me again why hiring a developer is cheaper than buying one $8k server that will save us 35 man hours a week? Drives me batty.

Rearchitecting may be necessary anyway, but doing this kind of work involves hiring and training better senior devs and retaining them for a couple of years at least. That's not cheap. It's a lot more expensive than those OPS guys you laid off.

I think what both of these scenarios share is that they're not about saving money. They're about empire building for the Dev manager.

It's not so black and white. The changes mentioned may be a necessary evil when scaling; in this perspective, moving to the cloud makes a set of solutions (rearchitecting) happen sooner than later.

Also, an 8k server is a big unit. Cloud services are much more granular. This is a problem of bare metal - it's easy to overprovision because the base unit is large, and one ends up being happy to have an "overprovisioned" system, when in reality it's money down the drain.

Also, you don't count the management of the 8k server. It may (but not necessarily) be at a click distance; if it is, the management hardware (eg. one from a very famous server manufacturer) may have poor software.

There are reasons why, in some cases (of course, not in all or not in many), cloud may be more advantageous than an "8k" bare metal server.

All in all, I think without numbers, talking about metal vs. cloud in abstract, generic terms, makes a poor argument.

> Also, an 8k server is a big unit

My point was that it's peanuts compared to the labor costs it saves. I've worked quite a few places where a single $2-3k server would have saved every employee around an hour a day. In some cases several machines would have saved an hour each.

Every person you don't need gets you more than one person worth of increased productivity due to scaling limits, (see also Fred Brooks, IT and HR - more employees, more support staff).

One of the other aspects is that cloud stuff tends to make very apparent stuff that was always true but often ignored (for one reason or another). Like--that rearchitecture would be a good idea regardless of where you are (because stuff like twelve-factor apps are just a fact of life), they just become more obviously necessary when everyone's always hyping up that Anything Can Disappear At Any Time.

If you're just lifting apps up and plopping them on AWS, it can totally be really expensive.

(That doesn't happen when you're building from the jump for AWS or another cloud, though, unless you're messing up on a deeper level.)

I do contract devops for both physical servers (that includes me occasionally travelling to a data center), managed servers in a full service colo, and AWS, and the overall cost per server consistently ends up higher including all devops time for the AWS instance in my experience.

I love it when customers pick AWS (though I usually advise not to, unless they have very specific needs), as my billable hours are way higher for those clients, though it is annoying having to deal with the inevitable "why is my AWS bill so big?" after I'd told them exactly why it'd be expensive in the first place. This is particularly true with bandwidth heavy setups, where AWS charges tens of times more per TB transferred than e.g. Hetzner.

Often I even end up getting paid to help them get off AWS again down the line when they realise how expensive it is.

Overall the idea that managed servers and even bare metal colocated servers take up so much more operations time is just not what I experience. Even if it did, if you're even moderately successful it'd need to save you a crazy amount of operational time per server to make up for the price differences in the hosting.

Further my experience is that you don't need to over-provision much exactly because setting up hybrid setups where you spin up AWS instances (or any other cloud) to handle spikes works well. The trick is to treat managed servers exactly as cloud instances, apart from at most the initial hardware provisioning (though many hosting companies provide APIs for this so you can abstract away that too), so that it doesn't matter where you provision.

As for services like RDS etc., they're a very mixed bag. When they work for you, great, though they're expensive, but very often I end up with clients having to move off them because they need some plugin or other that isn't available. Very few of my clients in the end stay on them for very long. Their biggest benefit is to defer the initial setup of a self-managed cluster.

That doesn't mean there are no cases where AWS is the right choice - for starters having it there for traffic spikes is great (though we usually end up needing it very rarely), and if you need large batch jobs or other environments you spin up/down frequently, it may be cost effective. But it's extremely rare I see cases where it's cost effective for base load.

The typical argument then is that big companies wouldn't have picked them if it wasn't. But big companies don't pay the prices mere mortals pay. I know concrete examples of large negotiated discounts for a couple of larger companies, and they are steep.

> and I'd have to spend more money and time on auditors

That might be true, but most companies are not in a position where they need their IT infrastructure audited. This might very well be a niche that makes it worth using AWS.

> And I haven't even started to think about securing the resulting systems to the same level I get for minimal effort from a major public cloud.

My experience is that getting security right on the public clouds is harder than on bare metal. If you take the effort to do it properly, the end results can be good. But much of that is far simpler in a colo'ed environment, where you can physically separate resources into different network segments and the like.

For people with very complex requirements, you might even get better results, but I could never agree with "minimal effort" - it's very common to see people badly misconfiguring their IAM setup for example, because it was too hard for them to figure out how to open up just the specific things they needed to open.

All our duelling anecdotes aside, you've picked out the one category I would most definitely not host on AWS: high constant outbound traffic sites. "Don't build your CDN on AWS", I can agree about that. Using public clouds for the spiky part of an otherwise predictable steady-state workload is also a great strategy. Also this:

> Their biggest benefit is to defer the initial setup of a self-managed cluster

And that, I think, is why (nearly) every startup, or indeed every new project, should begin on a public cloud. The capital waste of an incorrect server purchase can ruin a project and makes "fast failure" hard to stomach.

I don't move services away until an alternative is clearly both cheaper in 3yr NPV, and will not constrain future business opportunities. Except in cash cow operations where innovation has ceased, I rate the second criterion as more important than the first, and a compelling reason to stay on a public cloud.
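For anyone unfamiliar with the "cheaper in 3yr NPV" test, here is a rough sketch of the comparison. Every figure below (monthly bills, migration cost, discount rate) is made up for illustration and is not taken from the thread.

```python
def npv(monthly_costs, annual_rate):
    """Net present value of a series of monthly costs, first payment at month 0."""
    monthly_rate = (1 + annual_rate) ** (1 / 12) - 1
    return sum(c / (1 + monthly_rate) ** m for m, c in enumerate(monthly_costs))


MONTHS = 36                    # the 3-year horizon
RATE = 0.08                    # assumed annual discount rate

# Hypothetical figures, for illustration only.
cloud = [3000.0] * MONTHS      # stay put: flat monthly bill
rented = [1200.0] * MONTHS     # cheaper monthly bill...
rented[0] += 20000.0           # ...plus a one-off migration cost up front

print(f"cloud  NPV: {npv(cloud, RATE):,.0f}")
print(f"rented NPV: {npv(rented, RATE):,.0f}")
```

The second criterion (not constraining future business) does not show up in a calculation like this, which is presumably why it is rated separately.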

I tend to favour starting with rented servers too, rather than purchases (or leases, rather) - whether that's on AWS, or rented by the month then becomes a much simpler comparison.

As for the CDN, we absolutely agree. I noted elsewhere that even if you host on AWS, if you have any kind of volume of outbound bandwidth use, you should probably get a CDN elsewhere whether or not you think you need a CDN. A good caching CDN can cut the AWS bill dramatically for many people without making them move anything else off AWS.

What about CloudFront?

As mentioned, its traffic charges are as high as EC2/S3. CloudFront is useful if your goal is to reduce load on your EC2/S3 setup for reasons mostly other than cost, and want an all-AWS stack.

If cost is your reason for looking at a CDN, or a big part of it, CloudFront will do very little for you unless your pageviews are extremely costly in terms of compute relative to the amount of data returned, and said data is very cache-friendly.

That's very much a niche requirement. To date I've not come across a setup where CloudFront made sense cost-wise.

Good to know. I've noticed that NixOS hosts their binary package caches on CloudFront, which I assumed was because it's cost effective. Those URLs are completely content addressed so they're optimal for CDN caching. Is there a significantly better alternative to S3+CloudFront? I'm pretty influenced by brand names in this situation because there are so many random hosting companies out there.

You can keep S3 if your cache hit rates are good enough and still save a lot. S3 is great for storage, e.g. for durability, as long as you can avoid serving up too much of your content from it.

To be honest I rarely use external CDNs and instead "roll my own" with a variety of providers as with cloud providers + geoip enabled DNS you can get 90% of the benefit at very low rates, but in terms of "brand name recognition" MaxCDN is worth checking. It's not nearly the maximum saving you can get, but it can be substantially cheaper than CloudFront.
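A rough model of why "cache hit rates good enough" matters when you keep S3 as the origin. The per-GB prices and the traffic volume are placeholders, not quotes from any provider; plug in real rates before drawing conclusions.

```python
def monthly_egress_cost(traffic_gb, hit_rate, origin_price_per_gb, cdn_price_per_gb):
    """Cost of serving `traffic_gb` through a caching CDN in front of an origin.

    Every byte leaves via the CDN; only cache misses are also billed as origin egress.
    """
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    origin_gb = traffic_gb * (1.0 - hit_rate)
    return origin_gb * origin_price_per_gb + traffic_gb * cdn_price_per_gb


# Placeholder prices in $/GB and a placeholder volume.
ORIGIN = 0.09
CDN = 0.02
TRAFFIC_GB = 50_000

direct = TRAFFIC_GB * ORIGIN   # serving everything straight from the origin
for hit_rate in (0.5, 0.9, 0.99):
    cost = monthly_egress_cost(TRAFFIC_GB, hit_rate, ORIGIN, CDN)
    print(f"hit rate {hit_rate:.0%}: ${cost:,.0f} vs ${direct:,.0f} direct")
```

The model ignores per-request charges and storage costs, so treat it as a first approximation only.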

Traffic pricing for CloudFront is similar to standard traffic pricing on AWS (e.g. for EC2 instances). They also charge a bit extra per request and by user origin, so in some cases it may be even more expensive.

For the life of me, I can't figure out why hackernews is still always talking about AWS. Every single time I've tried to look at it for a use case of mine it's the most expensive possible option. Cloud? Sure, I see the benefit and plan to go on the cloud (probably Azure, IBM or so). AWS? I don't get it.

Yes. Look at https://www.packet.net/bare-metal/ for bare metal boxes provisioned in less time than an ec2 instance.

Or if you want to go old school, cheaper, and less sexy/api driven: https://www.delimiter.com/

If you shop, for around 30-40 a month you can get 16 cores, 32gb ram, and a 120GB ssd or 1-2TB spinny disk.

I migrated to packet all bright-eyed and full of wonder... terrible mistake... outages on a weekly basis, and they have no availability zones, so when an outage occurs the entire region is down. Replicating across regions failed our latency requirements so that was a no-go.

I can't stress this enough. The cost of having to roll your own and administer a logging system, database, load balancers, autoscaling and file storage far exceeds the added cost of the cloud. Now that GCE has sustained usage discounts of 39% it makes no sense to go bare metal right now.

Quick back of the napkin math, but Google Compute Engine (while it is shared) is significantly less expensive than Packet. This is just comparing purely on price though.

  Packet - $292/mo
    4 cores
    32 GB memory
    128 GB SSD

  Google Compute Engine - $176/mo
     $156 - custom-4-32-extended
      4 cores / 32 GB memory 
     $20 - 120 GB SSD storage

"4 cores" on GCE means 4 hyperthreads, or 2 physical cores.


> For the n1 series of machine types, a virtual CPU is implemented as a single hardware hyper-thread on a 2.6 GHz Intel Xeon E5 (Sandy Bridge), 2.5 GHz Intel Xeon E5 v2 (Ivy Bridge), 2.3 GHz Intel Xeon E5 v3 (Haswell), 2.2 GHz Intel Xeon E5 v4 (Broadwell), or 2.0 GHz Intel Skylake (Skylake).
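Redoing the napkin math per physical core, using only the figures quoted above. Treating Packet's "4 cores" as physical cores is an assumption on my part; the listing above does not say.

```python
# Figures from the comparison above.
packet_monthly = 292.0
packet_physical_cores = 4            # assumption: these are physical cores

gce_monthly = 156.0 + 20.0           # instance + SSD storage
gce_vcpus = 4
gce_physical_cores = gce_vcpus / 2   # one vCPU is one hyper-thread

packet_per_core = packet_monthly / packet_physical_cores
gce_per_core = gce_monthly / gce_physical_cores

print(f"Packet: ${packet_per_core:.2f} per physical core per month")
print(f"GCE:    ${gce_per_core:.2f} per physical core per month")
```

This only normalises CPU; memory is the same in both configs, the disks differ slightly (128 GB vs 120 GB), and shared vs dedicated hardware is not priced in.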

A raspberry pi would also be more than enough for many use cases and can be hosted for ~$40/year: https://raspberry-hosting.com/en/order.

I'd have one (or more) already if I didn't live on the opposite side of the world to the data center.

Also scaleway.com (they have an API, but no user-data).

IMO, docker and container orchestration spell a bright future for bare-metal boxes like these, as you won't need cloudformation, etc.

But I still see few alternatives to S3. Many vendors offer block devices, but only the big clouds offer blob storage. Backup and restore from blob storage makes recovery from a crash pretty easy.

How does docker replace cloudformation? Don't you still need something that says "hey you are running out of capacity soon, time to add more hardware for your software to run on"?

There has to be something that adds more bare-metal for your docker containers to run on when the existing bare-metal reaches capacity, right?

A fully containerized setup allows one to use something like Kubernetes to do such orchestration.

K8s is getting pretty easy to set up these days.

I was asking about once your bare metal capacity reaches its limit. At some point you need to provision more bare metal and expand the total resources kubernetes can consume with docker.

Cloudformation, to me at least, is the power to expand resources for large traffic events. Most of the time you can get by with a small number of instances but it's nice when it scales up to hundreds of instances in minutes.

The bare metal equivalent would be to buy something that can handle the peak load from the start, right?

>Cloudformation, to me at least, is the power to expand resources for large traffic events. Most of the time you can get by with a small number of instances but it's nice when it scales up to hundreds of instances in minutes.

This is not what CloudFormation does. CloudFormation allows a declarative way to express a group of AWS resources to be created and coupled together. There's nothing that's quite exactly the same as CloudFormation, but stock Kubernetes is quite close, since you effectively describe what resources you want in individual declarative YAML/JSON files. However, there's nothing standard in Kubernetes for coupling them together into one thing equivalent to a Stack in CloudFormation.

Maybe we are talking about different things, but I use CloudFormation to autoscale ec2 instances based on avg CPU load over a period of time
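The two comments above are arguably describing the same setup from different ends: CloudFormation only declares resources, but one of the resources you can declare is an Auto Scaling group with a CPU-based policy, and that is what does the scaling. A sketch of such a template, built as a Python dict and printed as JSON. The logical names `WebGroup` and `CpuPolicy`, the sizes, the target value and the launch configuration name are hypothetical, and this is an incomplete fragment (the launch configuration itself is not defined) that has not been validated against AWS.

```python
import json

template = {
    "AWSTemplateFormatVersion": "2010-09-09",
    "Resources": {
        # The group of instances: CloudFormation just declares that it exists.
        "WebGroup": {
            "Type": "AWS::AutoScaling::AutoScalingGroup",
            "Properties": {
                "MinSize": "2",
                "MaxSize": "100",
                "LaunchConfigurationName": "my-launch-config",  # hypothetical
                "AvailabilityZones": {"Fn::GetAZs": ""},
            },
        },
        # The policy that actually adds/removes instances based on average CPU.
        "CpuPolicy": {
            "Type": "AWS::AutoScaling::ScalingPolicy",
            "Properties": {
                "AutoScalingGroupName": {"Ref": "WebGroup"},
                "PolicyType": "TargetTrackingScaling",
                "TargetTrackingConfiguration": {
                    "PredefinedMetricSpecification": {
                        "PredefinedMetricType": "ASGAverageCPUUtilization"
                    },
                    "TargetValue": 60.0,
                },
            },
        },
    },
}

print(json.dumps(template, indent=2))
```

The bare-metal question upthread still stands: something has to play the role of the Auto Scaling group when the nodes are physical machines.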


How do open source solutions like minio.io on bare metal compare to s3? I don't have much experience with any of those, but it seems like a compatible solution.

I haven't tried it either... but the nice thing with S3 is zero maintenance. If a disk or machine dies, it's not my problem.

The not-my-problem aspect is really powerful.

I rent a dedi arm box from scaleway. It's been up for 700 days, and serves about 30k req/month. I couldn't be happier for the price.

For 30k req/month heroku could be a better choice. The cost here is maintenance and backup.

Note: I'm still hopeful that things like k8s will make running stateful containers on random metal easy without huge investments in configuration for backup and reliability.

curious - but how would heroku be a win? I know nothing of heroku, so not insulting it.

I pay about $3.50/mo total to run these services.

b2 is a really good alternative to s3.

I assume you mean: backblaze.com/b2

I would be concerned with how good the connection is... Ie. if you use packet or scaleway, how stable/fast is the connection to b2.

S3 being internal network at EC2 is a killer feature too :)

That price for that config sounds amazing. Any pointers?

Right this second: https://cc.delimiter.com/cart/dedicated-servers/

$50/month 24GB ram, 16 threads, 2TB HD, gigabit uplink (20TB/month free).

It's not hard to wait until they have a sale and/or coupons and/or pay upfront yearly to get a similar config for 30-40.

(disclaimer: these servers are pretty "unmanaged")

Or packet gives you a completely "cloud/api-driven" experience that's still on bare metal and reasonably priced compared to AWS or DO.

Where do you find the discounts/sales?


32€ quadcore 32GB RAM, 2x3TB HDD: hetzner.de in their "second hand market"

I've been shopping for a few months, and I can't find 16 physical cores (32 threads) for less than about $70/mo.

If you're talking about a single server, I doubt you will find anything cheaper. Compare it to AWS, something like c4.8xlarge will cost you $1.591/hr, or $735.84/month reserved.
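Putting the quoted figures on the same monthly footing. The 730 hours per month is my assumption (8760 hours / 12); the prices are the ones quoted above, and the two machines are not a like-for-like comparison.

```python
HOURS_PER_MONTH = 730            # assumption: 8760 hours / 12 months

on_demand_hourly = 1.591         # c4.8xlarge figure quoted above
reserved_monthly = 735.84        # reserved figure quoted above
dedicated_monthly = 70.0         # the ~$70/mo 16-core box mentioned above

on_demand_monthly = on_demand_hourly * HOURS_PER_MONTH

print(f"on-demand: ${on_demand_monthly:,.2f}/month")
print(f"reserved is {reserved_monthly / dedicated_monthly:.1f}x the dedicated box")
print(f"on-demand is {on_demand_monthly / dedicated_monthly:.1f}x the dedicated box")
```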

What is that $70/mo server with 16 cores?

The cheapest I see is Kimsufi KS-5 8-core for $36/month. There's also an EU hoster Netcup that offers RS 4000 8-core for around $33.

Less time than ec2? Last time I test drove packet it was around 10 min

That, unfortunately, can be less time than it takes AWS to provision the disks on some of their larger instances...

OK, what happens when your power goes out? What happens when a system goes down? Are you including the cost of your infra/ ops team in your server costs?

What about the larger upfront investment for hardware, especially over provisioned, relative to a spread out payment? What about the 5000 credits you get upfront from aws for startups, and the free tier?

As for vendor lock in, this isn't a problem most companies face - they're fine buying into AWS, and you're underselling the many, many other services they provide beyond EC2.

Interestingly this is where Google does great with fast VMs and disks along with sustained use discounts. Add in the committed use discounts and you get very close to dedicated server pricing, while getting all the flexibility and resilience of their cloud platform.

Sadly their support quality has decreased and our own account team has ignored us for weeks so maybe stick with AWS instead.

Hey there, work at Google. Please feel free to ping me with your concerns. That's certainly not the experience we strive for.

Cloud services don't save you money, they buy you focus.

Only slightly tongue in cheek: Imagine the focus you could achieve with dedicated personnel taking care of all of the operations concerns for your devs...

If you know good ops people, you have additional options. Hiring ops people and hoping they can out-value a cloud provider is a risk. Good ops people can cost as much as good developers.

If you're a larger business, you can probably absorb that risk if it goes wrong. If you're a startup, it could be catastrophic.

However, I also know larger businesses that just simply do not have good ops teams and a cloud provider would outperform all of them.

Almost every argument that you are making here could be exactly applied to developers and contractors. Hope that they can out-value a contractor. They're expensive. Some larger businesses have terrible dev teams and a team of contractors could out-perform them all.

> Good ops people can cost as much as good developers.

And this is a surprise? They bring a ton of domain specific expertise, good automation experience, and they lift the burden of managing your systems from your developers, so they can work on features and not scaling.

If you can afford to put good people in every position, by all means go for it, but that's not been how startups and small businesses hire, or else they wouldn't have to use the phrase "we wear a bunch of different hats here at St4rtUp". The apparent trend has been to overload the developer with additional responsibility instead of, e.g., making ops people build product features.

The problem with the rent-a-server approach over buying AWS is that, in many cases, your engineering team will need to implement a ton of APIs, tooling, monitoring, alerting, etc. to get the same quality of service that you get from a public cloud vendor "for free." Which is great for us engineers since it's interesting problems that need solving, but not so much for the PMs that need to justify that spend.

Also, when your web service scales to millions of requests per second from millions of requests a month or week or whatever, having elastic compute that can scale with that is a blessing. Because nobody wants to wait for page loads:

The problem with renting boxes is the hidden costs if you want to do it right.

First of all, if you have anything mission critical, you need to run it in a high availability config. This is easy for stateless microservices, but when it comes to running your DB, you start renting three boxes instead of one or two and configuring them accordingly.

And then you set up your backup infrastructure for disaster recovery; Glacier needs a replacement after all. No problem, just more disks(?) on a few more boxes(?) and bacula(?), preferably in a different datacenter just to be on the safe side, since it would be nasty if your whole rack gets fried and your data with it.

Don't forget to backup your configuration, all of it. Loadbalancers, Server Environment Variables, Network (do you have an internal DNS?), Crontabs, some businesses need their audit logs stored etc...

On the infrastructure level there is lots and lots of stuff you can do and you won't ever really need AWS, you'll just spend significantly more time finding and administering the right solutions than just using the AWS Solutions where you'll find a treasure trove of great tutorials and can relatively cheaply pay for support.

If you then pay someone on top for 24/7 management/monitoring of your dedicated stack so that your team doesn't have to get up at 3 am because some stray logfile is filling the disk on one of your VMs, many of the savings you had by setting it up on a dedicated server go out of the window, because the management partner needs to train their people to look into your infrastructure. AWS-only management partners are just light-years cheaper because they can streamline their processes much better.

You could also hire your own team of admins...

Sure, AWS is a beast with its own surprises, but overall the cost/benefit ratio is still very fair even if you factor in all the "surprises" (many of which your management partner will probably know about). Having layered support is really beneficial as well.

If something is wonky with RDS, you get to call your management partner (if he didn't detect it before you did), who can call AWS technicians if he can't tackle it himself. This gets you much, much further than you would get elsewhere. The rest of the world is paying for (for example) Percona's consultants or someone similar if the problems grow over their team's head.

Sure, at some point in a company's growth, depending on how technical the operation is, there might be a time where an admin team and colocation/dedicated boxes make sense, where AWS technicians will scratch their heads etc., especially if you have some very, very specific tasks you need to do.

But for most people this is far off if ever.

I completely agree with your line of reasoning and I have a few more thoughts on the matter. There are a ton of "soft" problems that arise when building out your stack. Licensing, warranties, and maintenance issues are things that never come up with a cloud provider. Then there are physical problems with building your own cabinet like cabling and power & heat management. The list just grows and grows.

Another big thing that rarely gets mentioned is research into exactly what hardware to purchase and how to configure it. Do you know what compute hardware you should purchase? There are 10s of vendors and thousands of options. What about network hardware? Do you know which switch is the best for your stack?

With a cloud vendor all of these questions disappear.

Even if you make a mistake in your cloud provisioning, it's easy to correct; just shut it down and start over. Make a mistake buying your own hardware and you have to live with it for 3 years or pay to purchase more hardware.

I think people tend to forget or ignore all of these costs when evaluating cloud providers. You look at the total bill each month and are surprised that it cost that much. However, the costs for your custom built architecture are likely higher, but they are spread out over more time and more projects.

There are several options between "full cloud" and "building my own datacenter": for example, you can rent dedicated servers by the month, where you get all of the hardware power and you don't have to commit money upfront.

The 2 extremes you cite might be good for some people in a narrow set of contexts, but it makes sense to include other possibilities when evaluating your hosting options.

>Then there are physical problems with building your own cabinet like cabling and power & heat management.

Did I stutter? I said no colo. You need to be ridiculously large for colo to be worth considering.

Honestly? Almost no one does this. Websites do go down if their infra provider goes down. Yeah, I will have a secondary ready to go with manual failover but backup infra? Nah.

If you do a photo-app or a to-do list that can be the attitude.

If you have customers with enterprise needs paying enterprise prices, you need to keep your SLA. Think finance, e-commerce, etc.

If your infra provider has the same SLA as you do, you have problems and should do the math and think about backup infrastructure or whatever else is necessary to have a chance of upholding that SLA.

Edit: I just now see that you misunderstood me. With "backup infrastructure" I meant an infrastructure to do backups, not another new infrastructure to sit there and collect dust awaiting disaster. That's mostly not necessary.
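The "do the math" step on SLAs can be sketched like this. The availability numbers are hypothetical, and the redundancy formula assumes the replicas fail independently, which real-world outages often violate.

```python
def serial(*availabilities):
    """All components must be up at once: availabilities multiply."""
    result = 1.0
    for a in availabilities:
        result *= a
    return result


def redundant(availability, copies):
    """At least one of `copies` independent replicas must be up."""
    return 1.0 - (1.0 - availability) ** copies


def downtime_hours_per_year(availability):
    return (1.0 - availability) * 365 * 24


provider_sla = 0.999       # hypothetical provider SLA
your_stack = 0.999         # hypothetical availability of your own software

single = serial(provider_sla, your_stack)
doubled = serial(redundant(provider_sla, 2), your_stack)

for label, a in [("one provider", single), ("two independent providers", doubled)]:
    print(f"{label}: {a:.4%} -> {downtime_hours_per_year(a):.1f} h/year down")
```

The takeaway is only that anything stacked on top of a provider ends up below that provider's own figure, so promising customers the same number leaves no margin.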

It's not just about scaling up, but also down. In my experience most startups initially greatly over-estimate what they need, and end up really using only a small part of the resources. With cloud you can provision what you think you'll need for a few weeks and then, once you get actual usage data, you can scale and fine-tune the setup to fully utilize it (and optimize the costs). And it's done with a few clicks in just a minute or two, without having to manually update anything, anywhere. No need to change DNS zones, no need to copy the data, none of the tedious work that's usually involved with moving to a new unmanaged dedicated machine. With EC2 I can do this as often as I wish, while your hosting company would probably go crazy if you asked them to provision new hardware for you every week, or change it twice a day because you've changed your mind.

> With cloud you can provision what you think you'll need for a few weeks and then, once you get actual usage data, you can scale and fine-tune the setup to fully utilize it (and optimize the costs).

Or in a more realistic scenario you forget to dial back and are paying $200/month extra for unused provisioned IO for years before anyone notices it. Happened, even though I was looking for improvements after every bill.

True, but did you ever scale down an already setup & running dedicated machine? With dedicated hardware it's always like "if it works don't touch, and we might need it one day anyway" :)

I guess the point is if you own the machine, there is no reason to scale down, because you already paid the costs upfront. In conclusion cloud providers solve some of the problems they create in the first place very well ;)

I don't disagree with your point per se as I do love working with physical gear, but I do think you're grossly missing the point in places:

> "My take: a single machine (or two for HA) will be enough"

2 bare metal instances isn't HA. Not even close.

> "if you really want to go big separate the web server from the database but that's it."

I would always recommend separating the web server from the database server on anything professional. It gives an easy clear path for scaling sideways (since you've already separated out your back end from your application), it allows you to tighten security (eg only allow access to the DB server from the web servers via the unprivileged DB user connection), it also makes maintenance easier. Even if you're only running on one physical box, put the web server and DB in their own VM or LXC/Zone/Jail container.

> The other day I saw quad E7-4870 (yeah won't win any single thread contest but has 40 cores and 80 threads) 512GB RAM servers for $299 a month, with 1TB RAM for $499. Had a low end 2TB SSD for boot and you could add 8x1TB HDD w/ HW RAID for $40...

I work with bare metal servers matching your description as well as both self-hosted and private clouds. Frankly I think your rant misses one of the most important points of working with AWS, and that's the convenience and redundancy that the tooling offers. AWS isn't just about single instances; it's about having redundant availability zones with redundant networking hardware, about being able to have disaster recovery zones in whole other data centres, and about having all of the above work automatically. Getting our self-hosted stuff even close to the level of tooling that AWS offers took months of man hours and considerably higher initial setup costs. Having to buy at least two of every piece of kit for redundancy, having to have BT lay two dedicated internet links (we have 3 now) just in case a builder accidentally cuts one of our lines, and having core infrastructure replicated off site all add considerably to both the setup time and cost. So yeah, for small businesses and personal blogs AWS is a bit overkill. But you cannot use the "high availability" argument and say "2 physical machines is enough".

Disclaimer: I've worked for clients such as Sony, UEFA and News International as well as many smaller but still sizable national publications. Our infrastructure has consisted of both scaled up physical hardware and scaled sideways virtual machines and frankly I/we wouldn't be able to offer the kinds of services we do nor with the kind of uptime we do without running a fleet of virtualized web servers.

Let's be real on high availability. If you are honest with yourself, on the cloud that doesn't mean 2.. AWS regions but 2.. cloud providers. It's a yearly occurrence now that 90% of SaaS stop working because AWS is broken, and it's not because any of the actually redundant parts like power supplies are broken, but because a human pushed software or configuration and the whole thing came crashing down.

Idealistically I agree with you but pragmatically I think more than 1 cloud provider isn't really worth the effort. It's not often that a whole region goes down, and even then I can't recall when the whole cloud platform last became inaccessible - usually it's just a region.

But once again it comes back to SLIs and client expectations.

But then, "more than 1 cloud provider isn't really worth the effort" and "2 bare metal instances isn't HA. Not even close." are really incoherent.

Two machines with two internet connections and a good UPS easily match the availability of AWS.

Not really, as HA on physical hardware would be more like 4 machines if you factor in DB, plus 2 stacks of switches. But really if you're running HA then you'd probably want 3 web servers rather than 2 so you can perform maintenance and still have redundancy. Which means you'd also need 2 load balancers and some method of code deployment, which will usually mean at least one other box or SAN. If your application is database heavy with lots of reads then you might also want memcache / redis. Or maybe other caching servers like varnish. Bear in mind that if your site is slow and unresponsive then it's as good as unavailable.

This is all in one physical location as well so you'd need to double this spec again.

Then once you've built all of that, you'd probably want to put it behind a CDN as leased lines are expensive.

Only then are you starting to reach feature parity with what I've described in my first post, and there will be lots of kit I've not even touched on.

However, even if you do just run 2 VMs (web and db) on each of the 2 physical boxes, and don't need redis etc., you still need to double your spec just for the multi-region point I raised earlier.

> So yeah, for small businesses and personal blogs AWS is a bit overkill.

Even then the answer is: it depends what you need. I run blogs on S3 + CloudFront. That's effectively content versioning and geo-distributed caching CDN for pennies. AWS is not just EC2.

True. One of my personal sites is a static site hosted on S3 and it costs me literally just $1.50 a month.

I always think of this as the "Stack Overflow approach" since I was listening to their podcast when they were building it and that's what they elected to do: https://blog.codinghorror.com/building-servers-for-fun-and-p...

The last sentence pretty much sums it up..

"I know that hardware is cheap, and programmers are expensive."

The way people here are talking makes it sound like they are spending man-years setting up a couple of servers. In my experience the time-consuming part is messing with the software stack, a problem none of these vendors of machines/VMs/etc. really solve (ignoring way overpriced shared databases/etc.). Buying a server, installing an OS/whatever on it, tossing it in the rack and providing an IP/VLAN/etc. is less than 1% of the time/effort spent on spinning up an application. Plus if you buy used, it's possible to get 3 year old machines for less money than AWS charges for a month... To which I always hear about "reliability" of old hardware, despite the fact that a large part of this rented hardware is just as old, and outside of disks/SFPs and batteries none of it really dies of old age.

As others have said, the effort to "cloudify" your application is probably more than the effort to buy some massively over-provisioned machine. Sure, all the big boys need all this fancy management, but your startup with a few hundred thousand hits a month can probably be run on a 5 year old machine if any effort at all were spent ensuring it's not doing something stupid that results in second-long page responses.

My love for AWS is their availability zones.

You can be anywhere in the world and have ridiculously small latency.

But the cost you pay sure is high.

I feel the exact same way. I run a 45M pv/mo network on a single dedicated box. Our bandwidth alone would cost multiples of what we are paying for our current server at any cloud provider. The cloud just doesn't make economic sense for a lot of workloads.
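For a sense of scale, 45M pageviews a month is a modest average rate. The 30-day month and the peak factor below are assumptions; only the 45M figure comes from the comment.

```python
SECONDS_PER_MONTH = 30 * 24 * 3600       # assuming a 30-day month

pageviews_per_month = 45_000_000         # figure from the comment above
average_per_second = pageviews_per_month / SECONDS_PER_MONTH

# Hypothetical peak factor: real traffic is not flat, so size for a multiple.
PEAK_FACTOR = 5
print(f"average: {average_per_second:.1f} pv/s")
print(f"assumed peak: {average_per_second * PEAK_FACTOR:.1f} pv/s")
```

A pageview usually triggers more than one HTTP request, so the request rate will be some multiple of this.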

Having been part of a project that utilized 5 dedicated servers at some point in time, I hear you.

Well, I agree wholeheartedly https://joelkuiper.eu/cloud

Or better yet - build your software as open source that enterprise clients or end-users can run anywhere. Eliminate "the cloud" as a single point of failure.

Rent a server: do you mean dedicated hosting or something else?

Dedi, yes.

You're giving really general advice that fits YOUR needs and experience. Meanwhile, my company has been in business for 10+ years and we have 60+ servers streaming a shit ton of data 24x7.

Running "a few" dedicated servers is not some magic catch all for everyone.

If you're streaming any kind of volume of data from AWS, you're pretty much burning money compared to other services, unless you've secured steep discounts on the published prices.

I too have run networks for companies that have been in business 10+ years, with as many servers, and I've done that on AWS, colo'd servers, managed servers, and hybrid setups of all of that, and I've yet to see an instance where AWS was cost effective at published prices for base load.

I have seen AWS be cost effectively used to handle spikes or batch jobs, and I have recommended it for clients that care more about the brand name (to tell their customers for example) than cost, or have very specific needs. I have also seen it used cost effectively once you get big enough to secure steep discounts.

Knock yourself out. But, not everybody is doing the average.

Ah, that nasty AWS vendor lockin. Guess I'll just run my own CDN then, shouldn't be hard (can't use other vendors, because lock-in). L'il ol me running edge nodes all over the five continents where we have clients. I'll have it done by closing time today.

AWS is more than just compute cores, and you're hand-waving away what it offers.

To be fair though, even with a setup like this you can still use Cloudfront, or any number of other 3rd party CDNs. Most of these providers including Cloudfront offer 'remote origin' features/configurations.

I agree that AWS offers much more than hosting, but when talking about hosting specifically the numbers are clearly better with 'bare metal' (rented or otherwise) -- it's just substantially more power per dollar. You'll need to do a cost-benefit for your own situation so YMMV of course, just as it would with any other provider or any other service your business may choose to subcontract.

I think it’s clear what was intended. Nobody disputes that targeted uses of third party infrastructure services can be appropriate. After all, it’s relatively easy to migrate a service to a new CDN - but it’s a lot harder to migrate from AWS when it’s a core part of your application architecture!

I'm just tired of seeing Yet Another AWS Debunking on HN that is solely concerned with hardware specifications and ignores the myriad of other reasons someone might go for AWS. Particularly someone who doesn't have access to 10+ year veteran neckbeard skills.

The CDN offering is one of the most commodity and portable services available. Plenty of different CDNs on the market that all support any origin.

If you're talking about actually running a global application then that would be a scenario where a network like Google's or Softlayer's does help by having 1 giant VLAN.

CloudFront is one of the worst CDNs out there, so this isn't really a good point.

Any CDN can be stood up in front of a small fleet of dedicated hosts, and you'd still be saving 80%.

In fact, if you are using AWS you really ought to put a proper caching CDN in front, given the crazy high bandwidth prices at AWS.

The bandwidth prices are high enough that I at one point mulled over setting up an "S3 compatible" storage service using S3 as the backend for durability, but storing a single local copy so that most reads avoid hitting S3.

S3 bandwidth prices are high enough that there are huge cost savings to be had if objects are retrieved reasonably regularly.

It was just a simple example that AWS is more than compute cores and disk space.

I think the comment does ring true in many cases however, where a minimal infrastructure is more than sufficient. We recently received notice that an EC2 instance was being retired and we needed to relaunch. It had no load balancing; it was just an m1.medium with a standard Java app on Ubuntu, with a static IP address, and it ran for like 5 years. No elasticity etc. Essentially 1:1 for a standard "old school" setup.

In other words, not really an appropriate (or at least not cost-efficient) use of EC2. EC2 != VPSes

That might make sense for your (organisation's) business needs, but should you continue being that pragmatic you'll end up being an unemployable mess.

Just do the whole Docker thing, the more dependencies and lock-in the better, update your CV and move on.

While the article is factually correct, the tone strikes me as being disingenuous. The problem seems to be that the servers were running on gp2 disks, which offer a performance baseline with free short term bursts based on credits collected. The author has just realised that for consistent throughput, they would have to choose provisioned throughput and pay accordingly.

This isn’t some conspiracy by AWS, though. It’s all in the documentation and isn’t even hard to find. If you want X ops per second baseline with occasional bursts you pick option A, or if you want consistent Y ops per second you provision and pay for Y. Read the manual - not having read the docs or explored the console is not an excuse to say that a service provider is being shady or rent seeking.

Even the title is misleading - Docker operations are not slowing down, the author has chosen to use cheap disks which offer temporary burst performance, and is being surprised when the temporary burst credit runs out.

Exactly. I'd go further and say the burst capacity is actually a super useful and powerful feature that is very hard to get on your own hardware. Definitely can catch you unaware if you aren't on top of it and can get expensive for some needs, but as you say not hidden at all.

I have a hacky shell script I sometimes use for moderately sized environments that don't have better monitoring set up. It reports the minimum percent of burst balance remaining over the past day, and requires the aws cli and parallel:

  while getopts "p:" opt; do
    case $opt in
      p)
        export AWS_PROFILE=$OPTARG
        ;;
      \?)
        echo "Invalid option: -$OPTARG" >&2
        exit 1
        ;;
    esac
  done

  if [ ! "$AWS_PROFILE" ]; then
    echo "-p <aws profile> or AWS_PROFILE env var required"
    exit 1
  fi

  # Last 24 hours. Note: -v-1d is BSD/macOS date syntax.
  start=$(date -v-1d +%Y-%m-%dT%H:%M:%S)
  end=$(date +%Y-%m-%dT%H:%M:%S)

  aws ec2 describe-volumes --output text --query 'join(`"\n"`, Volumes[*].VolumeId)' | parallel -j 20 "echo -ne {}\\\t && echo \$(aws cloudwatch get-metric-statistics --start-time $start --end-time $end --period 86400 --namespace AWS/EBS --statistics Minimum --metric-name BurstBalance --dimensions Name=VolumeId,Value={} --output text --query 'Datapoints[*].Minimum')" | sort -n -k 2

thanks for this script, pretty helpful!

The addition of the burst mechanism to EBS volumes is relatively new. If you have been running on AWS for any reasonable amount of time, the sudden degradation of your disks will leave you scratching your head.

Dedicated IOPS wasn't originally pitched as the alternative to burst disks - it was simply that normal EBS volumes were not predictable based on peer load, and dedicated IOPS EBS volumes were.

The author's takeaways include moving disks to io1. This is a bad bargain in most cases, and particularly bad in the ~500 IOPS range (which is what I'm seeing in the Grafana screenshot there).

gp2 disks get 3 iops per gig "free", bursting up to 3k. (They don't burst after 1 tb, because your baseline performance is higher than the burst rate.) io1 is 25% more expensive per-gb, and you pay by the IOPS on top of that.

A 175gb gp2 disk will give you 525 IOPS baseline, at $17.50 a month. I'm guessing his volume is about 40gb, doing math backwards from his bottlenecked IOPS; a 40gb io1 with 500 IOPS will cost you $83.75 a month! And on top of that, AWS will cap you hard at that 500 IOPS; the gp2 can still burst if needed.

I know of two general cases where io1 disks are worthwhile: if you need more than 10k IOPS, or if you have very high IO requirements but very stable disk space usage (e.g. high-performance RRDs). Crack open Excel and do the math; it's worth the five minutes to check.

(Also, your burst balance is available as a Cloudwatch metric, as I recall! Set alarms on that shit!)

That's what we typically do to scale most EBS volumes (critical disk-intensive workloads run on io1 anyway). Instead of switching directly to io1 we over-provision space, which solves most IOPS issues.

We recently upgraded one of our etcd clusters from v2 to v3 and it turned out that etcd v3 uses significantly more IOPS than v2, using EBS burst credit during periods with normal load. Our solution was to increase the EBS volume sizes from 16GB to 50GB (100 IOPS -> 150 IOPS [1]). Our costs went from $1.6 per volume to $5, instead of $11.75 for io1, for a comparable performance for this particular application.

1: http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/EBSVolume...
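
The arithmetic behind those figures can be sketched in a few lines of shell. The per-unit prices are assumptions ($0.10 per GB-month for gp2, $0.125 per GB-month plus $0.065 per provisioned IOPS-month for io1); they reproduce the numbers quoted above, but check current pricing for your region. The function names are made up for illustration.

```shell
#!/bin/sh
# Monthly EBS cost sketch. Prices are ASSUMED list prices, stored in tenths of
# a cent so the math stays in integers (sh has no floating point).
GP2_PER_GB=100      # $0.100 per GB-month (assumption)
IO1_PER_GB=125      # $0.125 per GB-month (assumption)
IO1_PER_IOPS=65     # $0.065 per provisioned IOPS-month (assumption)

# Print an amount given in tenths of a cent as dollars, e.g. 11750 -> 11.75.
# Fractions of a cent are truncated.
fmt_dollars() {
  printf '%d.%02d\n' $(( $1 / 1000 )) $(( ($1 % 1000) / 10 ))
}

# gp2_cost SIZE_GB
gp2_cost() {
  fmt_dollars $(( $1 * GP2_PER_GB ))
}

# io1_cost SIZE_GB PROVISIONED_IOPS
io1_cost() {
  fmt_dollars $(( $1 * IO1_PER_GB + $2 * IO1_PER_IOPS ))
}

gp2_cost 16        # 16 GB gp2
gp2_cost 50        # 50 GB gp2
io1_cost 16 150    # 16 GB io1 with 150 provisioned IOPS
```

Under these assumed prices, oversizing the gp2 volume stays cheaper until the extra gigabytes cost more than the provisioned IOPS would.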

Agreed. After a lot of struggles trying to find a good average IOPS on AWS for our database, we just increased disk to 1tb with gp2 and got rid of the problem. It gets 3000 iops/sec all the time and you never have these problems. It's still much cheaper than io1.

It's a trade you must account for when choosing cloud providers, nothing is free. For now it's still cheaper to pay for servers than spend time setting things up ourselves, but we're always measuring cost x benefits and thinking if we need to leave aws. The time is coming, but for now it's still working for us.

> we just increased disk to 1tb with gp2 and got rid of the problem. It gets 3000 iops/sec all the time and you never have these problems. It's still much cheaper than io1.

Shockingly, tons of people don't know about this. Sure, there are use cases where it probably isn't cost effective to do this, but my default policy towards using gp2 is "will this server write to disk even a moderate amount? If so, it gets a 1TB volume." Especially for something like a node in a k8s cluster where your workload is non-deterministic, this should be the default.

> Also, your burst balance is available as a Cloudwatch metric, as I recall! Set alarms on that shit!

That's my takeaway as well. If you can track and alert on it, you can provision whatever you want and only buy the expensive io1 when it makes sense.
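
A rough sketch of such an alarm with the aws cli, using the same AWS/EBS BurstBalance metric the script earlier in the thread queries. The function name, the volume ID, the SNS topic ARN and the 20% default threshold are all placeholders, not values from the thread.

```shell
#!/bin/sh
# Sketch: alarm when a volume's BurstBalance (percent of credits left) drops
# below a threshold. Set DRY_RUN=1 to print the command instead of calling AWS.
burst_balance_alarm() {
  volume_id="$1"
  topic_arn="$2"
  threshold="${3:-20}"   # percent; 20 is an arbitrary default
  ${DRY_RUN:+echo} aws cloudwatch put-metric-alarm \
    --alarm-name "ebs-burst-balance-$volume_id" \
    --namespace AWS/EBS \
    --metric-name BurstBalance \
    --dimensions "Name=VolumeId,Value=$volume_id" \
    --statistic Minimum \
    --period 300 \
    --evaluation-periods 1 \
    --threshold "$threshold" \
    --comparison-operator LessThanThreshold \
    --alarm-actions "$topic_arn"
}

# Usage (placeholder IDs):
#   burst_balance_alarm vol-0123456789abcdef0 \
#     arn:aws:sns:us-east-1:123456789012:ops-alerts 20
```

You would need one alarm per volume, so in practice this gets looped over the output of describe-volumes.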

Performance management 101 on AWS volumes:

1) The IOPS of a disk is proportional to the size of the volume: 3 IOPS per GB. You need a bigger volume to get more performance.

2) The high performance volumes (io1/PIOPS) are extortion. It's cheaper to pay for a bigger regular volume (gp2) that comes with a higher IO quota than to pay for the special high performance volume.

3) Each instance type has a disk performance cap. They are lower than you think.

4) Don't use t2 instances for anything that requires non-negligible sustained IO.
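
Point 1 and the burst behaviour can be put into numbers with a small sketch. The constants are assumptions based on the gp2 figures mentioned in this thread and in AWS's docs at the time: 3 IOPS/GB, a 100 IOPS floor, a 10,000 IOPS ceiling, a 3,000 IOPS burst rate and a 5.4 million I/O credit bucket. The function names are made up.

```shell
#!/bin/sh
# gp2 sizing sketch; all constants are assumptions, see the note above.

# gp2_baseline_iops SIZE_GB -> sustained IOPS for a volume of that size
gp2_baseline_iops() {
  iops=$(( $1 * 3 ))
  if [ "$iops" -lt 100 ]; then iops=100; fi       # floor for small volumes
  if [ "$iops" -gt 10000 ]; then iops=10000; fi   # ceiling for large volumes
  echo "$iops"
}

# gp2_burst_seconds SIZE_GB -> seconds a full credit bucket lasts at a constant
# 3,000 IOPS; prints "unlimited" once the baseline reaches the burst rate
gp2_burst_seconds() {
  base=$(gp2_baseline_iops "$1")
  if [ "$base" -ge 3000 ]; then
    echo unlimited
  else
    # Credits drain at (burst rate - baseline) per second.
    echo $(( 5400000 / (3000 - base) ))
  fi
}
```

For example, `gp2_baseline_iops 175` gives the 525 IOPS mentioned earlier in the thread, and `gp2_burst_seconds 100` gives 2000, i.e. a 100 GB volume can sit at full burst for roughly half an hour before dropping to 300 IOPS.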

P.S. Clearly the author is just discovering AWS.

tldr: AWS EC2 has the concept of I/O credits for storage. If your instance runs out of credits, bad things, which may seem completely unrelated, will happen.

I was having similar issues last week and did not consider I/O credits. I think AWS could do better at notifying you if your EC2 instance gets into this state (without having to set up a cloud watch alarm).

Rather EBS performance is non-deterministic. This is definitely not new, but is an informative deep dive.

The other big gotcha is that EBS volumes are lazy loaded. Not necessarily something that's going to bite you in a production environment regularly, but it's something that could very easily throw off your benchmarks and performance testing.

> EBS volumes are lazy loaded

TIL, that probably explains other weird performance issues I've seen.

further info: http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ebs-initi...
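
For reference, the usual workaround is to read every block of the volume once before putting it under load, so the first-touch penalty is paid up front. A minimal sketch, with a made-up function name and /dev/xvdf as a placeholder device:

```shell
#!/bin/sh
# Sketch: "initialize" (pre-warm) a volume by reading every block once.
initialize_volume() {
  # Read-only pass: data goes to /dev/null, nothing is written to the device.
  dd if="$1" of=/dev/null bs=1048576
}

# Usage (needs root for a real block device):
#   initialize_volume /dev/xvdf
```

On a large volume this can take hours, so it is something to run before benchmarking rather than during it.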

Should you ever assume IO to be deterministic? Hard drives can crash, they can be slow (especially magnetic). Networks can disconnect, become congested or have random interference.

This is amplified on the cloud, but it should also be well known by anyone that works with the cloud. These things are well documented, although I could see why you don't really look for it until you have issues with it.

I think the problem is that from your average application developer's point of view there's nothing that they can really do that they aren't already doing (unless they're blocking the main thread on disk IO, in which case they can do a little bit more).

Can't find it in the docs now, but the certification course I'm doing now claims the lazy loading is not happening anymore. Apparently prewarming volumes is not necessary these days unless you created the volume from a snapshot.

> I think AWS could do better at notifying you if your EC2 instance gets into this state (without having to set up a cloud watch alarm).

How would you configure these notifications? Who do they get sent to? How often? Do they escalate to someone else? When?

Once you answer these questions, you have come up with ... CloudWatch.

Surfacing the metric on the host itself, and not buried in the EBS volume, would be a good start.

WRT alerting - if you're at 0 credits; your performance on something obviously under load is now being degraded, why not surface an alert on it? AWS already has a configurable events system in place for EC2 - this seems like a good fit.

We hit the IOPS limit on AWS many times, both on VMs and SQL instances. The solution was always to artificially inflate the size of the underlying storage, as you get 3 IOPS per GB on persistent disks (the other option was to buy IOPS, but this somehow always turned out to be way more expensive).

This issue was in the "pros" column when we decided to move our operations to GCP, where you get 30 IOPS per GB on persistent storage, 10x more than on AWS. One way or another, if you really need _a lot_ of IOPS, you'd better stick with local (ephemeral) SSD storage – just bear in mind it will vanish along with your VMs.

In other words, if you do not rate limit yourself then others will rate limit you.

It's called out in the post: most applications are not written to deal with reads / writes suddenly dropping in throughput. How are you supposed to rate limit the IOPS of a program you don't control? Can you make `docker' do exponential backoff if it notices some operations are taking too long?

In a network scenario, the remote service could say something like Quota exceeded, please try again later. read and write syscalls don't really work like that. They instead remain in uninterruptible sleep, which means they can't wake up, or be killed or stop.
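
One way to spot this on a box is to list processes sitting in state "D" (uninterruptible sleep). A small sketch; the function names are made up, and the filter is split out so it can be fed saved ps output as well:

```shell
#!/bin/sh
# Sketch: list processes stuck in uninterruptible sleep (state "D"), which is
# where tasks blocked on slow disk I/O typically sit.

# Filter "PID STAT COMMAND" rows on stdin, keeping rows whose state starts with D.
filter_d_state() {
  awk '$2 ~ /^D/ { print }'
}

# Live view: ask ps for pid, state and command name, with no header line.
list_d_state() {
  ps -e -o pid= -o stat= -o comm= | filter_d_state
}
```

A handful of short-lived D-state processes is normal; the same processes showing up run after run suggests something is waiting on storage.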

The debugging story was interesting, but what really sticks out for me is that Amazon has pretty tight QoS working for distributed storage. That's actually a really hard problem - much harder than its better known networking equivalent. As much as I might curse the Amazon business folks for using it to screw customers, I also have to give kudos to the engineers for implementing it.

Why not use ephemeral volumes for the Docker data? This is a CI system, so the Docker images are all transient anyway. Seems like an easy way to avoid the I/O credits.

The most recent "general purpose" instance families (C4, G3, M4, P2, R4) don't offer ephemeral storage anymore. My guess is that AWS will only offer ephemeral storage for a few selected instance families in future.

So you have to decide: older instance types (which will become more expensive as they usually don't benefit from new price reductions) or no ephemeral storage

We do this for our CI and it works well; we use I3 (spot) instances, introduced earlier this year, which have large amounts of NVMe instance storage. It is essential to push everything you want to keep to S3, ECR, etc so it won't work for every case, but I'd recommend it if possible. And of course, if the instance does go down, your Docker cache goes with it so you'll suffer slow builds while it warms up.

The I3 instances are pricey, but if you can suffer possible CI outages and everything is self-healing, they're extremely cheap as spot instances.

Good point, but they may be caching or storing results on the volume

+100 points to the excellent debugging skill demonstrated by the author.

It's great to see someone in top form.

A 1TB gp2 volume is cheaper than a 1TB, 3000 IOPS io1 volume, and provides nearly identical characteristics.

Only use io1 if you have a latency sensitive application or need more than 10,000 IOPS. Even then you can RAID10 some gp2 volumes together and with enhanced networking get, I believe, 30k IOPS out of one instance.

Even then, if you really are worried about latency, AWS isn't the place to be...

I think after a decade of "Cloud" hype that many customers are finally realizing that the costs/benefits of using a provider are just more complicated and more expensive for most of their needs.

How do you flip a volume on the fly? I thought you had to do the "snapshot > make new volume > reattach" route

edit: thanks for the info, znep (hit my comment limit, hence the edit... )

Not any more, as of some number of months ago you can grow and change volume type on the fly especially if it isn't a boot volume.


Read the limitations very carefully, http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/considera... In particular if it was attached before Nov. 1, 2016 you need to do a one time stop and start of the instance or detach/reattach the volume and there are limits on instance types supported (but the definition of "current generation" is broader than you might assume at first).
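
A sketch of what the in-place change looks like from the aws cli, subject to the limitations linked above. The function name and volume ID are placeholders.

```shell
#!/bin/sh
# Sketch: change an attached EBS volume's size and type in place.
# Set DRY_RUN=1 to print the commands instead of calling AWS.
resize_volume() {
  volume_id="$1"
  new_size_gb="$2"
  new_type="${3:-gp2}"
  ${DRY_RUN:+echo} aws ec2 modify-volume \
    --volume-id "$volume_id" \
    --size "$new_size_gb" \
    --volume-type "$new_type"
  # Report the state of the modification request.
  ${DRY_RUN:+echo} aws ec2 describe-volumes-modifications \
    --volume-ids "$volume_id"
}

# Usage (placeholder ID):
#   resize_volume vol-0123456789abcdef0 1000 gp2
```

After the size change goes through, the filesystem still has to be grown from inside the instance (for example with resize2fs or xfs_growfs, depending on the filesystem).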

This is quite an awesome enhancement, we were able to transparently convert a bunch of 15TB volumes from gp2 to st1 without downtime or impact to the app, and save a bunch of money.

For me, this style of writing really detracts from whatever the article might actually say.

I actually found it enjoyable and refreshing. Technical, but fun.

EBS optimized VMs look nice on paper: you can choose the size and pay for your needs only. But once you start to use the disk in production you see the problems.

In short, if you want to use the disk a lot, you need to pay a lot. If your app is slow on aws and uses EBS as storage, increase the disk size, not the VM instance type. This is true mostly for database performance, which once the RAM is filled, relies on IO a lot to get new pages from disk.

This is the kind of advice I wish was easier to get. The AWS docs are very lacking in my opinion for "If you're exhibiting this specific performance problem the cheapest option is to increase [x]" where [x] is increase instance size, disk size, performance, add a read replica, etc...

If you don't need to deal with HIPAA, PCI DSS, etc., going with DO, Vultr and the like would save many startups considerable $ compared to AWS.

If you're interested in IO performance, maybe don't run Docker - whose main advantage over VMs is fast IO - on top of VMs unnecessarily?

Triton and OpenShift add proper isolation to Docker and hence provide fast IO since you're not adding a layer of Xen.

I love native containers, but none of the major cloud providers support them so that isn't really actionable advice for most people.
