I recently heard from a senior IT person that the company for which he works could save many millions by moving a certain application from a hairball of micro-services at AWS to a much simpler architecture on colocated servers... but the company's management doesn't want to hear any of it. In fact, management wants IT to move every legacy application that's not yet on the cloud to the cloud, specifically to AWS, because it's considered "more flexible" and perceived as "the future."
He's taken to calling AWS the "lifetime employment program for enterprise software developers."
When consulting for a seed-funded startup, I suggested they buy their own servers and colocate them. It could have saved them almost $100k a year.
Everyone looked at me like I was an alien speaking a different language, lol. Then I was politely dismissed, even though I'm experienced in running hardware.
AWS, GCP, Azure, really managed to expertly pull off the greatest heist of all time. They're useful for sure, but somehow they've gotten everyone to drink this potent marketing kool-aid that if you're not using them then you're not serious about success. So cringe.
They'd have to keep you employed to manage the hardware. What happens if you get run over by a bus (which I never hope will happen)? What happens if you decide to move to another city? There are a lot more "what ifs" by employing you to build and manage the hardware.
Saving $100k would not be worth the hassle/risk for most well funded startups.
Again, this is the kool-aid speaking. The colocation facility manages most of the complexity, and the probability of failure is basically zero for the first 4 years.
Hardware these days is really good.
People have this illusion that servers are extremely hard to maintain because big companies constantly have to maintain their 1000s, and because clouds are incentivized to sell this lie.
It would still be a huge cost saving, and it would be trivial for them to contract someone to maintain the servers for something like an hour a month plus emergency on-call. There are plenty of such people, and they don't have to be me.
I used to build PCs. You can get the same hardware for nearly half the cost of pre-built PCs from HP or Dell. I did a lot of research on the best bang-for-the-buck parts and the most reliable parts. If something broke, I'd just pull it out and replace it. Easy.
I worked in an IT department as an intern. I thought this company should build custom PCs for all their workers. You can get a much faster CPU or much faster GPU for the same price! It's a no brainer!
Then I realized that if I wanted to build custom PCs for all the workers, I'd have to provide support for all of them. I'd have to be as good as CDW, who was supplying the company with PCs. This means next-day repairs and replacements. Drivers. Security. Firmware updates. For hundreds of PCs.
Needless to say, a lot of people underappreciate all the things that vendors like AWS or CDW provide. They think they can do it themselves because they know how it's done. But they're ignoring a lot of other factors that can come back to bite you.
I'm sure you're very competent at managing server hardware. But if I'm a seed stage startup, I'd politely decline your offer.
> I'm sure you're very competent at managing server hardware. But if I'm a seed stage startup, I'd politely decline your offer.
It’s not either AWS or “bring your soldering iron”. Self-hosting K8s on managed hardware is still insanely cheaper than AWS.
GitOps, infrastructure as code, DevOps platforms, CNCF best practice setups are entirely doable without paying the Amazon tax. If your lead DevOps quits, you’re right, it’s easier to find someone who does AWS.
The last time I worked for a company that was invested in AWS, their AWS budget corresponded to five full-time senior developers in running costs.
It’s not much different from outsourcing DevOps, which I think is what some hackers frown upon, because it’s their skill set you’re buying at a premium.
Running K8S is itself a giant tax. The complexity is absurd and the components are not as reliable as having a literal team of people who are responsible for ops and making a service feature complete.
> The last time I worked for a company that was invested in AWS, their AWS budget corresponded to five full-time senior developers in running costs.
This means nothing to me -- what is the TCO of the AWS deployment vs. the TCO of the non AWS deployment? How flexible are the different systems for the future? What are the reliability/uptime requirements and the marginal cost of outage? How replaceable are the dev ops engineers who are running your K8s/managed infra?
People moving to cloud isn't irrational. There's a reason that it's pervasive.
> what is the TCO of the AWS deployment vs. the TCO of the non AWS deployment?
I can only speak to one PHP/TypeScript app with a poor performance profile: from $250/mo to $5,000/mo in running costs, not counting the migration. The team that made the app couldn’t maintain the Helm chart or the Terraform resources, adding to a DevOps-team bottleneck that appears to limit how teams innovate (kind of like how creating new branches was expensive on CVS).
> How replaceable are the dev ops engineers who are running your K8s/managed infra?
I haven’t met a competent AWS DevOps engineer who wasn’t also competent to run the clusters; it’s just that they prefer not to.
> People moving to cloud isn't irrational. There's a reason that it's pervasive.
Yes, it’s made really solid. AWS makes sense to reduce operational risk at a large scale.
The price is not my main argument against using AWS, although you’ve gotta make good money to justify the expense.
My motivation is owning one’s own infrastructure. I like to work on things that are both important and contested enough that you can’t, as a matter of principle, outsource the operation.
You can obviously buy servers from HP or Dell. I have about 80 in a colo datacentre.
Power, air-conditioning, security etc isn't my problem. That's what we pay the colo for.
I go there about once a year to install new servers, and about once at some other point in the year to replace a failed disk. I've taken various developers who are interested each time, so if I'm out of town there are several people who can cover, plus the colo staff.
You can also pay for people to do this routine work.
And then there's still the option of renting managed servers, so that company handles all the hardware.
I think it's nice to have a completely different type of work a couple of days per year.
I've found there are generally developers who are into building custom gaming PCs or Raspberry Pis or whatever, who are interested in seeing the servers and lending a hand installing new ones.
I actually am into hacking at home but I'd definitely not play around with my business' production environment. Wouldn't know where to start if sh* hits the fan.
Not sure why you're being downvoted. But I love building PCs too. There's no way I'd want to put together my company's server hardware though even though the basics are the same.
In the last few years, build-your-own storage got annoying. It used to be that most server disks were SATA, mostly in the same form factor, and SATA ports were available in abundance. You could get various more complex SFF-xyzw connectors, but the adapters were straightforwardly available.
Now there’s NVMe and there are quite a few form factors. And NVMe is generally a point-to-point link to the CPU, so you either need a motherboard with the right number and types of connectors or you need a PCIe/NVMe splitter or switch. And the latter don’t seem widely available on the non-sketchy-hardware market.
So the situation is fine for system integrators who make their own PCBs, but not amazing for DIY builders.
I mean, it's not like a colo doesn't have issues. At one point our colo was running some cable and we discovered that their cables to our rack were wiggly...after our stuff went offline intermittently. That was hard to track down.
And you need to set up the LOMs, the management network (hey, you don't want that on the public internet), VPN, serial cables (backup to network LOM), maybe a load balancer. And you have to hope that when you reboot the box it doesn't fail POST. And you have to remember to turn off power saving in the BIOS (how many people forget to do this?). And you have to set up the RAID 5/10.
And you have to do this sitting on the floor, because not a lot of places provide chairs in the racks. Oh, do you have a rack-mounted KVM/monitor/serial console or are you using your laptop with a serial cable to do all this?
I mean, it's not hard, it's just a lot of detail work. And I suspect that these skills are going away.
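FWIW, once the LOMs are on the management network, most of that detail work is scriptable. A minimal sketch, assuming `ipmitool` is installed; the hostnames and username below are made up:

```python
import subprocess

# Hypothetical LOM hostnames on the management network.
LOM_HOSTS = ["lom-web1.mgmt.internal", "lom-db1.mgmt.internal"]

def power_status_cmd(host, user="admin"):
    # ipmitool over the lanplus interface; -E reads the password from the
    # IPMI_PASSWORD environment variable instead of the command line.
    return ["ipmitool", "-I", "lanplus", "-H", host, "-U", user, "-E",
            "chassis", "power", "status"]

def check_fleet(hosts):
    # Print one power-state line per box (or the error if the LOM is down).
    for host in hosts:
        r = subprocess.run(power_status_cmd(host), capture_output=True, text=True)
        print(host, (r.stdout or r.stderr).strip())
```

Calling `check_fleet(LOM_HOSTS)` prints one power-state line per box; the same pattern works for `sensor list` or `sol activate`.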
You write with authority, but then admit you haven't used this stuff for 8 years.
8 years ago you could already set up RAID or change BIOS settings using LOM, and KVMs were obsolete. The slightly janky Java remote access tools have been replaced by HTML 5 ones, and at least HP and Dell provide tools to script setting up a fleet of similar machines.
When I set up new servers, I photograph the sticker with the serial number, MAC address and LOM password, mostly so I can be sure exactly where each machine has been racked. Connect power, ethernet, LOM. Check the DHCP server log on the LOM network to see there are the expected number of new MAC addresses present, then go back to the office to finish the setup.
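To illustrate that "check the DHCP server log" step: a rough sketch that diffs the MACs photographed off the stickers against a dnsmasq-style leases file (the lease lines, MACs, and hostnames are made up):

```python
def leased_macs(leases_text):
    # dnsmasq leases lines look like:
    # "1700000000 aa:bb:cc:dd:ee:01 10.0.0.21 lom-21 *"
    return {line.split()[1].lower()
            for line in leases_text.splitlines() if line.strip()}

def missing_macs(expected, leases_text):
    """MACs photographed off the stickers that haven't requested a lease yet."""
    return sorted({m.lower() for m in expected} - leased_macs(leases_text))

# Made-up leases and sticker photos:
leases = ("1700000000 aa:bb:cc:dd:ee:01 10.0.0.21 lom-21 *\n"
          "1700000123 aa:bb:cc:dd:ee:02 10.0.0.22 lom-22 *\n")
expected = ["AA:BB:CC:DD:EE:01", "AA:BB:CC:DD:EE:02", "AA:BB:CC:DD:EE:03"]
print(missing_macs(expected, leases))  # the box that hasn't come up yet
```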
These are great points, thank you for the insight. A small note I will add is that these days there are remote BIOS options available for servers, like iDRAC.
DRACs are a Dell specific term for what is generically known as IPMI or out of band management. They were quite well developed when I worked on servers well over a decade ago now. The good ones are basically as independent of the system they're attached to as possible and essentially give you the equivalent of physical access to a server.
Yeah, I'd consider the DRAC completely independent of the BIOS question. There are also OS tools - I believe Dell's is called OMSA - that can flip some of the switches in the BIOS (I remember this from having to mess with NUMA settings at one time), but they still needed to reboot.
As far as the crypto stuff, it's an issue. I fairly recently had to tweak some old 11th-generation Dell servers and tried out the DRACs, but Java really doesn't want you to use RC4; I remember that being the dealbreaker. Nicely, someone put a container up on Docker Hub that deals with it: https://hub.docker.com/r/domistyle/idrac6/
I have rarely needed to make changes, but if I recall correctly there were some settings that would take effect immediately, and others that required a reboot. I'd need to refer to the documentation.
I agree that there's a point where co-locating is a good thing to think about. But I don't think $100k is it. To put it in perspective, $100k is probably how much an intermediate engineer costs a company (not just salary).
In exchange, you give up the usual cloud conveniences like easier budgeting, easier scale-up (and scale-down), lower up-front costs (which is a big deal for most early startups), better SLAs, being able to use cloud object storage without paying transfer fees, access to support, fewer things that can go wrong, etc.
Buying your own servers requires an upfront investment. As 90% of start-ups fail, renting your computing resources flexibly is far better, even if it comes at a higher cost. Once your company is mature and relatively stable, it is worth considering whether hosting your own infrastructure is more cost-effective.
But now you have a culture that has no experience or value with self hosted hardware. How are you going to fix that cultural problem? Hire a bunch of outsiders? How’s that gonna fly?
Bringing your infrastructure from the cloud to on-prem is a project; you would have to hire a dedicated team. Once you have proven your start-up is providing a genuine solution, it is a lot easier to get funding for this.
You can provision dedicated servers with Terraform.
They can still be cloudy in the way you deal with hardware maintenance and risk of failure, how you authenticate with them, how you name them, and how you hook them up to VPNs. But they're physical units without the AWS bells and whistles or the AWS premium. Not everyone needs their own colocated rack, or to manage their own UPS and network peering.
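To make the "cloudy but physical" point concrete: a Terraform provider for a hosting company is ultimately just driving the hoster's HTTP API. A hypothetical sketch of the order payload such a provider might build (the field names, plan, and datacenter are illustrative, not a real API):

```python
import json

# Build the JSON body a provisioning API (or a Terraform provider acting on
# your behalf) would POST to order a dedicated box. All names are examples.
def server_order_payload(name, plan, datacenter, ssh_key_fingerprints):
    return {
        "name": name,
        "plan": plan,                      # a fixed dedicated-server SKU
        "datacenter": datacenter,
        "ssh_keys": ssh_key_fingerprints,  # keys uploaded beforehand
    }

payload = server_order_payload("db1", "AX41-NVMe", "fsn1", ["aa:bb:cc:dd"])
print(json.dumps(payload, indent=2))
```

The point is that "infrastructure as code" doesn't care whether the thing on the other end of the API is a VM or a physical machine in a rack.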
The main problem is OpEx versus CapEx, not whether you can Terraform your infrastructure. It depends on your business model whether self-hosting is more cost-effective than the cloud. Cloud is more flexible and as a result you pay a premium; however, the cost differences are not as large as claimed in this article.
I think you're missing GP's point: On-prem/self-hosting vs. cloud aren't the only two solutions. You can also simply rent a managed (physical) server, like many people did before The Cloud became a thing.
> I wouldn't say that renting managed servers is simple at scale.
…unless they are not physical servers but vservers. Hetzner, for instance, has such a "cloud" offering where you can dynamically spin up new vservers via an API/Terraform.
They represent yet another point on the spectrum but are not "fully cloud" yet if we take "fully cloud" to mean that, beyond hardware, the software stack is partly managed by a third party, too.
Put differently, to me the spectrum looks more or less like this:
- On-prem self-hosting (own data center, own hardware etc.)
- (Off-prem) non-managed hosting (data center provided by third party, but you're providing the server hardware and put it in one of their racks)
- Managed hosting of physical servers (data center + server hardware provided by third party)
- Managed hosting of vservers (Hetzner Cloud, AWS EC2, …)
- "Full cloud": Software stack is partly managed by 3rd party (e.g. managed Postgres, managed Kafka, managed ElasticSearch, serverless, etc.)
This is a very good point. In other words, it's a classic Capital Expenditure vs Operating Expense. For startups it's usually more efficient to have a higher OpEx in the beginning and focus on optimizing OpEx later.
If your startup survives the next 6 months, it is already more cost-effective to buy a server.
And if the startup fails, you still have a server to run your next startup on.
A significant number of "cloud migrations" involve moving an application running on Linux on a server to running on Linux on a VPS for similar or higher cost. Not even using ANY of the "features" of the cloud, just treating it like a server. But then they can say "cloud" and everyone's happy.
Ah yes, the "actual" way to use cloud is to undergo a major re-write of your application such that it is inextricably tied into the proprietary PaaS options of your cloud provider.
Cloud is amazing for highly variable loads. I also see it as a bit of a luxury service for ops types (like me) - I don't have to go to a DC and manage hardware or deal with a DCops crew, so that's nice, but you probably wouldn't buy luxury cars if you were starting a courier business.
Let's face it, fear of vendor lock-in is almost always bullshit.
Everyone's locked into a vendor at some level or another. Want to move off of Oracle? It's work. Want to move off of servers and into VMs? It's work. Do you want to move off of some Java library like Spring? It's work. Even moving from apache to nginx is work (assuming you're not just serving static files).
Your technology choices always lock you in. You can minimize the cost of moving with architecture, but there's only so much you can do.
The key isn't avoiding vendor lock-in, it's understanding what the tradeoffs are for that lock-in and making sure it's a decision instead of a by-product.
It's like writing your stuff in NEW_TECH_LANGUAGE. Sure it might give you better performance, but now you have a bunch of deployment and maintenance headaches, and your people will cost more and be harder to find. And does it even work/scale in production/HA? How do you monitor it? What kind of weird runtime things are you going to get abused by?
I'd say there is a difference between technical path-dependence/tech debt, and a vendor who's actively able to take technical and legal steps to try and compel you to keep using their services. And obviously you should be getting something for the lock-in, otherwise you're setting money on fire for no reason. The fear is what happens once you are well and truly boxed in - somehow IBM still sells a bunch of mainframes, for instance. (Maybe they're truly amazing machines still; I know nothing about that world).
I do spend a fair amount of time and effort trying to keep my company nimble; we try to use open source software and generic cloud primitives so we can move with some ease if need to. As you say, it is all work.
I've never heard of AWS suing someone to keep using their services.
I wouldn't be surprised if Oracle or any outsourcer did that...but that implies a breach of contract, not a transition after a contract is done.
IBM sells mainframes because they're impossible to replace due to the way they're used. Those big printers that print financial statements apparently don't talk to anything else (funny, I was just talking about those the other day). It's about I/O and throughput with those things. You'd think someone would have come up with a replacement, but presumably the market is so small that nobody cares.
What really annoys me is that few if any clouds actually allow "dynamic scaling" of a single running operating-system instance, where you can hot-add more RAM or CPUs, with or without a restart.
Some can do it, some require you to basically image the entire machine onto a new one, some can't handle it at all.
I’ve never worked at a cloud company, but I know more than I’d really like about some of the stuff under the hood.
First, changing out the CPU type to something that isn’t almost identical out from under a running VM is a mess and may require degrading the system by removing features from the starting CPU. Switching manufacturers at runtime (AMD vs Intel), while sort of possible in theory, is effectively a lost cause.
To add CPUs, you have to deal with the architecture’s nasty hot-add CPU mechanism, and the guest OS needs to be expecting those CPUs starting from when it boots, and that expectation isn’t free. (The latter is Linux’s num_possible_cpus vs num_present_cpus.)
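On Linux you can see that possible-vs-present distinction in `/sys/devices/system/cpu/possible` and `/sys/devices/system/cpu/present`, which contain list strings like `0-3` or `0-3,8-11`. A small parser for that format, with made-up values standing in for a hot-add-ready guest:

```python
def parse_cpu_list(s):
    """Parse the kernel's CPU list format, e.g. "0-3,8-11" -> {0,1,2,3,8,9,10,11}."""
    cpus = set()
    for part in s.strip().split(","):
        if "-" in part:
            lo, hi = part.split("-")
            cpus.update(range(int(lo), int(hi) + 1))
        elif part:
            cpus.add(int(part))
    return cpus

# Made-up values: a guest booted with headroom for CPU hot-add.
possible = parse_cpu_list("0-15")  # what the guest was told it may ever have
present = parse_cpu_list("0-3")    # what is actually plugged in right now
print(len(possible) - len(present), "CPU slots held open for hot-add")
```

The gap between the two sets is exactly the bookkeeping the guest has to pay for from boot, whether or not the extra CPUs ever arrive.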
Adding memory involves getting that memory into the kernel’s memory map in the right places. This is more complex than one would like. Removing memory is worse.
And all this happens, in the cloud data center, on essentially normal hardware. If you want to add RAM, the CPUs need to be on a system with more RAM, preferably on the same NUMA node. And vice versa for adding CPUs. If the tenant is paying for local storage, that needs to move, too.
I get why actual "hot add whilst running" isn't a common feature (VMWare can do it but often does it via various tricks, such as the instance is already 'sized' with the number of CPUs/RAM and some are marked inaccessible).
The thing that really pissed me off is when a cloud provider is UNABLE to add more CPU and RAM on reboot. I even understand having to migrate the instance disks to another host that has CPU/RAM available.
Your frustration with VMs and cloud providers is completely understandable. VMs are essentially entire machines, which makes hardware manipulation quite challenging. However, have you considered using containers instead? They are often easier to manage and offer more flexibility in terms of scaling resources up or down as required. Let me know if you are a developer, as I may have some recommendations to share.
From a biz perspective, that can be a bad move: it signals that you're not in a growth phase but cutting costs to preserve profits. Using the cloud de-risks your future costs because it's managed for you, and you can then allocate capital towards making better features. Doing it locally means having a team of IT admins/professionals (of which you normally only need a few), which also incurs a lot of cost and risk.
Siloing expenses into capex and opex gets a bit silly when the cost effectiveness of that opex becomes low enough. For some cloud services, one can replace it with on-prem or colo’d equipment and recover the capex in months.
Similarly, “easy” scalability doesn’t necessarily make up for COGS when scaling up. If I can flip the autoscale switch while I grow to $1bn ARR and pay $750M/yr to the cloud, I’m not doing nearly as well as if I hire ten FTEs and pay $150M in combined capex and opex over two years.
(And if my COGS is 75% of the list price, it’s hard to give discounts, pay for sales, etc)
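Putting the hypothetical numbers from above side by side:

```python
# The comparison from the comment, in numbers. All figures are the
# hypotheticals given above, not real financials.
arr = 1_000_000_000            # $1bn ARR
cloud_cogs = 750_000_000       # $750M/yr to the cloud
owned_cogs = 150_000_000 / 2   # $150M capex+opex amortized over two years

print(f"cloud gross margin: {(arr - cloud_cogs) / arr:.1%}")   # 25.0%
print(f"owned gross margin: {(arr - owned_cogs) / arr:.1%}")   # 92.5%
```

At those margins, the first business has trouble funding sales and discounts; the second looks like a normal software company.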
I think a lot of companies are going to find that they have the same relationship with AWS as banks do with Bloomberg i.e. a massive spend that they constantly try to manage downwards but can never escape.
I helped migrate a PHP app to AWS and K8s for a startup that got acquired a couple of years ago.
Before it was running on a dedicated Xeon with the database available on a UNIX socket. After... where do I start.
The app was very database heavy and experienced timeouts due to too many database calls (IOPS). Not a problem prior to migration, because you don’t pay separately for UNIX sockets. Just the IOPS ended up costing several hundred bucks a month.
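A back-of-envelope version of that IOPS bill, using the approximate US list price for io1 provisioned IOPS (treat the rate and the IOPS figure as assumptions; prices vary by region and change over time):

```python
# Rough sanity check on "just the IOPS ended up costing several hundred
# bucks a month". The rate is an assumed approximation of the io1
# provisioned-IOPS list price; the workload figure is hypothetical.
io1_per_iops_month = 0.065
needed_iops = 8000            # hypothetical figure for a chatty app
print(f"${needed_iops * io1_per_iops_month:,.0f}/mo for IOPS alone")
```

A UNIX socket to a local disk, by contrast, bills nothing per call, which is the whole point of the parent comment.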
All environment variables were stored as secrets at $0.40/mo. per variable. I know this is peanuts, but it is so illustrative of the pricing model.
It feels like those candy bags where each tiny piece of candy is individually wrapped in plastic.
This is a new type of conflict of interest.
I had assumed only good intentions from IT guys.
But you're right that some of them may be more interested in the complexity of the solutions used, bigger budgets, and their personal job security than in company results and sustainability.
It's not like they want complexity, it's the latter: the "nobody ever got fired for buying AWS" problem. They stated it explicitly. They're terrified of anything counter-cultural or that seems to be going against the tide because they want to be seen as in the future and not stuck in the past. Fashion, FOMO and weakness combine to yield a "cloud at any cost" mentality.
There are a variety of other expenses associated with an on-premise environment that are considered indirect. These are often referred to as "hidden" expenses because of how often they are overlooked, rather than because they are actually hidden. They include:
- The real estate of the storage space used for the servers
- Tools used for temperature control in the data center
- The cost of setup, configuration, and ongoing upgrades
- Staff salaries for administrators that maintain an on-premise data center
- Networking infrastructure setup and ongoing maintenance
- The cost of downtime while the team troubleshoots an issue
- Productivity lost when the system experiences downtime
- The cost of keeping the servers powered 24/7
- Depreciation of the hardware and software
- Time spent on disaster recovery
- Administrative costs beyond IT staff, such as HR, purchasing, finance, and other departments
Hosting providers such as Hetzner or OVH take care of all of that for you and include it into a fixed, monthly price per server.
However, I would like to address some of your points, which I find absolutely ridiculous:
> Staff salaries for administrators that maintain an on-premise data center
Every cloud-native company I've been at had an entire team wrangling YAML files around. I'm not exactly sure why they had to do so (because everything was stable and as you say, the cloud is supposed to handle all maintenance/etc for you) but they did and cost a pretty penny.
The only case where I genuinely agree that the "cloud" saves money on administration is fully-managed PaaS providers.
> The cost of keeping the servers powered 24/7
A non-cloud-based, peak-capacity-sized deployment costs less to run than a cloud-based deployment scaled at minimum capacity. Servers are super fucking cheap nowadays. Hetzner will happily sell you a 16-core, 128GB-of-RAM, 4TB redundant NVMe SSD machine with 20TB of included external bandwidth for ~150 bucks a month. AWS will cost more than that on bandwidth alone.
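A quick sanity check on the bandwidth claim, assuming the long-standing ~$0.09/GB first-tier internet egress list price (region-dependent, and AWS includes a small free allowance, so treat this as an approximation):

```python
# Back-of-envelope for "AWS will cost more than that on bandwidth alone".
# The $0.09/GB egress rate is an assumed approximation of AWS list price.
hetzner_all_in = 150             # $/mo for the whole box, from above
included_egress_gb = 20 * 1000   # the 20 TB included with the server
aws_egress_cost = included_egress_gb * 0.09
print(f"AWS egress alone: ${aws_egress_cost:,.0f}/mo vs ${hetzner_all_in}/mo all-in")
```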
You speak as if clouds never have downtime, or don't charge you for space and power. Look at the article: that one company would have needed to pay more than their entire revenue to AWS for worse hardware. They'd have spent $400M in 3 years on the cloud! There are no "hidden costs" that can even begin to approach a fraction of that.
As for staff salaries, lol. It's not like AWS is self administering.
In the cloud, costs are transparent, as you receive a single monthly bill. When you self-host, there are many hidden expenses you would need to track down (energy, admin, maintenance, etc.)
The figures in the article are for illustration purposes only and should be taken with a large pinch of salt. The author doesn't detail what hardware they are running, which EC2 instances were selected for comparison, or how comparable the storage statistics are. I would also love to hear the Finance Department's version of their calculations.
AWS is more expensive than self-hosting. However, it is not as skewed as the author claims. Otherwise, very few companies would be using the cloud.
But that doesn't really prove much does it? You are saying they could rearchitect/simplify the whole thing and move it on site to save money - but what about rearchitecting the whole thing and leaving it on AWS?
Which part would actually be saving them the money: rearchitecting/simplifying, or changing where it runs? There would need to be an apples-to-apples comparison for it to be meaningful.
That said, I don't believe everything needs to be in the cloud, but maintaining data centers is also very expensive, and it needs to be accounted for very differently.
Tale as old as time. I heard the same with Exchange, MSP contracts, etc. Some suit somewhere always has some CYA reason, no matter how silly or wrong it may be.
Retail cloud setups provide standardization that hosting on bare metal may not.
The irony of it is most of the cloud services are open-source technologies packaged up to be easy to use and administer from a web interface.
Maybe if someone came up with a set of tooling in AWS to use many Lightsail VMs to provide more of the basic AWS services, it could be a way to get the best of both worlds.
I often fantasize about moving a bunch of my company's crap off of ECS/EKS and onto a managed colo with just old-school ansible deployments. I've even spec'd out some bare metal from our local colocation facility and the servers they offer are so ridiculously powerful and CHEAP. We could run our entire production system in a half-rack for about 10% of what we pay AWS.
But alas, nobody will ever go for it and I'm left to dream.
I had to look it up - Ansible's initial release was just over 11 years ago. Calling it "old-school" sure took me by surprise, though I suppose that is a long time in the tech world. To me, old school is doing everything via Bash or even by hand.
To me, old-school is the approach of updating servers so they match the desired configuration, rather than wiping the servers and deploying new versions of your containers.
I’m thinking more about evolution within the space of automated configuration tools.
I use Ansible at home as a replacement for the 1990s style of sysadmin where you manually managed your servers (which I did until the mid 2010s). Yet I'm aware that Ansible is not declarative. It looks declarative, if you take your glasses off and don't think too hard about it, but the key to understanding Ansible is really that the configuration is idempotent (not declarative).
Servers are cattle either way, it’s just that with Ansible, your servers will keep changes that you made months or years ago that have been deleted from your playbooks. That, to me, is what makes it old-fashioned.
I don’t think old-fashioned is bad. I’ve gotten into arguments where people have told me to manage my personal server(s) using K8s and I, politely as I could manage, told them to fuck off.
The additional complexity does not bring sufficient advantages. I’d like to spend my time writing content for my web site, not configuring container orchestration tools which will sit idle for years at a time.
I feel like you have a unique perspective to offer here - though at the moment I disagree given the lack of detail.
What about the specialized services from AWS provides leverage outside of the “cattle mindset”?
In my experience, AWS’s primary value and leverage is based on intermittent burst compute. It’s why you rarely get a hard answer on the Hz of each vCPU and have credits for the overuse or underuse of said instances. The cpu is then used to arbitrage value in the dedicated services like lambda and others - so your mindset and infra needs to match AWS’s incentives to get the same value, correct?
The main disadvantage of the colo servers I administrate is the lack of flexibility for periods of hours to days.
We use ~50 fairly high performance (RAM, CPU, disc) servers pretty much continuously, so the savings compared to AWS etc are considerable. Every couple of years there's a software upgrade required, and it would be convenient to have another 50 servers to use during the transition, just for a few days.
Compared to AWS prices, we can easily justify keeping some old servers around to help with this sort of thing. A managed server company (Hetzner etc) would be somewhere in-between; presumably able to rent us 50 extra machines for the duration, but (last time we checked) still more expensive than managing the servers ourselves.
> The cpu is then used to arbitrage value in the dedicated services like lambda and others
Could that only be the case with t* instances?
I learned the hard way that CPU cache misses can add as much as 30% to request latency, especially when you often do I/O, e.g. to external services like the DB/Redis/etc.
I didn't notice such behavior on AWS.
But thinking about it, I read somewhere that one reason they have dedicated CPUs is to partition the CPU cache so it's not shared between VMs. So maybe with some tricks they could get some free compute without side effects?
None of these are the gotchas cloud people seem to think they are. The answers look like this.
Who's going to be on-call: the same people who are on call now. HW failures don't need special on-call people.
Who is going to build/manage your monitoring/alerting stack. Same people as now. Your servers are monitored, right? Adding RAID into the mix is hardly difficult.
Will you hire a network engineer? No, colos will often offer to manage the network for you up to the switch and at that point it's just a matter of plugging the server in. You don't need any expertise to do that and remote hands can do it for you.
Redundant hardware? Only if you need it currently. HW is reliable these days and you're going to have spare capacity anyway, so this really doesn't change much.
Disaster recovery? Take whatever answer you currently have in the cloud for multi-AZ.
Direct-attach storage is a major improvement for things like databases, which you pretty much never get in the cloud, at least not in a persistent form.
For conventional relational databases like Postgres, multiple nodes only give you reliability, not performance (ignoring things like read-only replicas which your application explicitly has to choose to query), so horizontal scaling doesn't help there.
Enterprise-grade SSDs is what I meant by direct-attach storage - I was comparing against network storage which is what all cloud providers use (your EBS volume is accessed over the network internally, which incurs some latency, ultimately limits IOPS and destroys random access performance which can't be cached or read-ahead by the underlying hypervisor).
“multiple nodes only give you reliability, not performance”: this, actually, is not true. Even MySQL can offer ways to tune and gain performance improvements.
Those SSDs are ephemeral, so unsuitable for long-term storage. If you wanted to use them you’d need a task that copies the entire persistent volume into the ephemeral SSD at boot, and then the reverse on shutdown (and pray that the machine doesn’t die abruptly as that would mean you lose the SSD without doing that reverse copy operation).
Presumably they would also lack the durability guarantees of the normal block storage offered by the provider.
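The copy-at-boot pattern described above can be sketched roughly like this. The paths are hypothetical, and `shutil` stands in for something like rsync; this is an illustration of the pattern, not anyone's actual setup:

```python
import os
import shutil

def sync_dir(src: str, dst: str) -> None:
    """Crude one-way sync: replace dst with a copy of src (rsync stand-in)."""
    if os.path.exists(dst):
        shutil.rmtree(dst)
    shutil.copytree(src, dst)

# At boot: hydrate the fast ephemeral SSD from the durable volume.
#   sync_dir("/mnt/persistent/db", "/mnt/ephemeral/db")
# On clean shutdown: flush changes back to durable storage.
#   sync_dir("/mnt/ephemeral/db", "/mnt/persistent/db")
# If the machine dies abruptly, the shutdown copy never runs and all
# writes since boot are lost - which is exactly the risk noted above.
```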
Ahrefs could probably save another few hundred million if they didn't repeatedly visit the same links over and over indefinitely to find the same error codes or a binary file that they surely don't care about. I see them in my logs for my podcast hosting service, hitting the same 404s for months and months. They end up hitting audio files and downloading (or attempting to download) many gigabytes of content each day. They don't do anything with audio! As soon as they get the HTTP headers, they could say "oops, I don't care about this" and disconnect. And in many cases, they got those URLs from the enclosure tags of RSS feeds... I'm not sure what they were expecting!
In the cloud, not in the cloud, frankly it seems to me like their business would be far more efficient if they tuned what their crawler actually crawled.
Ahrefs is like my white whale. I hate them. I can't stop myself from battling them. I hate the fact that they ignore status codes, I hate the fact that they ignore crawl rate specs in robots.txt, I hate the fact that they crawl URLs on my sites that have never and will never exist and keep coming back to do it again.
I cannot express emphatically enough how powerful the concept of classes of service can be for simplifying your production issues.
I think people see the notion of network neutrality as an all-or-nothing thing.
Network Neutrality between networks, yes, but within your racks you’re trying to balance a large equation instead of two or three smaller ones.
If they reclassified sketchy pages they wouldn’t have nearly as much of the sorts of trouble you mention.
Entirely predictable. Cloud is good for some use cases. Self hosting is good for some other use cases. Neither is objectively “better”. Both should be seen and used in the context of a task in hand.
But alas it is human nature to pick up a hammer, call it the best thing in the world, and start whacking it at every problem. See: cloud, blockchain, k8s, and now chatgpt/ai.
In the last 15 years that I've seen this type of clickbait, 100% of the time it's been written by someone who doesn't know how to do a TCO calculation. Of course, this time was no different. I should have saved myself the agony after reading the first sentence: "Clouds for IT infrastructure are so popular lately that moving into the cloud has become a trend." Please.
The thing is, his TCO would include training, rebuilding their existing processes and procedures, and having to run stuff in parallel for a while. That's not including a re-architecture and the impact it would have on their stuff today.
Sometimes all people need is a big box in a data center. Management of the environment is already a sunk cost. The hardware is already expensed. They don't care about scalability/DR/etc.
From a hard dollar point of view staying colo makes sense. Plus politically they don't want to do it, which is really the only point of view that mattered.
If they needed to have actual DR then the numbers would be different.
Ah, so now the developers will service the AWS instances on your end, which means less time delivering new features... that's also left out of the equation quite often :)
They’ll also wait less for IT to service their tickets requesting new infrastructure, which leads to more new feature development. There are a variety of trade-offs; not all of them have the same sign.
For example, hardware doesn't just survive five years. As the hardware ages, the failure rate increases, and the calculation doesn't account for this at all.
Regardless of warranty, hardware still fails and needs to be replaced, and that requires maintenance. Beyond that, can they afford to wait for Dell to fix the faulty device or send a replacement, or do they need several spare units on hand to address outages quickly?
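To put rough numbers on that maintenance burden, using the drive counts mentioned later in the thread: the ~2% annualized failure rate here is an assumption (real AFRs vary by model, workload, and age):

```python
servers = 850
drives_per_server = 16   # per the hardware spec quoted in the thread
drive_afr = 0.02         # assumed ~2% annualized drive failure rate

failures_per_year = servers * drives_per_server * drive_afr
print(failures_per_year)  # 272.0 -> roughly five drive swaps a week
```

Five hot-swaps a week across a whole fleet is routine remote-hands work, not a staffing crisis, but it is real recurring labor the napkin math should include.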
Do you think that would make any impact at all when they'd be looking at a $400M difference in hardware costs? People these days act like you need 5 PhDs to plug in a computer, it's insane.
Yep. A completely natural instance of “computer nerd thinks that their expertise is transferable”. The tech industry is rife with this sort of thinking.
I really don't miss running my own kit (colocated or directly owned)
I don't miss cycling around Manchester on a Bank Holiday weekend because I'd miscalculated how much network cabling I'd need for an upgrade.
I don't miss keeping a spreadsheet of storage so I knew when to order disks, or negotiating with suppliers on the cost of new disks because I was buying in slightly smaller bulk than AWS
I don't miss having to explain to folk in datacenter support that they could take the disks out of my failed server and put them in a new server if they had one available
I don't miss the day the single point of failure in the rack failed and everything was offline while I waited for a new doohicky to be shipped to me because it didn't make sense to keep spares of everything on hand
I don't miss trying to figure out if some new generation of server hardware would work for or would fit in my rack as manufacturers stopped making the kit we did use
I'm not going to say every workload should run in the cloud (cliche nod to StackOverflow) but it certainly isn't free to get all of the benefits
It's good they did the math. AWS doesn't fit everything.
Really, the main things that cost actual money in AWS are memory and CPU (that's including elasticache, RDS, etc). Bandwidth can be negotiated away. I'm sure you could do some kind of super deal on CPU/RAM, but we've never bothered.
Would it be worth it to rearchitect your app so it doesn't use 2TB of RAM? Probably not. You have a lot of sunk costs, and redoing everything will probably break everything. You guys are big enough that CapEx doesn't matter that much. You just need big boxes with lots of bandwidth.
If you're happy with what you've got, stay with it.
> We use high core-count CPUs, 2TB RAM, and 2x 100Gbps per server. On average, our servers have about 16x 15TB drives.
> So let’s say we run our 850 servers
I'm not familiar with this company, but their website says: "We’ve been crawling the web for over 10 years, collecting and processing petabytes of data every day." They have their own search engine (named Yep), provide all kinds of reports, etc.
I'm guessing they are running huge Hadoop clusters or something? One of those workloads that just isn't suited for the cloud, and they're just taking the advice of Joel Spolsky
“If it's a core business function — do it yourself, no matter what.”
I'm familiar with the product. They help you optimize your SEO: SERP scores, backlink health, organic keyword performance, etc. Basically, everything to do with Google SEO.
I think calling them "just a bunch of Hadoop clusters" might be a disservice to the amount of effort that goes into building a product like this, especially when millions of digital marketing professionals use their service.
I feel like you’ve used a dishonest interpretation of this comment to push back against it and flex that you know what this company does?
The only person here who said “just” is you. And “large Hadoop clusters” can be the most resource-intensive component of their infrastructure without being the company’s value-add, or where most of their effort goes.
One interesting question is: if you started today, would you be able to afford boxes that cost 60k? Or would you do your software some other way that doesn't require 2TB of RAM?
Obviously when you started, your stuff didn't need 2TB of RAM. If I read your history correctly, I don't think you could even buy a box with 2TB of RAM back then. That's enterprise-grade hardware, which today costs a fortune. Back in the day it would have been a bigger fortune.
Instead, you probably started the way everyone else did: with maybe a 4GB or 8GB Linux box at home, a bunch of curl scripts, and a local instance, and built things up from there.
So why the resource requirement? An in-ram database?
It says they have 850 servers, so 1700TB RAM (if they're all like this).
If it's a big Hadoop cluster or similar, there's potentially huge performance gains by keeping data in-RAM during processing. They have many petabytes of data, so I doubt the type of processing they are doing today would have been possible when they first started.
I kind of wish they hadn't. Ahrefs claims to be in the "SEO" business, which as far as I'm concerned, has helped ruin web search, and indeed, the culture of the web in general.
Also, Ahrefs bot doesn't handle some things very well. I made an "infinite web site" in PHP a while back, and used Apache mod_rewrite to send every Ahrefs request to the infinite web site PHP program: https://github.com/bediger4000/infinite-fake-website
Ahrefs' bot really freaked out, unlike professional bots like Google's, and even Yandex's.
Do you really think they act any differently? They're in the SEO business, which is directly parasitic. They're not going to spend money improving their software, that's a cost center, not a money maker.
I wonder how many events like PayPal cutting off WikiLeaks it will take until companies understand that they make themselves far too exposed to a total takedown when moving everything to the cloud.
Pretty interesting point there at the end of the post:
But with the mass layoffs in Big Tech in recent months, this may be an opportunity to re-evaluate the approach to the cloud, consider a reverse migration from the cloud, and hire seasoned professionals of the data center world.
I do think that with all these layoffs we're bound to see some interesting things come from the people who have been let go, either through their work at other places or at new startups.
Despite what everyone thinks about doing things locally being cheaper, if you're running a business, once you allocate money for headcount and benefits plus the additional risk of maintaining a team, it rarely makes sense. HN is full of people who run a hobby server and think that a Fortune 500 should manage its own Postgres.
If you think you need more headcount to do things in the cloud, you're doing it wrong. On-prem, the identity setup alone needs multiple people - it's almost a full-time job for them. In the cloud you don't have to worry about that at all, and it comes with full auditability and compliance.
Yeah that’s the sales pitch. As soon as reality sets in and you need to integrate with an existing identity service for example, things can get pretty complex quickly.
AWS & Co. use cases always look nice on paper; in reality it’s often not as easy as advertised.
Surely the cloud managed solution reduces the amount of custom magic that I have to put into place to keep things running. Everything is standardized & documented. There is a huge community to turn to if help is needed.
And far less work is required if I get hit by a bus and someone else becomes the owner of the huge system that I previously owned.
It is standardized and documented if you DO document it and properly use IaC like CloudFormation or Terraform. But you can absolutely have people do ad-hoc changes in AWS that are not gonna be reproducible.
Also, all of your points apply to non-cloud systems too.
The performance benefits of a real DB server with local NVMe and a TB+ of RAM can save TONS of engineering effort/time. I have run 10TB+ PostgreSQL DBs both in the cloud and on bare metal, and TBH, in terms of time required to maintain, there isn't a huge difference once you move beyond what RDS can do.
That being said, if RDS can do what you need USE IT. You can always migrate later and it will save you the need to hire someone who really knows postgres.
Everyone loves to run their own server in the garage until they’re on vacation in Thailand and the power goes out back home.
I actually saw a very impressive solution to that problem on an indie hacker forum recently. And wow could that person save effort with even just a $5/mo VPS.
It depends. Are these PAYGO prices or reserved-instance prices? There is a HUGE difference. If these are PAYGO prices, the cost of reserved infrastructure in AWS would probably be about the same, WITH the added flexibility. This is why not just ANYONE should be able to spin up infrastructure at cloud providers. You should hire someone who knows what they're doing.
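The gap is easy to illustrate with made-up numbers. These hourly rates are hypothetical, not actual AWS prices; effective discounts for multi-year commitments are often somewhere in the 40-60% range:

```python
hours_per_year = 24 * 365
on_demand_hourly = 4.00   # hypothetical on-demand rate for a large instance
reserved_hourly = 1.60    # hypothetical effective rate, 3-year commitment

print(on_demand_hourly * hours_per_year)  # 35040.0 per instance-year
print(reserved_hourly * hours_per_year)   # 14016.0 per instance-year
```

Multiply a 2-3x per-instance gap across hundreds of servers and the comparison changes completely, which is why knowing which price list an estimate used matters so much.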
This napkin math is pointless. If you're a business leader, you're going to factor in non-physical costs as well, since you'll most likely need a team to run hundreds of servers. If the business all of a sudden starts to go under, you can pull the plug on the cloud, but you'd have to write the hardware off if you're self-hosting.
Hosting yourself makes sense if you're providing hardware-level services like storage or compute. In that case, going to the cloud is literally financing your (potential) competitor.
I worked for a small (~300 headcount) software company that did CFD software.
We were told to build out our HPC capacity so that customers could use our hardware to run their jobs instead of having to manage a cluster themselves to run our software. The software was billed per core-hour, and that was the only charge. It made no difference if you ran it on your infrastructure or ours - there was no additional charge for our compute, storage, or network bandwidth.
We bought compute in units of between one and four fully populated racks, usually leased with a trivial buyout at the end, or just bought outright.
In the last 24 months we were an independent company, our SaaS infrastructure drove an additional $24 million to EBITDA. In that increment, we spent $9 million total on hardware, colo, network connectivity, and our salaries. The total cost of replicating our compute capacity (for those 24 months) on AWS was ~$31 million. This all came out in the due diligence that we had to do as we were being bought by a larger firm, so the accountants were satisfied that the numbers were accurate.
IOW, the article seems to be completely plausible.
Providers like OVH, Hetzner, etc will provide you servers and handle maintenance just like cloud providers, and it turns out that in both cases you need to hire sysadmins (or "DevOps" engineers). You don't need extra staff unless you do actual on-prem which very few companies need (most can be served by a provider like the aforementioned ones).
Did you see how much we saved compared to AWS? We could even hire an admin for each of the 850 servers and would still have money left.
But we have only one person taking care of hardware replacements. That is enough for the whole setup.
Sorry, but if I'm reading the article correctly, you didn't save any money, because you weren't using AWS in the first place. All you did was some back-of-the-napkin math on what it might cost you to run things on AWS.
Isn't that how you save money - by picking the cheaper product instead of the more expensive one?
We save by buying hardware and renting DC space.
If we were using AWS we would not save.
Predictions are hazy; hindsight is 20/20. There's a reason there are dozens of companies dedicated to cutting cloud costs: surprise, most cloud setups aren't configured well. Estimates like these probably severely overestimate compute needs and don't factor in things like autoscaling.
You still need a team to manage the stuff on AWS. Probably the team needs to be even bigger.
If you’re a business leader, you‘re probably just blindly following what your peer business leaders are doing, that’s why you’re on AWS (in most cases at least)
Also, the comparison used 3-year reserved instances (if I read correctly) but didn't have a pricing plan (the part of the contract with AWS that specifies your discount). $400M->$300M if you know how to negotiate (I'm an engineer but to save $5M I learned how to negotiate. It's unpleasant but effective).
What I'm struggling with is their estimate. I work for an enormous enterprise that runs tons of stuff on the cloud and our budget is less than a quarter of their AWS estimate. We avoid products like EBS unless they are necessary, and use RDS whenever possible.
IME, business "leaders" follow the crowd even more than venture capitalists. If AWS gets frequent glowing praise in CFO Magazine, then by golly, we're going to move to the cloud and "save a bundle". Make enough similar decisions without actually being able to properly cost estimate the intangibles of what else is lost when those couple dozen experienced systems engineers move on, and you end up being valued at one-tenth of what you used to be within a few years (as is the enterprise where I was such an engineer).
They haven’t included an amortized cost breakdown with depreciation of hardware, including backups. Or the geographically local personnel required (data center remote hands, or a dedicated employee who gets the call?). Even so, I suspect it’s probably still cheaper.
Also it looks like these are retail prices. You can get cost savings by negotiating but I’m sure there’s an NDA required.
Your calculation is oversimplified. For example, hardware just doesn't run for five years without a glitch once you have multiple servers.
For sure the cloud is more expensive, as it charges a premium for not having to invest upfront in a lot of hardware; however, if that premium were 1000%, no one would be using the cloud.
Scaling from how often I have failures on my 80 servers, of which 50 are similar in spec to Ahrefs' (16 HDDs), it should still take less than one full-time position to manage 850 servers. Probably much less, as I'm a bit inefficient when it's <5% of my job.
I have a fixed day or two each year working out what to buy and getting quotes (which would be the same time whether it's 20 or 200 servers), half a day per 20 servers for installation in racks, and call it half a day per 20 for initial configuration (most of the time to script the process, so it would also be similar for 200). After that, it's at absolute most an hour every couple of months to deal with a failed disc or similar.
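Scaling that workflow to 850 servers gives a rough annual budget. This assumes 8-hour days, the per-20-server figures above, and (pessimistically) treats a full build-out as if it happened in a single year:

```python
servers = 850
batches_of_20 = servers / 20             # 42.5 batches

purchasing_hours = 2 * 8                 # a fixed day or two of quotes/ordering
racking_hours = batches_of_20 * 0.5 * 8  # half a day per 20 servers to rack
config_hours = batches_of_20 * 0.5 * 8   # half a day per 20, mostly scripting
failure_hours = 6                        # ~an hour every couple of months

total = purchasing_hours + racking_hours + config_hours + failure_hours
print(total)  # 362.0 hours/year
```

Even with the one-time racking and setup counted as recurring, that's well under a single full-time position (~2000 hours/year).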