Ask HN: Is your company sticking to on-premise servers? Why?
779 points by aspyct on May 6, 2020 | 755 comments
I've been managing servers for quite some time now. At home, on prem, in the cloud...

The more I learn, the more I believe cloud is the only competitive solution today, even for sensitive industries like banking or medical.

I honestly fail to see any good reason not to use the cloud anymore, at least for business. Cost-wise, security-wise, whatever-wise.

What's a good reason to stick to on-prem today for new projects? To be clear, this is not some troll question. I'm curious: am I missing something?




Like many others have pointed out: Cost.

I'm the CTO of a moderately sized gaming community, Hypixel Minecraft, which operates about 700 rented dedicated machines to service 70k-100k concurrent players. We push about 4PB/mo in egress bandwidth, something along the lines of 32 Gbps at the 95th percentile. The big cloud providers have repeatedly quoted us an order of magnitude more than our entire fleet's cost... JUST in bandwidth. Even if we bring our own ISPs and cross-connect to use only the cloud's compute capacity, they still charge stupidly high rates to egress to our carriers.

Even if bandwidth were completely free, at any timescale above 1-2 years, purchasing your own hardware, lease-to-own (LTO) arrangements, or even just renting will be cheaper.

Cloud is great if your workload is variable and erratic and you're unable to reasonably commit to year+ terms, or if your team is so small that you don't have the resources to manage infrastructure yourself. But at a team size of >10, your sysadmins running on bare metal will pay their own salaries in cloud savings.
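
For a rough sense of those numbers, here's a back-of-envelope sketch (the per-GB egress rate is an assumption based on public top-tier list pricing, not one of the quotes we received):

    # Back-of-envelope: 4 PB/mo of egress at an assumed $0.05/GB volume tier.
    GB_PER_PB = 1024 ** 2
    egress_gb = 4 * GB_PER_PB                       # ~4.2M GB per month
    assumed_rate = 0.05                             # $/GB, assumed, not a quote
    print(f"~${egress_gb * assumed_rate:,.0f}/mo")  # ~$210,000/mo, bandwidth alone

    # Average throughput implied by 4 PB/mo:
    seconds_per_month = 30 * 24 * 3600
    avg_gbps = egress_gb * 8 / seconds_per_month    # GB -> gigabits over a month
    print(f"~{avg_gbps:.0f} Gbps average")          # ~13 Gbps; a 32 Gbps p95 means bursty traffic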


A few years ago I was trying to start a company and get it off the ground. We had to make decisions on our tech stack and whether we were going to use AWS and build around their infra. Our business was very data heavy and required transferring large datasets from outside into our databases. Even in our early prototypes, we realized that we couldn't scale cost-effectively on AWS. I figured out that we could colocate and rent racks, install HW, hire people to maintain it, etc., for way less than the cloud would cost. I was shocked at the difference. I remember saying to my cofounder why does anyone use AWS, you can do this on your own way cheaper.

Later I worked at a FAANG, and I remember when Snap filed their S1 to go public: they disclosed that they were paying Google $5B, and we were totally shocked at the cost compared to our own spend on significantly larger infra.

I think people don’t realize this is doable and it’s great to hear stories like yours showing the possibilities.


Disclosure: I work on Google Cloud.

> paying Google $5B

You were off by 10x :). The annual commitment was $400M/yr on average. Snap’s S1 [1] said:

> We have committed to spend $2 billion with Google Cloud over the next five years

[1] https://www.sec.gov/Archives/edgar/data/1564408/000119312517...


> I remember saying to my cofounder why does anyone use AWS, you can do this on your own way cheaper.

I agree with everything you say - I'm convinced that a huge part of the cloud's financial success is due to how it allows CTOs/CIOs to indulge their fantasies about having a mega-scalable app - even if their workloads are very regular and predictable. Along the lines of buying an expensive sports car but never driving it fast, you're just paying for the kudos it brings you in the eyes of other people.

Having said that, we are happily using the cloud for our small app because it makes no sense to build out our own infrastructure for a single VPS and database.


> I agree with everything you say - I'm convinced that a huge part of the cloud's financial success is due to how it allows CTOs/CIOs to indulge their fantasies about having a mega-scalable app - even if their workloads are very regular and predictable.

It's a sane choice given incentives, too. Cloud bullshit features prominently on an awful lot of technical job postings these days.

But yeah I've definitely seen some heavy, expensive cloud setups that could have run on a toaster. Smaller-scale B2B stuff seems especially prone to this—like, what's your max reasonable traffic? If you took off like crazy? What's the size of your market? Come on. Throw in really half-assed and inconsistent use of automation tools and lots of relying on shitty cloud web dashboards and hope, and often the toaster (or a couple low-end co-located servers, more seriously) would be easier and safer to manage, too.


From a developer perspective, I prefer to work with cloud providers because I can do more with fewer developers, since I'm not dealing with sysadmins too. I can throw up an EMR cluster with minimal effort and get a working product out the door quickly.

These things don't matter for large, established companies because they already have DevOps, SysAdmin, and Development teams. But for smaller dev shops, it absolutely makes a difference when you can generate a good bit more efficiency from your development staff.
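
For illustration, a minimal boto3 sketch of that kind of spin-up; region, sizing, release label, and the log bucket are placeholder assumptions, not recommendations:

    import boto3

    # Sketch: a small Spark-on-EMR cluster. All values are illustrative.
    emr = boto3.client("emr", region_name="us-east-1")
    resp = emr.run_job_flow(
        Name="adhoc-spark",
        ReleaseLabel="emr-5.30.0",
        Applications=[{"Name": "Spark"}],
        Instances={
            "MasterInstanceType": "m5.xlarge",
            "SlaveInstanceType": "m5.xlarge",
            "InstanceCount": 3,
            "KeepJobFlowAliveWhenNoSteps": True,
        },
        JobFlowRole="EMR_EC2_DefaultRole",  # from `aws emr create-default-roles`
        ServiceRole="EMR_DefaultRole",
        LogUri="s3://my-logs-bucket/emr/",  # hypothetical bucket
    )
    print(resp["JobFlowId"])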


That may be the case for a startup. But in the average corporate IT environment the on-prem server infrastructure and procedural/political jungle is so terrible that you just have to get away from it. Before "cloud" became fashionable it was politically hard or impossible to host your project somewhere outside the bad corporate infrastructure, but now there's a way to present the choice in a palatable way.


I will chip in here. Although I agree with all that you guys have said, your comments seem to be skewed to a perspective that all companies are being run in the U.S.A., Canada, Germany, etc. - very developed countries where power is constant and can be relied on, where the skilled labor to run these servers is available for hire, and where the parts and infrastructure are available to rent or buy. The majority of the world is not like this, but there are formidable companies running all over the world, and for them, this is one less unreliable aspect of their company.

To give you an idea: if you want to run a data center or your own servers in some countries, you need a standby generator (because electricity is not a given), and the diesel used to run these generators is imported. The economy of these countries is shaky, so the exchange rate fluctuates, and suddenly the cost to keep your site up becomes a variable, subject to government announcements (not even in an evil authoritarian way), policies, and import taxes. In the face of all of this, having a steady AWS bill with reliable infrastructure becomes priceless to these companies.


You wouldn't run your own datacenter, you would colocate. Even if that's not possible, you can still rent servers for significantly cheaper than cloud offerings.


Wow... really a first world perspective here.


Then you co-locate to the first world. Or somewhere that can get a data center built with resources that you trust.

I ran data centers for a living in Northern VA and had all sorts of international clients: Egyptian schools who rented servers, Brazilian Protestant ministries who shipped servers to us, etc. There were some decent data centers in Mumbai we had to get VPNs built for, and we had at least one legit client in Lagos, Nigeria.


3rd world perspective: they're not far off the mark.

You'd want to co-locate with an ISP who already has infrastructure for continuous service through blackouts et al., or you could run your own datacenter if it's small-ish, because you already have some infrastructure to keep operating during blackouts.


I believe you are saying that some countries don't have data centers where you can co-locate. Most of the places where AWS has a datacenter have other datacenter companies that offer co-location.


I think you'll struggle to find any place without a few datacenters nearby, at least in a neighboring country. Of course it's going to be a bit further away if you're in central Africa.

Even then, you can colocate in a datacenter anywhere you want, have equipment delivered there and pay remote hands to install it for you for a very reasonable fee.

Of course this doesn't make sense if you just want a small webserver, but that's not who we are talking about here.


If you can rent an AWS server, you can also rent a Hetzner server, or put some servers in a Hetzner DC.



I support your comment.


You nailed it. The infrastructure in developed countries is taken for granted by its consumers. Websites are often not tuned for low-speed internet, or for people across the ocean, because everything works so smoothly in developed countries. Also, a server that serves the US and Europe, if you live in a remote area, obviously cannot be on-premise; it has to be in the cloud, served in those countries, for latency reasons.

And you cannot have a website that doesn't cater to US and EU users, unless the website only solves local problems.


If the website caters to EU customers, just rent from an EU datacenter; no need for a cloud. A Hetzner server costs $100/month, while an equivalent amount of cloud resources will run you into the thousands.


And how much do Hetzner's NoSQL/relational DB, EMR solution, FaaS, etc. cost? Solutions aren't built the traditional way anymore. You don't go get some COTS server and write everything yourself. So while you're futzing around getting a database installed, tuned, and set up, the engineer using a more complete cloud platform has already moved on to business logic.


The person above wrote that she can't use a local server to cater to US and EU customers. I pointed out that you don't need the cloud to solve that problem.


Hetzner is a cloud, as per their own literature, but my point is that if you buy Hetzner you're not getting a complete platform; you're buying into a platform, just less of one. Hetzner also advertises themselves as thrifty, which doesn't exactly instill confidence for critical infrastructure. My point remains, though: buying a VPS is equivalent to an EC2 instance, and this is not how things are built these days. Also note the poster's name is WilliamEdwards. I know if you called me a she (I'm a he) it would be very offensive to me. It's best to use gender-neutral language, or to pay attention to sex/identity where possible (not really here) and use the correct pronoun.


VPS is in no way equivalent to an EC2


Sure it is, for almost every use case. For edge cases, it’s not exactly equivalent sure. If you just need a place to run code then it most definitely is.


Even so, having, say, a bare metal server (à la Vultr) can be way cheaper than a cloud setup, even at one of the most flexible kinds of providers.

The detail to look at is that flexibility is a feature of the cloud offering, and an expensive one at that. If you don't need it, you need to find a way to not pay for it.


Dropbox did the same thing a few years back - moved everything from Amazon S3 to their own storage.

My guess is they did it for cost reasons.


They did, but at Dropbox's size it should be pretty obvious. I mean, their whole business model is essentially acting as a cloud storage provider. Once they got big enough, of course it made sense for them to optimize their infrastructure instead of letting another company take x% off the top.


> Once they got big enough, of course it made sense for them to optimize their infrastructure

Isn't that true in all cases?

There is no doubt that rolling and maintaining your own infrastructure can be and is better than dumping cash on the AWSs of this world. The only question is what size marks the breakeven point.


I think it is actually only very true in a small number of cases. First, you need to be big enough where the cost of the people needed to support your infrastructure can be amortized over a (very) large number of machines. Second, Amazon, Microsoft and Google pay some of the best salaries around, so they have some of the best infrastructure people working for them.

Looking at the comments here, I think it's clear that there are a relatively small number of use cases where roll-your-own is a better idea, primarily where you have a huge number of servers basically all doing the same thing with lots of data transfer (which is comparatively expensive in the cloud). This may be cheaper to manage with a small team of people if you're essentially cloning similar setups.


That S3 is eventually consistent with object updates (HTTP PUT) might also screw things up for a company whose core value is synchronized storage.


They might have implemented a metadata store one layer before S3 to guarantee read-after-write consistency, since writes to new objects are consistent in S3. Only updates are eventually consistent.


I don't mean to sound daft, just clarifying my own understanding, but isn't Dropbox eventually consistent (as a system)?


Oh, sure, but when they think they have written something to S3 and got a successful HTTP response back from the API, perhaps they want to be able to tell clients to go fetch the new data from the bucket. But those clients may not get the new data then, due to eventual consistency.


S3 is immediately consistent for new objects unless the service received a GET on the object before it was created. It's easy to use this to make an immediately consistent system.


S3 ListObjects calls are eventually consistent (i.e. list-after-put). EMRFS [1] and S3Guard [2] mitigate this for data processing use cases.

[1] - https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-f... [2] - https://blog.cloudera.com/introducing-s3guard-s3-consistency...


Yes. PUTs of new objects are immediately consistent, while PUTs that update existing objects are eventually consistent.

If you want to use this to create new objects all the time, rather than update ones you already have, you now have to keep track of which objects in your bucket are "old" and should no longer be there. But yes, totally doable.
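
A sketch of that pattern with boto3 (the bucket name is hypothetical, and this reflects the S3 consistency model under discussion here):

    import uuid
    import boto3

    s3 = boto3.client("s3")
    BUCKET = "my-sync-bucket"  # hypothetical

    def write_immutable(data: bytes) -> str:
        # Never overwrite: a PUT to a brand-new key gets read-after-write
        # consistency, so a client handed this exact key sees the data at once.
        key = f"blobs/{uuid.uuid4()}"
        s3.put_object(Bucket=BUCKET, Key=key, Body=data)
        return key  # record the key in a strongly consistent metadata store

    # The trade-off described above: superseded keys pile up, so something
    # (a lifecycle rule, a reference-counting sweep) has to delete them.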


Since the move off S3 was only 'a few years ago', my guess is this was not the reason for them to move away, since the service at Dropbox is pretty much the same.


I suspect that is behind many problems with Google Drive.

Sometimes old versions seem to take precedence over newer ones, for some reason.


I expect many companies do that. They prioritize growth over costs (stock price over profitability).

Then they get to a certain inflection point on the growth curve, and splurging on unpredictable capacity is slowly replaced with reasonable costs for the known capacity.


How many other companies do you know that have the same problem as Dropbox?

An average company does not provide IT infrastructure (like storage, in this case) to a massive number of clients.


You’re not paying Google $5B for raw infra, you’re paying for cloud services like top-tier horizontally scalable databases, global availability, CDNs, datastores of different flavors, and transparently managed monitoring and hardware fault resolution.


Yes, you’re paying for a ton of stuff there that you probably don’t use, and then you are susceptible to bugs that have nothing to do with your use case. At $5B it would not cost anywhere near that much to replicate. Your infra would be better tailored to your workloads and your SW teams, which would further drive costs down. The upside is that you only build the features you need and keep things simple. The downside is that you need a big org to do this. It’s a long-term play.


Snap was spending 2B over 5 years - so 400M/yr. By comparison, Uber, who operates under managed colo, spends nearly 200M/yr alone on real estate for their datacenters. Who knows how much they are paying in engineering salaries to manage those datacenters.

Personally I don't think there's a one size fits all solution, you will have to do the math (like I'm sure Snap, Netflix and others have done) to see if cloud is worth it. However, I agree, for most teams the default should be cloud.


> Uber, who operates under managed colo, spends nearly 200M/yr alone on real estate for their datacenters

200 million a year, huh.

The cost of commercial office space in the U.S. can range from $6 per square foot in low cost regions to over $12 per square foot in New York City. On average, a 50-cabinet data center will occupy about 1,700 square feet. At a median cost of $8 per square foot, the space alone would cost about $13,600 per month. [1]

Are Uber renting on the order of two million square feet of data centre? Do they have sixty thousand cabinets of hardware?

If they do, I would absolutely love to see a quote for how much it would cost them to run in the cloud.

I think it's far more likely that number is bullshit.

[1] https://npifinancial.com/blog/data-center-pricing/
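
Working that backwards from the rates in [1]:

    # What $200M/yr of rent would imply at the quoted rates.
    annual_spend = 200e6
    rate_sqft_month = 8                        # median $/sqft/month, per [1]
    sqft = annual_spend / (rate_sqft_month * 12)
    print(f"{sqft:,.0f} sqft")                 # ~2.1 million square feet

    sqft_per_cabinet = 1700 / 50               # ~34 sqft per cabinet, per [1]
    print(f"{sqft / sqft_per_cabinet:,.0f} cabinets")  # ~61,000 cabinets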


>Office and data center rent expense was $194 million and $221 million for the years ended December 31, 2017 and 2018, respectively.

https://s23.q4cdn.com/407969754/files/doc_financials/2019/ar...

Page 124


So that number includes (and is almost certainly dominated by) the cost of their office space.


Unless that's capex for building, or includes server capex that they depreciate, that seems way high given Uber's scale.

A pretty dense cabinet should only cost ~$1400/mo at wholesale (1MW+ rooms) rates, and $200M is 143,000 cabinets.

And it's public that Uber uses multiple clouds as well.

Disclaimer: I haven't reviewed public Uber filings, would be very interested if there's any data that indicates they're really spending $200M on opex for real estate (which would be equivalent to $400M/year on cloud, which is either opex or potentially mix if there is some reserved instance-type cloud spend).


The 42U rack might be only $1400 per month but the servers to put inside are $5000 upfront per U.


42U times $5000 divided by 3 years is still only $5833 per month per rack. Add the $1400 and we have a total amortized rack cost of $7233 per month. (naively assuming you could fill every spot in the rack with servers).

But servers can't be included in a "Uber spends nearly 200M/yr alone on real estate for their datacenters" figure anyway.
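
Spelling that arithmetic out (naively assuming 1U servers in every slot and 3-year straight-line amortization):

    servers_per_rack = 42     # every U filled with a 1U server
    server_capex = 5000       # $ per server, figure from the parent comment
    months = 36               # 3-year straight-line amortization
    servers_monthly = servers_per_rack * server_capex / months  # ~$5,833
    rack_rent = 1400          # $/month, wholesale rate from upthread
    print(round(servers_monthly + rack_rent))  # ~$7,233/month per rack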


The rack is $2800 per month, not $1400. The lowest advertised pricing does not include enough power to supply half of what's in the rack.

Add $200 per server per month to have a gigabit uplink.

Add $4000 per server per lifetime for VMware licenses.

These are reasonable estimates, of course; it could be multiples of that depending on the hardware and the colo. Physical hardware easily gets as expensive as any cloud.


Not easily; you're making the argument that colocating can be expensive, but you're talking about a -lot- of computational power.

I know that hardware varies a lot for sure, but for context: I put together 3 racks at 50% density with an 800 Gbit backplane for around 18k EUR/mo.

I spared no expense, official juniper QSFPs (which are egregiously overpriced) and top of the line Dell servers with full out of band licenses.

And once there: interconnected bandwidth and IOPs became “free” (or, no extra charge).

We put the same application in the cloud, and it costs us 40k EUR/mo even with a heavy amount of optimisation: half-sized instances and aggressive optimisation of bandwidth/IOPS.

The cloud's "sticker price" to us is 4x that of physical. You can buy a lot of human time for that price.


Physical infrastructure is always overprovisioned because it has to be planned long in advance and the smallest unit is huge (8 cores? 128 GB of RAM? for a small server).

If anything, that's an argument against physical infra, not in favor of it. Although it's fine if one wants a lot of the same big servers (a Hadoop cluster, or a video computing cluster, or a CDN), which are the few use cases where physical can make sense (and hybrid cloud probably makes even more sense).

With AWS, you'd never spin half of that infra upfront. You'd spin a few VMs and start running stuff. If the project goes well, spin more or bigger VMs, otherwise spin down. Cost is very dynamic and the company doesn't have to spend half a million upfront, which is a big financial problem for many companies.

Anyway. We can both agree on AWS being overpriced. The reference on costs should be Google Cloud if one is cost conscious, not AWS. Google Cloud is often half the costs of AWS for the same thing.

AWS vs Google comparison, bit old and instance types have changed but relative pricing has not moved https://thehftguy.com/2018/11/13/2018-cloud-pricing-aws-now-...

And this one more recent since they've released high memory instances to compete with SoftLayer: https://thehftguy.com/2018/11/13/2018-cloud-pricing-aws-now-...


My costings are based on GCP; with sublime discounts even.

And, I would really agree with you if not for 2 things:

1) "wasted" cycles are not wasted, the CPU will clock down.

2) Kubernetes was designed specifically for this; the idea is you slice the CPU up so much that you don't waste many resources.

It's astonishingly capable of consuming all resources.

By the by; RAM is never "wasted", it's used for various caches.


As the others say in the comments, understand completely re: servers. And network. And some management overhead. I will write a blog post laying out our COGS at Kentik including all these factors, hope it'll wind up helpful.


> A pretty dense cabinet should only cost ~$1400/mo at wholesale (1MW+ rooms) rates, and $200M is 143,000 cabinets.

The 200M is for a year - at $1400/mo, that's 11905 cabinets.


400M/yr was just Snap's Google Cloud spend; they also signed a $1B deal with AWS for redundancy.


disclaimer: former uber engineer

we ran the numbers continually and like others mentioned, it was a no-brainer to build on-prem. that said, there are a few use cases where aws/gcp are a great fit. people selling cloud without putting in real engineering work to enterprises generally make my life harder than it already is.


Exactly. Calculating TCO for on-prem vs cloud is tricky. Any time we have done it, cloud came out the winner, though I also found exceptions: predictable, static workloads requiring a huge amount of bandwidth, mostly outgoing. An average company that needs some CPU and storage for various unpredictable workloads benefits greatly from cloud services.


Well those numbers are... weird

Also Uber seems to be much less computing/BW demanding than Snap


@nemothekid Can you provide a source for your claim on Uber's real estate costs for their data centers? I couldn't find anything on Google that corroborates your assertion.


Uber 10-K has operating lease payments in 2019 of $196M and in 2020 of $216M. That's not all data centers but a lot of it is data centers.


Why would you assume that? SF office space goes for way more than data center space.


You can look up the prices of Uber's current office space - they currently pay 16M/yr for their HQ.


If I use S3 + EMR what exactly falls into the category of "a ton of stuff there that you probably don’t use"?

>> Your infra would be better tailored to your workloads

How do you scale elastically with an on-prem infra? What part do you tune more to your own workload when an average company cannot even tune the GC for Java for their own workload?

Your claims do not reflect reality; they are rather your imagination. I have migrated countless companies to the cloud, and almost every single migration was driven by a single factor: cost. Everything else was an added bonus: increased security, availability, and elasticity.


> Yes, you’re paying for a ton of stuff there that you probably don’t use

This is the antithesis of any cloud. You only pay for what you use.

> At $5B it would not cost anywhere near that much to replicate.

If you can recreate GCP for $5B in capex, there are likely some VCs lurking here who would like a word with you.


Sorry but that’s not correct. Have you ever looked at the economics of cloud compute at that scale? You absolutely are paying for a lot of tooling that’s irrelevant to your use case. This is because GCP has to be a general provider. They have to offer the broadest possible solution to capture as much share as possible. This means there are money losing services (also low margin) that are offered to provide a complete end-to-end solution. However, the fact that those services exist means that the cost has to be made up for elsewhere. These economics are not some complicated rocket science and exist in many other types of businesses.


Like cable TV.


That's not a perfect analogy but it's easy enough for laymen to understand so I'm stealing it.


IMO, the cloud providers' main advantage is that they will professionally manage all of the underlying hardware. Optionally, they can also manage the low level pieces of software also (e.g. databases, file store, etc.). It's a safe bet that GCP, AWS, Azure et al. can manage/architect their data center much better than the vast majority of companies.


The last point is something that “we”, as an industry, can often be blind to.

If your business is something that is fundamentally not “pushing bytes”, you really, really don’t want to know about routers, firewalls, the OSI layer, SHA256, RAID arrays, all the way to this week’s JS framework. All of that is a big annoyance, and paying AWS to “take care of it” makes sense, even if it comes at a higher price: more often than not, the difference wouldn’t be offset by the time, effort, and risk exposure that you would have to allocate when building your own.

This calculation is different if your business is primarily digital. The gentleman upthread making a game, for example, is perfectly right: his company naturally developed a culture that can evaluate and manage every aspect of its digital operations, because it’s part of its core business, so it makes sense to put that knowledge to good use and save money.


But if you’re incompetent to that degree, AWS isn’t for you because you can’t even write software to run on it. You should be using a fully managed SaaS at that point.


There are three things that make sense to me.

Firstly, if you are big enough, you can manage/architect your data centre better than a cloud provider. You can afford to hire staff to do it, and they can build something specific to your needs. Companies like this should be on-prem.

(It doesn't matter if the company is "digital" or not. It could be a huge retail chain, or a government agency, whatever. The only requirement is to be big enough that you have big needs and can afford a big spend on staff.)

Secondly, if you are small enough, you really, really don’t want to know about routers, firewalls, the OSI layer, SHA256, RAID arrays, etc. Dealing with that would mean another couple of full-time employees, and you can't afford that. Companies like this should be on a SaaS (ideally one less full of gotchas than Heroku).

Thirdly, if you are in between, you have the capacity to deal with system administration, there are some advantages to being able to shape your infrastructure to your needs, but you really don't want to get into real estate and millions of dollars of CAPEX. Companies like this should be on rented physical hardware.

What doesn't make sense, at any scale, is renting VMs.


> Firstly, if you are big enough, you can manage/architect your data centre better than a cloud provider. You can afford to hire staff to do it, and they can build something specific to your needs. Companies like this should be on-prem.

It's interesting that competent engineers are only a question of cost to you. Where I am (in the north of Europe) it's really hard to find good engineers. Even though it doesn't say much about competence, it's not mandatory for people in the IT industry to have a CS degree, even if they're managing cloud infrastructure worth millions every year.

I've been working in the industry at different companies for a bit over 10 years, and I would say that the huge majority are average people who know basic concepts about infrastructure but wouldn't be able to design and implement any of it on their own. This includes network infrastructure at the biggest ISPs in the country.

I work as a system engineer, and the teams I've been working with over the last 5 years, independent of company, have been looking for engineers constantly. They pay well and have great benefits, but there are just not enough people out there to take these jobs and manage a full on-prem infrastructure.

So part of the money you're talking about, which should be used to hire all these really competent people, is instead used to pay for cloud services where we know that professionals with far more hardware (and software) knowledge are managing the infrastructure. And this is really convenient for a lot of companies.


> They pay well and have great benefits, but there are just not enough people out there to take these jobs and manage a full on-prem infrastructure.

I’m from a small province in Canada and we have similar problems here sometimes. One of the questions I sometimes have to ask clients when they make a statement like that is: “do you pay well for the area? Or do you pay well enough to attract talent from out-of-province?”

This often leads to them arguing that “but the cost of living is low here!” And inevitably I have to mention my friends who left the province to go to the Bay Area for 10 years and came back with $500k USD in stocks they collected, on top of what they had left over after paying their high cost-of-living rent.

I feel your pain. Companies running in less desirable areas have to somehow pay a premium for top tier talent. One way, as you mention, is to outsource, whether to cloud providers or part time contractors.

This isn’t meant as a brag at all, but I do the independent contractor thing here and make “pretty darned good for the area” money. Inevitably clients will ask me to come work full-time for them, and we have a painful conversation where I tell them what my taxable income was the year before, as well as the investments I made into equipment and licenses I use to provide the services I provide. The result so far has always been to continue being happy with me as a part-time contractor!

Edit: sorry for the typos, written on mobile with the swipe text thing


if it's ok to poke, can I ask what type of equipment and licenses? Is it like a home tech stack to test new infra setups on?

P.S. Hope you're staying safe from COVID.


How much do you think a competent System engineer should be paid?


I think that's way too hard to answer since it depends on so many factors. But again, my experience isn't that the pay is too low. It's not like there are 10 people coming to the interview with no one taking the offer because of the salary. It's more like no one has applied for the job in three months and the external recruiters sniping LinkedIn didn't find anyone.

But to give some kind of estimate, just so you know what page I'm on when I say they pay well: somewhere between $60-75k.


So I checked on Glassdoor, which is known to lowball tech wages, for Washington D.C., so it isn't NY or Silicon Valley/Forest. The median wage for a systems engineer there is $182K USD. At the current conversion rate that is 167.7K EUR. Unfortunately, that's your problem.


too low.


> What doesn't make sense, at any scale, is renting VMs.

I disagree. The big cloud providers will let you rent VMs across multiple availability zones in a single geographic region. That gives you better redundancy than you could get by renting one or more dedicated servers in a single data center. Yes, those same cloud providers offer bare-metal instances, but those are absurdly expensive for a company that only needs small-scale computing power but still cares about uptime.
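
A minimal boto3 sketch of that kind of redundancy; the AMI and subnet IDs are placeholders, with each subnet assumed to live in a different availability zone:

    import boto3

    ec2 = boto3.client("ec2", region_name="us-east-1")
    # Hypothetical subnets, one per AZ (e.g. us-east-1a and us-east-1b).
    subnets = ["subnet-aaaa1111", "subnet-bbbb2222"]

    for subnet_id in subnets:
        ec2.run_instances(
            ImageId="ami-12345678",  # placeholder AMI
            InstanceType="t3.small",
            MinCount=1,
            MaxCount=1,
            SubnetId=subnet_id,
        )
    # Put the pair behind a load balancer and losing one AZ doesn't take the
    # service down -- hard to match with one dedicated box in one datacenter.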


If you're small, you can get that redundancy on a PaaS. If you're big enough to move off a PaaS, you can rent physical machines in multiple datacentres.


Size is not the only thing that dictates whether one should use a PaaS. It's entirely possible for a tiny generalist team, or even a single developer, to develop an application whose requirements prevent it from running on a PaaS. But that doesn't mean the team or solo developer should have to handle all the hassles of operating physical servers. (Source: I was that solo developer for many years.)


Do you mean PaaS, not SaaS? What other good PaaS exist besides Heroku?


I did mean PaaS, not SaaS! Apologies, and thank you!

I wish I had a good answer about PaaS. There are hosted Cloud Foundry services. I think CF is more sensible than Heroku, myself. The various serverless platforms are PaaS of a specific kind. I think Salesforce counts. Is OpenShift still a thing?


I'm not super familiar with Heroku and similar platforms, but if I don't rent EC2s (which I take to mean what you meant by "VMs"), how do I host and serve bespoke backend components?

Even if all it does is suck data in, perform some computation, and send it off somewhere else - are you saying there are SaaS providers that are better equipped for this?


In many cases yes, but there are shades of grey. Some systems cannot be shared, for whatever reason.


The cloud providers' main advantage when you're at unicorn scale is that their engineering team acts as an extension of yours. We regularly hop on calls with AWS to scope out features for them to build for us. Our parent org can pretty much snap their fingers and Google will dispatch their engineers to fix/build whatever they want.

That and very preferential pricing.


And a lot of people don't need that. There are a lot of applications that are highly cost sensitive but where losing some data or having some unavailability isn't a deal breaker.


Avoiding cloud works well for small requirements too - I worked for a company that ran a global store; we opted for dedicated servers on OVH, about 40 of them altogether. It was 1/10th the price of running the infrastructure on AWS. We still used S3 though.


Yeah, I'm seeing this too.

I recently talked with someone who said their costs dropped in half from switching off a major cloud provider to OVH's dedicated servers. Performance went way up too.

For $200 / month you can get a machine with a high end Xeon 8 core (16 threads) CPU, 128 GB of memory, 4.5 TB of SSD space across multiple disks with unlimited incoming / outgoing traffic and anti-DDoS protection.

No reputable cloud provider is going to come close to that price with their listed prices.

Of course there are trade-offs, like not being able to click a button and get it within 2 minutes for most server types, but if you have consistent and predictable traffic, going with an over-provisioned dedicated server seems like a very viable option.


I've worked with several startups and my experience is the exact opposite of what you said. Due to economies of scale, it is virtually impossible to beat cloud architecture. Ignore the fact that you can get started for free or next to free with cloud. Even as you grow, it is difficult to do it cheaper.


Did your business get off the ground?


> why does anyone use AWS, you can do this on your own way cheaper.

Because they can calculate TCO correctly.


Is it just me or is Total Cost of Ownership being talked about way less often these days?


It's easy to game. Source: was a sales engineer for ISPs and Data Centers.

So assume you're doing a 3-year or 5-year TCO -- you build in credits, discounts, and bundled options. Execs are looking for a 3-year apples-to-apples spend, and you make your TCO look fucking amazing. They see the low price and decent technical options and they bite.

3 years later those credits vanish and they're paying full OpEx costs. And after 3 years they're now invested -- stuck -- in their space/circuit/whatever. You can start raising the price or negotiating new contracts.

Same thing with the cloud, for that matter. Cut a glorious bulk deal with Microsoft for Azure space, and then after you've moved everything to MS they can start nickel-and-diming you -- cuz the cost of moving that load to AWS or GCP isn't cheap, and you're not going out and buying more hardware and going back to CoLo, are you?


Exactly that. A lot of people do not understand this and fall into the trap of those credits, "thousands of dollars worth", only to be scalped by their vendor-locked costs after the credits vanish.

After all, the people who make the decision on the customer side are in the same trap: in 3-5 years, if not earlier, they won't be at the company anymore and it won't be their problem.


Thanks for this perspective, I guess the gameability of the metrics explains a lot.


Yes, and also there is a ton of hand-wavy bullcrap floating around HN, without any in-depth analysis, numbers, or apples-to-apples comparisons. AWS and Azure would not exist if moving to the cloud were a bad move. There are so many companies whose primary business is not building data centers, yet they require computational resources. These are the primary users of cloud computing. Based on my experience, most financial companies fall into this category, including banks. Other industries I have experience with include travel, gaming, pharmaceuticals, and logistics. Microsoft is doing a great job of winning the most enterprise customers, while AWS has stronger offerings.

The biggest winners in cloud migrations are the companies that can start to auto-scale where that was previously impossible, because the on-prem datacenter had no such features - and even if it had, they could not sell the extra capacity to other companies the way AWS can. Other cost optimization opportunities include trying the workload on different node types and finding the best fit - also not an option in most on-prem DCs. S3 itself solves problems that are otherwise very hard. For example: you have different security zones and you need to copy data around your DC. That becomes unnecessary with S3 - just give different users different access levels and you no longer need to copy the data around. This was one of the big selling points for one of our customers.

I could go on and on. In the last 5 years, we have saved several millions of EUR for our clients and made businesses possible that were impossible using on-prem resources. But some on HN know better and argue that all of these companies running on AWS are morons who should have built their own DCs because it is cheaper, while AWS became a ~$10 billion business unit. There is some irony in this, I guess.


Gaming company senior reliability engineer/officer here, with the same tune. We're operating thousands of servers on three continents (NA, Europe, Asia), running our own game distribution network and basically doing as much as possible on prem. Every time we try to play by the cloud providers' rules, we get stung with outrageous storage and bandwidth bills. I'm pushing for an in-house k8s deployment to increase hardware utilisation and drive our price/performance ratio even lower, but even now we're much better off financially purchasing hardware, putting it in leased racks, doing the networking ourselves and going full cycle. Probably if we were a smaller shop we'd outsource the networking. That's it.


Are you looking at a specific vendor for your on prem k8s?


Not sure why I was downvoted here, I'm in the same boat looking for a solution, not trying to sell something.


Same here - CTO of medium sized company.

Our IT infra costs are 1/10th the cost of cloud, simply because I happen to be comfortable having on-premise machines and working on them (sometimes myself).

We have two dozen servers in two locations. It's more time to set up, but maintenance is actually quite low.


One factor for small companies that I have noticed at the places where I have been (as a dev consultant): the places that use the cloud often scale up too much. Since the cloud doesn't limit developers, it's very easy to just "spin up a new xx" and not think about the long-term costs.

Unless you are really small or have variable workloads, the cloud is maybe not for you - unless the cost is a small part of the total cost of the platform, i.e. not really related to how many users/sales you have.


I've noticed small companies are more worried about the actual bills coming in, whereas big companies are more worried about the perceived lack of speed/agility in their development.

We could spend time making sure we use the smallest aurora instances possible in AWS or we could standardize on a version to support and use the smallest size of that version that's available. We spend a lot on this extra capacity in some places, but it greatly increases our speed in other places.

It's all about trade-offs and what your organization values.


I see this a lot in my industry (health).

More often than not, my fellow devs are happy throwing extra servers at a problem instead of tackling the problem :(


Agreed. And it sucks to have to do this, but replacements can be deferred. During a global recession, they may have to be. Servers will eventually degrade (RAM goes bad, SSDs fail, spinning rust rusts), but arguably that's better than going poof in the cloud.

I'd rather not get put in that position. Pre-pandemic, business was pretty good, but that seems like an unwise assumption as we slide into a recession.

Maintenance burden that's avoided has two inflection points rather than one. In a large enough data center, there's enough work simply replacing hard drives that justifies having a team that manages it. From the other end, that's just not true if there's a small number of hard drives (like O(1)). There's a middle area though, where there's enough work for more people but the overhead of there being a team is too expensive.


Is this the all-in cost? Including lease on land/building, staff, electricity, etc.?


Yes. Electricity is cheap and it's housed in an existing office(s).

The real cost is IT talent and time. I happen to have homelab experience with HP servers, so (for us) those costs are extremely low.

It's honestly not as complex or costly as cloud providers make it sound.

Offsite replication is easy via a site-to-site VPN and Veeam, as is spinning up new VMs for dev or testing. The building already has good A/C and backup power, and hardware/drive failures are rare.


It's likely colocation; getting to the size of physically building your own datacenters is another step up.


24 servers, interesting. That seems like a small enough setup that I would definitely have left it in the cloud. Much more than that and I think it makes sense to start moving to physical servers. But at 24, I would have guessed it would be cheaper to maintain in EC2.

Are you set up with one or two racks in each location and maybe one full-time IT/sysadmin at each location?


24 dedicated dual-Xeon boxes can easily be comparable to 100 to 200 m5.2xlarge EC2 instances. So it's relatively small but not tiny; you're going to be paying at least $25k/month for that (incl. reserved instance discounts).

So if you use a lot of bandwidth on top of it (which can be another $25-50k at AWS), this could reach levels where it's worth it to hire two devops guys to run your own, in combination with some SLA/management agreements with a colo.
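
Rough math behind that figure (the hourly rate is an assumption from reserved-instance list pricing of the era):

    instances = 150          # midpoint of the 100-200 range above
    hourly = 0.24            # assumed ~1yr reserved m5.2xlarge rate, $/hr
    hours_per_month = 730
    print(round(instances * hourly * hours_per_month))  # ~$26,280/month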


I might need to re-look at this. That's much more efficient at an earlier stage than I was expecting. Thanks!


At my own startup we ran 17 colocated servers, each a high-compute resource, at a monthly expense of $600, using purchased hardware that cost $55K. That is versus the $96K per month the same infrastructure would cost at AWS.


wow, that is substantial.


> maintenance is actually quite low.

I do hope you regularly patch and reboot your systems.


I hope so too, but it's not like a fleet of fine-tuned EC2 instances running CentOS 6.5 is going to be patched and rebooted regularly without a similar amount of sysadmin attention.

(Not picking on CentOS, just using it as a placeholder for $REQUIRED_OS_BECAUSE_OF_CUSTOM_SOFTWARE.)


The main difference is in admin effort, even with the same amount of automation. It doesn't scale linearly because of economies of scale. With reasonable automation a relatively small team can handle a pretty large cloud fleet. For smaller on-prem fleets the scale-down might not be ideal, though.


Automatic unattended updates are totally possible (I haven’t tried it in CentOS but it works fine on Ubuntu). Even kernel updates can now be done without a reboot. And with the proper setup even reboots can be automated.


But that's not limited to cloud, you can do that for on-premise stuff.


There's a cognitive failure I see a lot where people incorrectly conflate two different new hotnesses like this.

Cloud? Automatically updating operating systems. On-prem? Must be stuck on an old version.

Microservices? You can release to production in minutes. Monolith? Must take you weeks to get a release out.

Go? Simple self-contained deployables. Java? Must need a JDK and Tomcat installed on the server.

It always seems to come up when the leading term is something dubious (cloud, microservices, Go), but the following term is something unequivocally good. Or maybe that's just when I notice it.


How do you do kernel updates without a reboot?

I've looked several times and found different technologies over time (kexec, (Oracle's) ksplice, kGraft, kpatch, livepatch). They do appear to have some use cases, e.g. delaying the need for a reboot by installing a critical vulnerability fix/workaround so that the reboot can be done at a more convenient time.

Because many of the patch mechanisms are function-based, they don't appear to solve the general problem in a way that avoids reboots altogether for arbitrarily large kernel changes. From my reading of the solutions, none are at the level of unattended upgrades via apt/yum-cron or similar, such that "most" could benefit without worrying too much about it (ksplice might do it, but I'm not sure how much it costs for server use and therefore how accessible it is).

kexec helps with skipping the bootloader/BIOS, but I'm not sure if it ends up restarting all the systemd services or going up/down the runlevels; some places suggest it reduces downtime but doesn't eliminate it. I've not experimented with any of these myself yet... so I'd be happy to be proven wrong and in any case learn more!

References:

- http://jensd.be/651/linux/linux-live-kernel-patching-with-kp...

- https://linux-audit.com/livepatch-linux-kernel-updates-witho...

- https://wiki.archlinux.org/index.php/Kernel_live_patching

- https://wiki.archlinux.org/index.php/Kexec

EDIT: forgot to mention livepatch


You don't need to avoid reboots if you have enough machines.


This is totally possible on CentOS. Up to CentOS 7, yum-cron lets you do this. With CentOS 8, this has been replaced with dnf-automatic, which offers a lot more flexibility on how to configure automatic updates.
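
A minimal sketch of that setup, assuming the stock option names in /etc/dnf/automatic.conf:

    # /etc/dnf/automatic.conf (excerpt)
    [commands]
    upgrade_type = security    # or "default" to take all updates
    apply_updates = yes        # actually install, not just download

    # then enable the systemd timer that drives it:
    systemctl enable --now dnf-automatic.timer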


[offtopic]

Awesome seeing you here! I used to be super active on Hypixel when I was in high school (I was actually #1 on the leaderboards for one of the games for over a year). One of my first large scale programming projects ever was creating a Hypixel mod for my guild. Years later I am now a software engineer working at Google on the YouTube algorithm!


Great to hear! Lots of the people who learned how to code because of Minecraft have ended up all over the industry. It's where I got my start, as well as many of my coworkers. That's part of our passion for making Hytale, to empower the next generation of youth who want a game they can tinker with and use to learn. If you haven't seen it yet definitely go take a look, got lots of cool plans for moddability and customization :)


Also offtopic, but my 12 year old is obsessed with Hytale.. Loves to read the blog updates and watch the videos and can't wait to eventually play it :)


As is my 11yr old daughter. Cannot wait to play it. Actually, she’s a budding designer, and all she wants to do is mod it!


Haha same, admittedly I don't have a child, I am the child, I can't wait to get into Hytale and work on mods, servers, whatever. Hypixel is singlehandedly the reason I got into Minecraft and Minecraft is the reason I ended up learning Java. That Java knowledge has earned me a respectable income as I go through college. I owe a lot to Hypixel/Studios man :D


Little late but though I would say hi. I too got started programming thanks to Minecraft. My first real job was working at Overcast Network (oc.tc). I remember having to scale out our infrastructure to seven dedicated servers after a popular YouTuber featured us. At the time that felt crazy for a Minecraft server and here you are now with hundreds of servers. Huge congrats on scaling to where you are today.

Have lots of fond memories of those early years, especially Minecon 2013.


This is the reason I like your site. It not only drags my son into coding (which neither the school nor I, to a certain extent, could do), but since he was admitted as a helper (or something like that) he is also learning other skills.

The downside is that this competes with school (thank god he is very good, so that's fine-ish).

The upside is that it is quarantine in France, so I can manage this.

Thanks for all of that!


Since we're OT anyway, thank you for all the work you do on the YouTube algorithm. It has brought a tremendous amount of value into my life!


people like the youtube algorithm?


I'll chime in and say hello as well! Great seeing Hypixel staying on top after all these years!


Cost. We currently pay less than €5000 monthly for 500TB/month in traffic and 50 Ryzen CPUs. Amazon would be $30,000 traffic + $100,000 compute.


Do you use any external dedicated host?


Hetzner.de


I think you forgot to include the cost of engineers in this calculation.


It's a one-person, part-time job. Most of the work goes into coordinating package updates with Ansible. I can include $1000 in the calculation if that makes you feel better ;)

Everything is deployed with Ansible, monitored with Monit. The servers have redundant PSUs and SATA hot-swappable HDDs, so you can fix minor hardware issues without having to reboot. As a result, we have more than a year of uptime on a typical server.

On the other side, we've been hit with several S3 outages. My feeling is that the cloud needs fully automatic fail-over because it fails much more often than all our bare metal servers combined.


If I could have engineers for $1000 who are able to build and operate infra better than AWS, then Jeff Bezos' fortune would look pale in comparison.


An iron rule of these discussions is that when someone objects to non-cloud solutions on the grounds of staff costs, they have not actually calculated the staff costs.


I know our current problem is finding proper staff, and that is why we are considering moving to a cloud provider. Especially for database hosting, using a database-as-a-service becomes very attractive from a staffing perspective, just because we can't seem to find good candidates. We've been training younger people into being DBAs, but they ultimately want to do something else after a while.


We use Amazon RDS. It's still 10x the price of bare metal, but that is acceptable and it's fast enough for most things. Plus they have automated backups and certified encryption.


RDS and the like still need about the same amount of DBA'ing, no?


I'm assuming it takes care of monitoring and managing space, disks, backups etc. I mean the true administration tasks, not helping with designing the database and queries.

As I said, we are considering but the analysis hasn't been done yet. If you have insight, please.


No.


I'd agree...

It's also for speed to market... Our infra setup where I work is extremely slow and at times takes over 1 month to set up a database... and a VM about the same.


Sure, because EC2 instances don't need any admin work. Unless you're doing serverless architectures, you'll still need sysadmins.


Engineers cost $130,000/month? That's a full-time salary for 14 network engineers (salary from Indeed).

There is no way you need that many for maintaining a fleet of 50 CPUs (chances are that in OP's case they're dual-socket servers as well).


On the other hand, you also need to pay engineers, and rather senior ones, to monitor and upgrade cloud solutions. They are just less visible: typically an extra duty for "devops" programmers and project managers rather than dedicated server farm staff.


$130,000 a month is not 14 network engineers. Salary is not the full cost of an employee to the business.


Even fully loaded it should cover at least 8-9 engineers.


> Cloud is great if your workload is variable and erratic

Or if you're just constantly iterating on a large product with many engineers. Those engineers' salaries almost always outweigh all of your cloud costs and so making them productive is cost effective. Things like SNS/SQS/S3/VPC/ELB/etc. save you countless hours and often make up for the increased cloud costs with increased developer productivity.


> Those engineers' salaries almost always outweigh all of your cloud costs

I think there is a divide here. Most people in this thread who mention cost as an issue are running some serious gear, and that does not come cheap. Your parent is running 700 servers plus some serious networking equipment, which is easily a couple million dollars. And he said that a cloud provider would 10x that cost.

If all you need is a little VM to run a website, cloud hosting is cheap. If you are running real infrastructure, a couple hours of developer time will never outweigh the astronomical costs of cloud hosting.


Not sure that's universally true. We have probably tens of thousands of EC2 instances in our infra right now. Just the cost alone to build out on-prem infra and migrate things over would likely wipe out a half-decade or more of cost savings. And in the meantime our shift in focus would hurt our market position.

If we'd started in the beginning (11 or 12 years ago) with our own infra, I imagine our recurring infra costs would be lower right now, but I also suspect that our 3-person founding team would have failed to produce an MVP before their initial time and money ran out.

If I were starting a company now, I'd probably do things in a more "platform agnostic" way such that a cloud->on-prem migration might be easier. But I still never expect it'd be easy.


If you implement things in a cloud-agnostic way, you can also save a lot by choosing a traditional web hoster/server hoster. But if you want your product to be cloud-agnostic, you lose almost every advantage of the cloud.

That's the whole reason we haven't seen a competitive reduction in prices between cloud providers yet.


> If you are running real infrastructure, a couple hours of developer time will never outweigh the astronomical costs of cloud hosting.

Over time, buying things yourself will always win out over cloud architecture, or the cloud business would be bankrupt.


It would be theoretically possible for cloud providers, specializing in what they do, to gain from economies of scale of various sorts (having their own large purpose-built datacenters, needing less total redundant equipment to have the same replacement speed, having local servers in all parts of the world, being able to use nighttime idle cycles in servers in one part of the world for daytime customers on the other side), which could theoretically yield savings greater than the profit margin that they charge.

Not saying this is what happens in practice.


To me the economics of scale is the whole point of cloud. If vendors don't deliver on that promise and makes it more expensive than hosting it yourself they have fundamentally failed and there's no reason, market wise or utility wise, for them to continue in the long run.


These economies could only put cloud ahead if they were larger than the provider's profit margin. If cloud saves 10% in costs but charges you a 30% markup, you still pay about 17% more (0.9 × 1.3 = 1.17).

AWS's margin is 20-25% (eg [1]). Is AWS's cost of goods sold really 25% lower?

Admittedly this calculation is tricky, because it's not clear what to include. Does that figure for AWS's margin include R&D? Should it?

[1] https://www.forbes.com/sites/greatspeculations/2018/10/17/ho...


If one could come up with an estimate of how many hours per year it would cost to maintain (plus the initial setup cost), and multiply that by $75/hr (a reasonable rate for a good engineer), then you could estimate how much it all costs.

I'd wager a guess that a properly-configured server won't actually cost that much to maintain. Unless you're frequently updating the OS and other stuff - but that's a software issue. I don't expect the hardware will fail that frequently at all. If there's a HDD failure, a good RAID system (with tolerance for 3 or 4 disks failing simultaneously) will alert you to it, and you'd grab the spare HDD (you should have a few spares), and pop it in, and the RAID would recover from the failure. What other hardware components frequently fail? RAM? CPU? Not really.

The engineer who sets up the server should be a generalist, and a full-time employee who, when not tending to the server, works on other stuff (like the product). Then you have someone on staff who is familiar with the system and can fix something that goes wrong without needing to ramp up on it first. I know a lot of programmers would love to set up a server. Anyone who enjoys building PCs (which many generalist programmers do) would probably love a one-month project where they pick parts and set up a powerful server.


In quarantine, I’m doing exactly that, fooling about at home with old enterprise gear you can buy seemingly by the pound on EBay now. It’s super fun and I somewhat enjoy getting all the power management right, all the networking setup, and ultimately running Proxmox and a bunch of small containers and VMs for things like Minecraft servers and file storage.

In my day job and my side projects, I’m 100% paying for AWS for 99% of our workloads. I choose to let Amazon deal with all the staff issues and all the other, to use their words, undifferentiated heavy lifting.

There’s some scale point at which you have to ask the question whether you ought to be in the cloud, but IMO if you think a cloud provider is only 50% more than you could DIY, you should probably be in the cloud.


That reads as fairly dismissive to me.

Most people running serious gear have serious engineering payroll overhead.

Sure, some folks are managing something fairly static like game servers, but many of us work for bigco with two dozen products and massive teams.

Even with thousands of ec2 instances we continue to move our on-prem infrastructure to the cloud because developer productivity saves us big money in the long run.


Does it actually save or does it just encourage waste by the developers and poor system design?

I’ve heard stories of companies having each PR spin up something like 20 ec2 instances to build up a whole deployment. A CICD design like that used to be a fireable offense. Now people see that and assume it’s saving them money because it would have bottlenecked before.


That’s a pretty extreme example of CI infrastructure, but okay... so what?

20 t2 mediums costs under a dollar an hour at list rates, and if you’re using a lot of AWS infrastructure can cost a lot less.

How much is the confidence that that PR will not break master worth to you? How many hours of engineer code review time would it take to achieve that same level of confidence in your build that spinning up 20 ec2 instances and running regression tests wins? How much would a failed deployment into production cost you, per minute?

Before you assume that it’s wasteful for any organization to throw cloud at a problem, realize there are many orgs for which a few hundred dollars of cloud compute spend per production release is an entirely reasonable trade off, not a firing offense.


> How much is the confidence that that PR will not break master worth to you?

That’s exactly the mental trap I’m talking about. Requiring 20 instances to try to mirror production instead of getting better testing in place is a sign of testing immaturity.

I have significantly less confidence in the products that use this testing strategy because it really means they don’t have much in place for testing infrastructure to stub things out, inject failures, etc.

And it’s not t2 mediums. It’s whatever is specced for production because we’re such good engineers that we want to test in a production like env, right?

> Before you assume that it’s wasteful for any organization to throw cloud at a problem, realize there are many orgs for which a few hundred dollars of cloud compute spend per production release

That’s not even close. It’s hundreds of dollars a day leading up to thousands per release with larger teams.


So I've set up processes like you described, where every PR would spin up 20 EC2 instances, run a huge number of tests in parallel, and then shut itself down. The savings associated with developers getting feedback on a full regression test of the system in 20 minutes vs. 400 minutes were significant.
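
For a sense of the trade-off, a small Python sketch (using the t2.medium list price quoted upthread and the $75/hr engineer rate mentioned elsewhere in the thread; it assumes near-linear parallel speedup and one engineer blocked on the result). The striking part is that the compute cost is identical either way; the fan-out only buys back wait time:

    INSTANCE_PER_HOUR = 0.0464   # t2.medium list price, as quoted upthread
    ENGINEER_PER_HOUR = 75.0     # assumed loaded cost of a blocked developer

    def cost_per_run(workers, serial_minutes=400):
        # Assumes the suite parallelizes near-linearly and one engineer
        # is blocked waiting on the result.
        wall_hours = serial_minutes / workers / 60
        compute = workers * wall_hours * INSTANCE_PER_HOUR
        waiting = wall_hours * ENGINEER_PER_HOUR
        return compute + waiting

    print(f"1 worker:   ${cost_per_run(1):.2f}")   # ~$500, almost all wait time
    print(f"20 workers: ${cost_per_run(20):.2f}")  # ~$25; compute cost unchanged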


That’s parallelizing, which is not what I’m talking about. I’m talking about pointlessly building up huge environments that “match production”.


If it spins those EC2 instances up, runs for an hour of "waste" and then turns off (or the alternate cluster turns off when no longer in use), it very likely is saving money. It might be saving money in that very moment; it might be saving money invisibly when "time to patch the servers again" is a complete non-event for staff time.


If you have ten thousand EC2 instances, you also have serious engineering payroll overhead.


In that case use Linode and not AWS - I ported a little POC system I had built to Linode, and it cost 10x as much to run it on AWS.

Though I did get an understanding of the basics of AWS out of it, it was probably not the best use of the shareholders' money.


No, they don't. The things you mentioned are very easy to deploy on-premises, and they won't take more than one man-month to deploy and maintain.


That's an incredibly naive assessment and absolutely not true in practice, except perhaps for some orgs where the stars just happen to align to make that a reality. Most places build "cloud" heavily into their frameworks, tooling, hiring, and, well, everything. Even just swapping out a single hosted/managed component (as my company did a few years ago, replacing Kinesis with Kafka) can require a lot of up-front work[0] and on-going maintenance.

[0] Especially if you're already in production with the thing you want to replace, and want to transition without downtime for your customers.


I love that you think mature products that took teams of highly skilled people at Amazon years to refine can just be slapped together in a month in your colo.

What a joke.


Riiight, because you don't need people to keep the cloud stuff configured and working /s


Yes. I wonder about Second Life, which is doing a "cloud uplift" to AWS. The parts of the system that are variable-load and web-like are already on AWS. But the region servers (one CPU for each 256m x 256m region) are owned outright, and sit in a colo in Phoenix. They're obsolete machines, but they are compute-bound, with a constant load 24/7. Even when no user is in a region, the simulated world continues to run. It uses a bit less CPU, but the physics engine is still doing 45 cycles every second, and all scripted objects are still running. Leaves fall from the trees, plants grow, animals graze, trains run, whether or not any human is watching.

They think AWS will be cheaper. I hope they are right, but have doubts. Fortunately, they're doing this slowly and carefully, and if it turns out that AWS is too expensive, they should be able to move back or to elsewhere. Since what they're doing isn't even close to what AWS normally does, they're not that tied to AWS features.


>> Leaves fall from the trees, plants grow, animals graze, trains run, whether or not any human is watching.

Well that solves the age old question about trees falling in forests :-)


That's a poorly implemented universe. Unless information propagates, it didn't happen ;-)

You could always fast forward the current state when the first external observer arrives. But I get there are time constraints - you can't just stop the real universe until it converges to a consensual state.


This does bother me about the whole "likely we are living in the matrix" thing. If we want to design a universe that doesn't waste a lot of processing power, we need things like a speed-of-light limitation (and to keep it fairly slow, too). But the sensible idea is to only process lazily - don't calculate till the observer observes - and that is very much like dependency hell.


I think they need to implement a tree tree.

So if a tree falls in a forest and no one is around to hear it, it won't hit the sound code or the display code.


I remember a post here a while back from a guy running Bitcoin mining in his dorm room. One day he realized he could offer his spare cycles to grad students with heavy computing workloads, undercutting cloud computing prices while making far more than he did from mining alone.

It's very weird that we haven't fully arbitraged $/instruction to a single (low) price yet (or storage/hosting, whichever).

If only there were an Uber for unused cycles or storage... let everyone turn their unused capacity into mini AWSs with a common interface and safety and reliability guarantees.

Maybe there are too many barriers, like security.


Then he got tracked down by the dorm and expelled, because he or his customers were running Bitcoin miners. Turns out the dorm has to cover the electricity, and that's not a supported use case for them or the dorm's insurance.


> If only there were an Uber for unused cycles or storage... let everyone turn their unused capacity into mini AWSs with a common interface and safety and reliability guarantees

This is exactly what Sia.tech is building for storage. A competitive marketplace for storage providers where anyone can set up a host and start selling their spare storage space. It's not completely finished yet, but it's pretty close.

A bunch of companies are already building products on top of it (including me)


How much replication do you need to make this kind of thing reliable, since anyone can pull storage at any time? Does that then make transactions super slow? Any numbers you can put on this would be helpful. Also, what's your product? (just curious)


Sia uses Reed-Solomon encoding to spread files over 30 hosts with a redundancy of 3x, so any 20 hosts can fail at the same time and the files would still be available. This does mean that you need to upload all your data 3 times. Sia also needs to monitor your uploaded files all the time to make sure the hosts behave: hosts periodically run checksums on the stored data to prove to the client that the data is still there.
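
For illustration, the bookkeeping behind those parameters as a Python sketch (this is just the arithmetic, not a real Reed-Solomon codec):

    n = 30                 # total hosts a file is spread across
    redundancy = 3         # upload overhead: every byte is stored 3x
    k = n // redundancy    # shards needed to reconstruct: 10

    max_failures = n - k   # any 20 hosts can fail, as described above
    file_mb = 100
    print(f"need any {k} of {n} shards; tolerates {max_failures} host failures")
    print(f"a {file_mb} MB file costs {file_mb * redundancy} MB of upload")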

The team is currently working hard on performance upgrades. On regular consumer hardware you can currently upload / download data at a rate of about 500 Mbps. The next release is expected to improve this significantly.

Here's an introduction article which explains how Sia works with a bit more depth: https://support.sia.tech/article/dk91b0eibc-welcome-to-sia

My own product is a file sharing website: https://pixeldrain.com. It uses Sia to store large files because it's cheaper than conventional cloud storage. I plan to make it possible to download directly from Sia hosts as well so I can save on bandwidth costs too.


This is really interesting!

How does Sia prevent hosts from precomputing the checksums to fake they are behaving but erasing the data itself? Does it checksum over random ranges of data?

Which source does it use for entropy so that the network remains distributed but nodes can't predict the ranges? Does it use the last block nonce?

Which checksum algorithm does it use? Is care taken as to not be vulnerable to prepend or append attacks from hosts who intend to host data partially whilst pretending they are hosting full data?


Sia founder here. The hashing algorithm we use is blake2b. Definitely secure.

We do probabilistic proofs, so we have the host provide us a small random sampling of actual data (so the host can't rely on precomputing), plus a proof that this actual data is what the contract says the host should be storing.

See chapter 5: https://sia.tech/sia.pdf


I'm not entirely sure on the specifics of storage proofs, but as far as I know it's something along these lines:

When uploading data the renter (that's what we call the node which pays for storage) computes a merkle tree of the data which the host should be storing. When a contract is nearing its end the host will enter a proof window of 144 blocks (1 full day) in which it will need to prove that it is storing the renter's data. The proof is probably based on the block hash of the block where the window started. The host stores the proof in the blockchain and the renter will be able to see the transaction. If the proof matches the merkle tree (which the renter has stored) the contract will end and the host will receive the payment and their collateral back. If the proof is invalid or was not submitted at all the renter can cancel the contract which destroys the funds in it. The host won't get paid and loses its collateral, but the renter also won't get their money back (to discourage the renter from playing foul)
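
To make the merkle-proof idea concrete, here's a toy Python sketch (using blake2b, per the sibling comment; the segment size and tree layout here are made up, and this is not Sia's actual on-chain proof format). The renter keeps only the 32-byte root, and the host can prove possession of any segment with log2(n) sibling hashes:

    import hashlib

    def h(data):
        return hashlib.blake2b(data, digest_size=32).digest()

    def merkle_root(level):
        level = list(level)
        while len(level) > 1:
            if len(level) % 2:             # duplicate last node on odd levels
                level.append(level[-1])
            level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
        return level[0]

    def merkle_proof(level, index):
        # Sibling hashes needed to re-derive the root from one leaf.
        proof, level = [], list(level)
        while len(level) > 1:
            if len(level) % 2:
                level.append(level[-1])
            sibling = index + 1 if index % 2 == 0 else index - 1
            proof.append((level[sibling], index % 2 == 0))
            level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
            index //= 2
        return proof

    def verify(leaf, proof, root):
        node = leaf
        for sibling, leaf_is_left in proof:
            node = h(node + sibling) if leaf_is_left else h(sibling + node)
        return node == root

    # The renter keeps only `root`; the host later proves it still holds
    # segment 5 by sending that segment plus three sibling hashes.
    segments = [h(bytes([i]) * 64) for i in range(8)]
    root = merkle_root(segments)
    assert verify(segments[5], merkle_proof(segments, 5), root)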

There is some more info on this on the wiki: https://siawiki.tech/about/trustlessness and the website: https://sia.tech/technology. And here is some incomplete technical documentation: https://gitlab.com/NebulousLabs/Sia/-/blob/master/doc/Resour...

If you want to go more in-depth you can go on our Discord where lots of developers hang out, eager to help others to get started with the network :) https://discordapp.com/invite/sia

EDIT: The whitepaper is of course the best source of knowledge. It's quite old at this point but the core principles still apply https://sia.tech/sia.pdf


Awesome :) I was able to upload a big file and immediately share it and view it in the browser! Very nice! What PDF viewer component did you use? I also like the pastebin functionality.

I'm sure you hear this a lot but... has anyone done a head-to-head comparison with Filecoin?


Thanks, glad you like it. I use Mozilla's pdf.js. It's super simple to implement: just load it in an iframe with the path to the PDF file in a URL parameter. Et voilà, a PDF viewer.

Comparing with Filecoin is hard because there's not much information available about it. The rollout keeps getting delayed too. I know that the founder of Sia has criticized Filecoin's whitepaper a few times because it contains unsolved problems which could cause significant issues during the rollout of the network. Sia took a more conservative approach and worked out all the math before the development of the network started in 2015. Now, 5 years later, Sia has solved all the fundamental issues with the protocols and such and are working on upgrading the performance and building apps on top of Sia's core protocols. In terms of development Sia is about 3 years ahead of Filecoin.


Awesome, thanks for your generous replies!!

Will spread the word about Sia and Pixeldrain when the topic comes up :)


Remember that the "cost per instruction" ought to be very different depending on a lot of other factors, including but not limited to:

- how much data do you need to move to actually perform the computation

- whether the computation performance is expected to be reliable or if best-effort is acceptable

- whether there are confidentiality and accuracy requirements on the input and output of the computation.

Most software engineering teams are not working at a granular enough level to properly describe which parts of their computations and data are expected to be reliable or not (in availability, confidentiality and integrity). However this can impact the cost per instruction of a computation by multiple orders of magnitude.


> If only there were an Uber for unused cycles or storage... let everyone turn their unused capacity into mini AWSs with a common interface and safety and reliability guarantees.

Funny, this was our idea with our first startup 18 years ago: federation of unused storage and redistribution. There were no takers, as no one wanted to contribute their unused storage but everyone wanted others'. It was much easier to centrally locate our own storage and allocate it to users who required it.


I worked for a boss who pretty much demanded that we move to the cloud. I showed them the costs for the then-2 providers (GCP/AWS) and arrived at the exact same conclusion on server hosting alone, as bandwidth wasn't the main driver of our application. The rationale was that we'd save so much money by not having to manage the servers ourselves, but we honestly spent much, much more time on software deployments than on managing hardware migrations and failures.


To be clear, it took much more time to deploy your own software to the cloud relative to on-prem?


I interpreted it as: prior to moving to the cloud, server maintenance was not a very significant cost, so the justification for moving to the cloud was weak.


> Cloud is great if your workload is variable and erratic

I would say also, the cloud is cheap if you can shut down or scale-down services when you need it.

IMHO if you own your software stack, you can use the cloud to rationalize your spending (for instance, Netflix changed their default codec some weeks ago, which saved them a lot of money in egress bandwidth)... but not everybody can do this.

If your company runs prepackaged software, like SAP or Manhattan, you don't have the margin to shut down services when they're not needed (at least for now).

Also, after a long time working in a datacenter, I am still amazed at how powerful bare metal has become: for less than $30K you can get a 1TB server (without storage) with at least 40 cores. Some vendors are even offering hardware as Opex in a pay-as-you-go setting to compete against the cloud.

(disclaimer I work for a Google Cloud partner)


Netflix does not push their video streams out of AWS (because that would be stratospherically unaffordable for them). They run their apps and systems on the cloud (search and preferences), but the video bits come from Netflix racks distributed to a variety of DCs and ISPs.


> IMHO if you own your software stack, you can use the cloud to rationalize your spending (for instance, Netflix changed their default codec some weeks ago, which saved them a lot of money in egress bandwidth)... but not everybody can do this.

Netflix's content serving is done from physical hardware they own and place in datacentres. I don't understand your point.


> Some vendors are even offering hardware as Opex in a pay-as-you-go setting to compete against the cloud.

Isn't that how IBM ran their mainframe business 50 years ago? The more things change...


What I have generally seen is that storage - blob (S3) or database - is the blocker. Even if you go to k8s, the hassle of managing storage is non-trivial.

And because of this singular fact, startups cannot move to on-premise. You really don't want to manage snapshots, restores, and backups.

Anyone can manage application servers.


S3 (or similar) and managed databases seem to be the most "commoditized" (and cheapest) pieces of the cloud services stack. So it seems reasonable to move everything else to cloud-agnostic/on-prem while still leaving those pieces in the cloud.


How do you solve for egress traffic? The traffic between the server and the database starts being the most expensive part of the stack.


I don't understand. You're talking about a filesystem, with an option for backing that filesystem up?


> about 700 rented dedicated machines to service 70k-100k concurrent players

Slightly off-topic, but it sounds like a single machine can't handle more than 150 concurrent players. How is this so?

Is Hypixel Minecraft that resource intensive?

What's the bottleneck — it is the CPU, or RAM, or the network latency(?) per machine, or is it something else?

> We push about 4PB/mo in egress bandwidth

With 70k players, 4PB / 70k is ~57 GB per player per month. Dividing that by the number of seconds in a month (2.628e+6) gives: 57 x 10^9 / 2.628e+6 ≈ 21.7 kB/s.

Over a month, on average, a single player consumes a bandwidth of ~21.7 kB/sec, or about 174 kbps. This is an extraordinarily low per-player bandwidth consumption. At 150 players, each machine would average 26 Mbps of network traffic, which is fairly low as well. Of course, the machines have to be capable of handling possibly much higher peak usage, but even an order of magnitude more (260 Mbps) is something a Raspberry Pi 4 (which supports full-throughput Gigabit Ethernet) can do.
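
The same back-of-envelope as a runnable Python sketch (SI units; the 4PB/mo and 70k-player figures are from the parent post):

    egress_bytes_mo = 4e15     # 4 PB/month, from the parent post
    players = 70_000           # low end of the stated concurrency
    secs_mo = 2.628e6          # seconds in an average month

    per_player = egress_bytes_mo / players / secs_mo     # bytes/sec
    print(f"{per_player / 1e3:.1f} kB/s = {per_player * 8 / 1e3:.0f} kbps")
    print(f"per 150-player box: {per_player * 8 * 150 / 1e6:.0f} Mbps")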


About 60 of those machines are custom L7 frontend load balancers, and an additional 40-50 are database and dev machines. As for the actual gameservers, that's about the numbers we expect on average (150/box), although it varies wildly by gametype. Simple games of Skywars or Murder Mystery with no AI and an ultra-slimmed-down world we can fit 300+ players per box and are CPU-bound. Housing or Skyblock servers with full AI, dynamic worlds, and player-built structures we might only be able to fit 100 players per box and are RAM-bound.

You're pretty accurate with the per player bandwidth. We measure it at an average of 200kbps/online player. We are incredibly compute-intensive, though, and these machines are all operating with E3-1271v3 CPUs, 32GB of RAM, and ~100GB SSD.


Thanks for the insightful comment.

I saw you mentioned in another thread Minecraft performs better when the CPU has better single-threaded performance. I'm guessing, going forward into the future, you'd probably want to build machines with Zen 2 AMD Ryzen CPUs (or future Zen 3 Ryzens), like the Threadripper 3960X (which has a base clock of 3.8 GHz, 24 physical CPU cores, and costs ~$1200 — so $50 for each of those cores). Or, the AMD Ryzen 9 3950X, which is about $40 per core (when you get it on sale), and despite a lower nominal base clock actually performs better on single-threaded benchmarks (e.g. https://www.cpubenchmark.net/singleThread.html).


It's definitely something we're keeping our eye on. The challenge with that is that because Threadripper is so new, it's only just now beginning to be supported on server motherboards. In a few years we're hoping that there'll be enough competition in the AMD Ryzen motherboard market that we can get comparable pricing to what we're getting now for cheap Intel boards.


You should check out Hetzner's AX series. Even their cheap (39 euro, with ECC) Ryzen 5 3600 gives us 4 times the performance of our E3-1245 CPU.


To elaborate, we've worked with many customers looking to get out of "slinging metal" and focus on what matters most to their business. Some have said that the revenue generated justifies the cost of <enter one of the big 3 here>, BUT others (and I think you'll see this more over the next decade) are looking at automated bare metal. You get the primitives of the cloud but on physical servers, and more importantly with economics closer to running your own gear, especially if you consider TCO: power, network, ops people, etc.


Packet ^^


Disclosure: I work at Oracle Cloud on the product management side (not sales ;-) )

I'd like to chat with you about this; we charge a flat rate for dedicated connections, and about 95-97% less than AWS on egress, and are just starting to talk to people about it - it's how we picked up Zoom and 8x8 (bi-directional video has huge egress charges). Let me know if you are open to chatting; there wasn't an "exec team" link on your webpage to reach you directly.


The Hypixel servers are at maximum capacity during the peak 4+ hours every day, meaning additional players cannot join. That would be an obvious problem for other kinds of applications.


This isn't a limit of our ability to procure hardware, but instead a limit of one of our 7-year-old monolithic applications. We're working on increasing its stability at high player counts, but it's a totally separate issue from server provisioning and cloud vs bare metal.


I see. Out of curiosity, what is the limiting application?


An in-house application that we (I) developed in order to handle game balancing, queuing, and overall network state for a multi-LB Minecraft network. The application's name is MasterControl, and its architecture and design (or lack thereof; it was literally the first program I ever wrote) have been the source of many headaches over the last few years. Had I known years back that we would see the kind of player numbers that we have this year, we 100% would have completely rewritten it into our modern microservice architecture.


Sorry, but what is LTO in this context? Google turns up Link Time Optimization and Legal Tribune Online, and neither seems quite right...


Not OP, but he most probably meant lease-to-own. In my experience it's a financing model that spans the equipment's economical lifetime - let's say 3 years for a server box. At the end of the contract you either pay the last instalment, which is bigger (5x-10x) than usual, and keep the equipment, or you return it and get a new one.

There are many variations of this scheme but that's more or less the idea.


Probably "Lease To Own" a.k.a. "hire purchase" which is kinda similar to buying something on long-term credit.


That is where the margins come from on AWS: they are charging 10x the cost of the hardware. Even the AWS Well-Architected Framework says that there are always tradeoffs between disaster recovery and cost. For your business, cost is more important than having your servers replicated across many time zones.


Servers aren't replicated. As a matter of fact, AWS began out of unused machines offered as ephemeral servers.


What about the other services? You're running a Minecraft cluster, so you're fairly detached from what running a traditional service is like these days. Will those same sysadmins build API Gateway for me? Or SNS? If you look at AWS and see only its EC2 offering, then that's part of the problem. I just moved a customer from on-prem to AWS, and they were flabbergasted at the performance difference (their processors being 3-4 years old at this point); some of their processes went from running for hours to a few minutes, for zero capex. Then there's adding capacity. Need a new server to spin up your app? Brb while I go call Dell, ETA 2 months...


Very late into the discussion, but one thing that bothers me a lot is that server performance/price hasn't improved much over the last 5+ years. Memory prices over a long period have essentially been flat; the only thing that has gotten cheaper is NAND. I was hoping to get double the cores and memory of a server for the same price in 5-6 years' time, and that hasn't happened.


It sounds like you've got a good setup going with colo, but just as a way of illustrating some of the small providers lower bandwidth costs:

DigitalOcean gives 3TB bandwidth on a $15 2CPU/2GBMEM/60GBSSD instance. If you ran 1350 of them it'd cost you ~$20k/month and get you your 4PB egress within bandwidth allowance.
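
For reference, the arithmetic behind those figures:

    droplets = 1_350
    print(f"egress quota: {droplets * 3 / 1000:.2f} PB/mo")  # ~4.05 PB
    print(f"cost: ${droplets * 15:,}/mo")                    # $20,250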


The other problem for _our_ particular use case is that Minecraft is single-threaded, so it performs better on a quad-core 4GHz than an octa-core 2GHz, for example. Because of this, we use almost exclusively E3-12xx lineup CPUs at 3.6GHz+. I don't believe DO publishes their underlying hardware specs, but last time that I ran benchmarks it performed in the 2.0-2.4GHz range, like most cloud providers. Even if their CPUs were identical, it would cost us $112,000/mo at stock pricing, which is significantly more than we're paying for our current deploy.

Your point does stand, though, that not all cloud providers have the bandwidth price gouging. I was mainly referring to GCP/AWS/Azure, who set the trend for the rest of the major providers.


The Bandwidth Alliance is only for routing HTTP traffic, but a game server will have UDP traffic.


Bandwidth Alliance has nothing to do with this, afaik. Minecraft is a TCP game.


> DigitalOcean gives 3TB bandwidth on a $15 2CPU/2GBMEM/60GBSSD instance.

DigitalOcean is a VPS provider, not a cloud provider, and that's pretty similar to pricing on the nearest comparable AWS service (Amazon LightSail), not something showing DO as notably better (1Core/2GBRAM/60GBSSD/3TB transfer @ $10/mo or 2Core/4GBRAM/80GBSSD/4TB transfer @ $20/mo.)

Though if you are optimizing price per TB of transfer quota, LightSail does best with the 1Core/1GB/40GBSSD/2TB transfer @ $5 instance size.


> DigitalOcean is a VPS provider, not a cloud provider

From the first paragraph in https://en.wikipedia.org/wiki/DigitalOcean

> DigitalOcean, Inc. is an American cloud infrastructure provider[2] headquartered in New York City with data centers worldwide.[3] DigitalOcean provides developers cloud services that help to deploy and scale applications that run simultaneously on multiple computers.

You might have your own cute definition of “cloud”, but it doesn’t match the industry so get over it and stop correcting people.


Who are you getting your bandwidth from?


This! I very much want to hear about how you get your bandwidth.


We actually just get a datacenter blend, but I _believe_ they've got us cordoned off on our own dedicated 40gbps Level3 (now CenturyLink) line, to prevent attacks aimed at us from harming other customers. It's not so much about which provider you go to; it's more about the economies of scale. Our entire fleet exists in one physical DC location due to the requirements of Minecraft's netcode, so rather than having five 10gbps lines at five facilities, we've got one 40gbps line. Because of that, the ISPs don't meter by the GB, but by 95th-percentile traffic. At those scales, we're paying pennies on the dollar compared to public "by-the-GB" pricing.
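
For anyone unfamiliar with 95th-percentile billing, the idea is that the carrier samples the port every 5 minutes, throws away the top 5% of samples, and bills on the highest remaining one. A small Python sketch (the traffic numbers are synthetic, and real carriers vary in the sampling details):

    import random

    # One month of 5-minute samples (30 days x 288/day); synthetic traffic.
    samples_mbps = [max(0, random.gauss(25_000, 6_000)) for _ in range(8_640)]

    samples_mbps.sort()
    billable = samples_mbps[int(len(samples_mbps) * 0.95)]

    print(f"absolute peak: {samples_mbps[-1]:,.0f} Mbps (free, inside the top 5%)")
    print(f"billed at:     {billable:,.0f} Mbps")

This is why short bursts (like a DDoS or a launch-day spike) can be effectively free, while sustained throughput is what you actually pay for.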


What do you do about backups? I saw you only have one DC, are you worried about that DC being flooded etc and losing all your data?


Offsite AWS Glacier backups, as well as a warm redundancy in another DC.


This is one of the valid use cases of on-prem, predictable high network bandwidth workload. This is the exception though. An average company has the exact opposite. Non-predictable, low bandwidth workloads that are different (like data warehouse vs. website).


Agree on the cost for a sane organization. One draw with the cloud is that not all organizations are sane, so doing the work in house / on prem can be more expensive in terms of money, time, and politics than outsourcing to the cloud.


Off topic: could you share how you got started in the business? I thought running game servers was a low-margin business. Is there a particular size you have to reach for it to pay off?


I'm super curious how you scale these Minecraft servers. I have a $60 Google Cloud instance and it can barely handle 3 players with a few mods without lagging a few times per hour.


It's a bit off topic, but the challenge there is similar to a reply I wrote elsewhere in this thread about clock speeds. Minecraft is almost entirely single threaded, which means that you could get a $400 Google cloud instance and still only be able to hold those three players. To scale past that you're going to need higher clock speed processors. Take a look at OVH's "GAME" series offering or any dedicated host with E3's or i7s.


Thank you for all of the great times playing Sky Wars!


Ooh completely unrelated to the thread but thanks for offering my friends and I countless hours of fun during our middle school years!


What about the cost of personnel (operations/administration) and the cost of maintenance plus the electricity bill?


I’d be interested in your take on AWS Outposts - do you see potential in this type of offering?


I'll be totally honest, I hadn't heard of Outposts till you mentioned it, but from my quick glance I'd say there's definitely potential in it, but I don't think it fits our use case particularly well.

One of the challenges for us is, to be honest, we're rather spoiled having run exclusively off-the-shelf open-source tech on our own hardware. It's difficult to start paying per million DB queries, for example, when we've been paying a flat rate of $X/mo for the past 7 years. On top of that, our team is very comfortable operating and managing these services in-house, and while it would free them up to focus more on dev tasks, it's a tiny gain compared to the cost increase of moving to SaaS model.

Having said that, though, I'm definitely going to look into Outposts more since it seems much more usable than full-cloud, so thanks for bringing it to my attention!


Are you actually on-premises at your HQ, or are you colocated to a datacenter?


Neither. We've got a long-standing positive relationship with our data center and are operating on the same hardware that we started renting 4 years ago. Because of our initial investment to build up the fleet, we're now basically paying just the power and networking, plus a small markup, for our core fleet and fairly standard pricing for additional month-to-month machines that we spin up and down based on seasonal demand.


Let's not forget about Packet! ;-)


Which dedicated infra host did you use?


We bounced around through a few different hosts in the first two or three years of our operation, but for the past four years we've been consistently with SingleHop, now an INAP company. They've been great through the entire time that we've worked with them and have saved our asses on more than one occasion, especially back before mitigation tech like CloudFlare Spectrum was available.


So what is going to happen now that INAP has filed for Chapter 11?

(Interestingly, there is no official word on the website, but that is what Wikipedia shows.)


thanks for all the detailed info!

I find this hard to square with the evidence that the Netflixes of the world still find it worthwhile to pay AWS. There's no way that they don't have the same problems you do. Care to speculate on why they prefer to pay a vendor? (It seems the Dropbox story is the exception that proves the rule.)


Netflix doesn't host its actual content on AWS. They built their own CDN and have their own dedicated servers all over the world. https://openconnect.netflix.com/en/


I work in Livermore Computing at LLNL.

We manage upwards of 30 different compute clusters (many listed here: https://hpc.llnl.gov/hardware/platforms). You can read about the machine slated to hit the floor in 2022/2023 here: https://www.llnl.gov/news/llnl-and-hpe-partner-amd-el-capita....

All the machines are highly utilized, and they have fast Infiniband/OmniPath networks that you simply cannot get in the cloud. For our workloads on "commodity" x86_64/no-GPU clusters, we pay 1/3 or less the cost of what you'd pay for equivalent cloud nodes, and for the really high end systems like Sierra, with NVIDIA GPUs and Power9's, we pay far less than that over the life of the machine.

The way machines are procured here is different from what smaller shops might be used to. For example, the El Capitan machine mentioned above was procured via the CORAL-2 collaboration with 2 other national labs (ANL and ORNL). We write a 100+ page statement of work describing what the machine must do, and we release a set of benchmarks characterizing our workload. Vendors submit proposals for how they could meet our requirements, along with performance numbers and test results for the benchmarks. Then we pick the best proposal. We do something similar with LANL and SNL for the so-called commodity clusters (see https://hpc.llnl.gov/cts-2-rfi for the latest one). As part of these processes, we learn a lot about what vendors are planning to offer 5 years out, so we're not picking off the shelf stuff -- we're getting large volumes of the latest hardware.

In addition to the cost savings from running on-prem, it's our job to stay on the bleeding edge, and I'm not sure how we would do that without working with vendors through these procurements and running our own systems.


I interned at LLNL on three separate occasions during school. Everything about what y'all do in high-performance computing is levels beyond anything I've gone on to see in the "real world". I cannot drop enough praise.


thanks!


In many similar contexts, buying hardware is different than buying services (e.g., AWS time), in terms of how you interact with funding agencies. Because of those realities, you might sometimes be unable to "go cloud" even if it's actually cheaper.

Also, licensing can be a big problem. Some software packages understand licensing for a "real" cluster, but have no concept of how to license for the cloud.

Finally, think about all of your interactions over the last ten years with companies behind the major cloud services. Which of them would you really trust, if it's your money/job/reputation on the line?


I'm a former LLNL lattice QCD postdoc and current user of the LLNL machines (and others). I have to say, I was wildly spoiled.

Every other computing center where I have an account cannot hold a candle to the helpfulness of the LLNL support, the ease of use (except for the 2-factor requirement, which I understand), and absolutely insane power.


> All the machines are highly utilized, and they have fast Infiniband/OmniPath networks that you simply cannot get in the cloud.

It's weird that these networking technologies are not used more in "plain" datacentre settings, since network latency and throughput have to be a significant challenge to scaling up non-trivial workloads and achieving true datacentre-scale computing. We hear a lot about how to "scale out", but that's only really feasible for relatively simple workloads where you just seek to do away with the whole issue of keeping different nodes in sync on a real-time basis, and accept the resulting compromises. In many cases, that's just not going to be enough.


There are a lot of people from the National Lab super computer world who end up in High Frequency Trading for just the reason you describe.

Specifically, how do you optimize a large cluster of computers to operate at the lowest possible latency. For the National Labs, those computers could be in the lab or with other labs around the world. For the HFT folks, the machines could be in an exchange or spread across multiple exchanges around the world.

Source: I used to be head of Global Latency Monitoring for a HFT.


Hey I’d love to pick your brain about what it takes to do this kind of work coming from a web/systems engineering background. Shoot me an email at samheutmaker at gmail if you’re open to it.


I'm curious why you moved to LLNL from HFT?


Money is a safe guess. Research pay scales aren't even close to private sector, especially not finance.


It sounds like the opposite direction happened here.


Umm, most workloads aren't so compute-intensive that CPU interconnect speed would have a huge impact. The vast majority of rented compute tasks will be things that need to communicate with external services (payments, region synchronizing, general API state comms, etc.). In these cases, I doubt that your code is tuned tightly enough to reveal interconnect latency (as opposed to cache misses and swaps). Besides, hooking up interconnect runs gets ludicrously complicated and expensive fast. Like, $50k in cables and a week of experimenting with NUMA zones and kernel settings for a small cluster. I can only imagine the headaches of orchestration on national lab clusters.

Source: I have set up, and currently manage a small cluster which is always pegged. I also collaborate with folks at a couple of the national labs for scientific research.


Infiniband and friends aren't CPU interconnects; they're alternatives to Ethernet. Network latency is important, and it's not that hard to write code whose performance is bounded by io latency/throughput.


Well they can be both, so I can see how parent might have gotten confused.


True enough.


> it's not that hard to write code whose performance is bounded by io latency/throughput

I suspect that most people don't, though. How many applications are built in Ruby, Python, or PHP, using some very productive but inefficient framework? Those are all very fine tools in their own way, and are the right tool for the job in many contexts. But they're unlikely to saturate IO.


Those technologies, or mostly IB, were more important when Ethernet speeds lagged behind; IB was first to 100G. Now that Ethernet has caught up, IB has almost no advantage besides a tiny latency decrease. Even Mellanox makes more Ethernet than IB products now.


I thought that process-to-process, Ethernet still has several times the latency of IB?


We're talking about a few hundred nanoseconds usually. If that matters, then IB is still better.


Most business workloads simply aren't CPU bound, nor amenable to parallel computation. The public clouds would offer it if they had demand for it.


Surely network latency (and perhaps throughput) is just as relevant to IO-bound and transactional workloads though? At least if they need to be served by a multiple-node cluster.


> it's our job to stay on the bleeding edge, and I'm not sure how we would do that without working with vendors through these procurements and running our own systems.

I think this is key. Also the fact that you guys have a land grant so your datacenter costs are cheaper than most. :)


I grew up across the street from the Lab (on Cheryl Dr.). Now in my early 30s, I'm an AWS architect and software engineer at a finance company in Bozeman, Montana, but I would kill to be an engineer in pursuit of knowledge, ya know? I dream of working there some day... and I miss home.


Hey, I’m in Bozeman too. We should get lunch sometime.


Sorry about a slightly off-topic comment. Working on a deep tech startup, with a significant segment of potential customers being national labs, I'm very interested in learning various aspects [beyond the basics] about the process(es) that national labs use to procure application software (not the system one [OS distributions, monitoring etc.], but rather domain-specific one, e.g., scientific data management, modeling and simulations). I have looked at relevant sections on some of the labs' websites and my initial impression is quite fuzzy. If you are familiar with this subject, I would very much appreciate if you could clarify that. If you prefer, you can reach me at aleksandr <dot> blekh <at> Google's e-mail service.

P.S. I do have connections at some national labs, but thought that you might shed some additional light on this.


I used to work at Pacific Northwest National Laboratory. Their flagship computational chemistry application, NWChem, is developed mostly by their own employees. They also collaborate with researchers from other national labs and academia. Development started in the 1990s. When I was at the lab, it was free-as-in-beer but required onerous licensing agreements to get a copy of the source code. It has been open source since 2007, and has started using GitHub to coordinate development in the last few years.

http://www.nwchem-sw.org/index.php/Main_Page

https://github.com/nwchemgit/nwchem/


I appreciate your comment and links (BTW, I'm slightly familiar with NWChem and similar software :-). Having said that, I'm not sure how your comment addresses my question. You are talking about internally-developed software transfer from the lab to external users, whereas my use case is delivering externally-developed commercial software to national labs' users, a value transfer in the opposite direction. Care to comment on this aspect, if you're familiar with relevant procurement processes?


I can speak at least a little bit about my field (computational chemistry). A lot of academic software is free and/or open source (like NWChem). These are installed by the cluster support for use by anyone. This is helpful because the software can be optimized beyond what most end users can do.

Beyond that, a lot of academics run custom software or uncommon packages. By definition, many academics are doing something no one else has really done, so they are writing their own software and then running that on the clusters.

There is one large commercial package that takes up a significant amount of time on many clusters -- VASP. But no idea the details of how that ends up on HPC systems :)

My impression from running on lots of clusters is that the software installed is more by 'pull' (user request) than 'push' (cluster dictating what is used). Clusters have to be very accommodating of user software.

If you are working in computational chemistry, I would be interested in hearing more details (if you are willing).


I appreciate your feedback. My question above is not about the mechanics of deploying commercial scientific software (it certainly somewhat varies between fields and organizations, though common themes and methods are pretty standard). It is about the procurement processes within national labs ecosystem. That is, about governmental gating / filters "in front of" vendors' go-to-market strategy and processes (RFPs, pilots etc.).

Re: more details - I'm not working in computational chemistry per se, but in the adjacent and closely related, as you understand, field of materials science and engineering. Will be happy to discuss my plans (within limits - currently very early and in stealth) through direct channels. You can connect with me on LinkedIn or shoot me an e-mail (see above).


We do visual GPU graph tech close to core national lab missions (sec, fraud, misinfo, genetics, social, ...) so are in a similar boat.

While I love the folks doing related work at PNNL (and am inspired!), I also recognize, done wrong, we can be existentially threatening to internal groups building related things.

Instead of aggressively pursuing contracts and spending effort getting into nasty politics, we let the users come to us: we are there when the internal tools groups would rather do other stuff than make the multi-year investment. We are now focusing a lot more on enabling researchers to find and try us on the public cloud / the web -- this road takes literally years longer to reach those teams, but these are government labs, so that's the default assumption to begin with. (Likewise, this is not the cornerstone of our revenue.) Years later, folks are starting to write RFPs using us.

As a scientist, I love the community, and respect that we should assist it appropriately.


Thank you for sharing your thoughts. If I understood you correctly (the text's meaning appears somewhat fuzzy), your go-to-market strategy is to first collaborate (and/or consult?) - perhaps, through your "PoCs and Rapid Prototyping" services - and then allow users to "pull" your capabilities and core service into their labs. Right?

Having said that, I haven't seen any science-focused channels, resources or offers on your website ...


I do not advocate a go-to-market that prioritizes these markets. We started more with enterprise/federal data architects, which involved a different go-to-market - not self-serve on a website.

We (and our broader market) are now at the point where it is easier for us to launch self-serve and help adjacent roles. Stay tuned :) This fits better with the national labs market too, afaict. We already discount on-prem quite heavily for edu etc., and I expect we will see that carry through here: bottom-up early cloud use -> discounted on-prem. As budgets are mostly hardware or small, and follow slow grant/annual cycles, it's hard to see it going another way.

Two relevant market trends:

- Federal: without quite special circumstances, it can easily take 3+ years for a new fed sales team to pay for itself

- Edu: small budgets to begin with

Better to target fast, happy-to-spend teams; as tech/sales/product gets easier, it becomes easier to help adjacent markets. Got burned whenever we didn't :)


I appreciate your additional insights. Will take time to parse it in detail. Just to clarify, my planned go-to-market strategy is hybrid (white-glove, typical for complex B2B solutions, and self-serve), since my target users - researchers and engineers - are from several major sectors (in the order of current priority): large-to-medium enterprises, national labs, small companies, academia (both research and education) - the last two are of ~ equal priority. Currently, I'm not too concerned about structuring the exact mix as well as tactical details of this strategy implementation - first, I need to validate my core ideas (from scientific and technological perspectives, not from the market validation perspective) and build a decent enough (for B2B enterprise) MVP. Easier said than done ... Oh well. :-)


I misunderstood your original question. When you asked how they procure domain specific application software I thought you meant "how do they get it?," which includes writing it themselves. (Not just procure as in "procurement" processes.)


No problem, we are on the same page now. Thank you, again, for your feedback.


didn't they EOL omnipath last year?


Yes but we still have many OP systems on the floor (see the link above).


Does it run Fortran?


Some, but LLNL codes are mostly C++, especially now that they need to make effective use of GPUs.

More on codes:

- https://computing.llnl.gov/sierra-supercomputer-dedication

Some repos of interest:

- https://github.com/LLNL/axom

- https://github.com/LLNL/RAJA


I'm slowly coming to the complete opposite opinion you seem to have.

I've worked almost entirely for companies that run services in various cloud infrastructures - Azure/Heroku/Aws/GCP/Other.

I recently started a tiny 1 man dev shop in my spare time. Given my experience with cloud services it seemed like a no brainer to throw something up in the cloud and run with it.

Except after a few months I realized I'm in an industry that's not going to see drastic and unplanned demand (I'm not selling ads, and I don't need to drive eyeballs to my site to generate revenue).

So while in theory the scaling aspect of the cloud sounds nice, the reality was simple - I was overpaying for EVERYTHING.

I reduced costs by nearly 90% by throwing several of my old personal machines at the problem and hosting things myself.

So long story short - Cost. I'm happy to exchange some scaling and some uptime in favor of cutting costs. Backups are still offsite, so if my place burns I'm just out on uptime. The product supports offline, so while no one is thrilled if I lose power, my customers can still use the product.

Basically - cost, Cost, COST. I have sunk costs in old hardware, it's dumb to rent an asset I already own.

There might well be a point, as I scale, where the cloud makes sense. That day is not today.


What's the time trade-off?

I've been drawing out my plans lately for a hobby project, all 100% on AWS.

Being able to spin up my entire infrastructure with Terraform, build out images with Packer, setup rules for off-site backups, ensure everything is secure to the level I want it, etc. - It takes me next to no time at all.
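
Even without Terraform, the spin-up/tear-down loop is tiny. A hedged boto3 sketch (the AMI ID is a placeholder, and this skips the VPC/security-group plumbing that Terraform would normally manage):

    import boto3

    ec2 = boto3.resource("ec2", region_name="us-east-1")

    # Spin up (the AMI ID is a placeholder, not a real image):
    instances = ec2.create_instances(
        ImageId="ami-xxxxxxxxxxxxxxxxx",
        InstanceType="t3.nano",
        MinCount=1,
        MaxCount=1,
        TagSpecifications=[{
            "ResourceType": "instance",
            "Tags": [{"Key": "project", "Value": "hobby"}],
        }],
    )
    instances[0].wait_until_running()

    # Tear down: the whole environment is disposable.
    instances[0].terminate()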

I can't imagine buying hardware, ensuring my home is setup with proper Internet, configuring everything here, and then still needing off-site backups anyway.

Now, keep in mind - I'm definitely coming in from a Millennial point of view. My entire career was built on cloud. I've never touched hardware apart from building a computer back when I was 15 or something. I understand virtual.

But being able to build up and tear down an entire setup, having it completely self-restore in minutes. Can't beat that.

Napkin math has me at ~$50/mo: Full VPC, private/public isolated subnets, secure NACLs and security groups, infinitely extendable block storage and flat-file storage, near-instant backups with syncing to a different continent, 5 servers, DNS configurations, etc.

All depends what you're doing too - of course. But for me, just the trade-off of working with what I know and not needing to leave my cafe of choice, still not breaking the bank - and if I do, having instant tear down and restore. Bam.


Do you think automation doesn't exist with on-prem? The vSphere terraform provider is very mature.

https://www.terraform.io/docs/providers/vsphere/index.html

You can build your own machine images with packer, and actually have more control by building them from scratch (ISO) rather than being locked into a selection of base AMIs.

https://www.packer.io/docs/builders/vmware/

You can launch HA Kubernetes clusters with Cluster API. vSphere, OpenStack, and even bare-metal PXE all have providers.

https://cluster-api.sigs.k8s.io/user/quick-start.html

Cloud native is not just for the public cloud!


> You can launch HA kubernetes clusters with cluster API. vSphere, openstack and even bare metal pxe all have providers

I've been looking for a decent solution for pxe and cluster-api. I've not pushed the button on any as I feel like they're all a little immature. Admittedly I've not had time to set up any in a test env yet. Any recommendations?


I'm not going to say that there aren't advantages, but when you start doing anything non-trivial with system configuration, you'll find that the "infrastructure as code" approach also involves a large amount of work. Having done plenty of both in my career, I don't think it's fair to say that IaC is at all less time-consuming than traditional system administration. The advantages it comes with can also pretty easily come out to a wash if you don't plan ahead to some extent.

I guess what I'm saying is that, just like "cloud means cost savings" doesn't necessarily pan out, "cloud means time savings" also doesn't necessarily pan out. The best strategy continues to depend on your strengths, the application, etc.


You can do IaC for bare-metal home servers as well. And that's coming from a 90s kid. We run a few physical servers for a college society with Proxmox on top, and all of our VMs are codified in Ansible.


I'm a Millennial as well but I started out on hardware.

You're not missing anything. There are still some use cases for on-prem hardware or for not building out on a single cloud provider but they're getting harder and harder to justify. The main use case I can see for running your own hardware is if you need massive egress. There comes a point where it can make sense to run it yourself, I luckily work at places with large enough margins to still leverage cloud solutions for this though.

Some guys don't know how to build things out on the cloud and are stuck in their ways. When I was getting into the industry, the old guys were still stuck on bare metal vs. VMs. You know how that went: most apps migrated to VMs no problem, and a few apps stuck to bare metal for their own reasons.

A lot of small/mid size companies don't have the manpower to properly move an app to the cloud. Sometimes products/companies have enough steam to build an app and barely support it but not quite enough to keep it competitive and current. Basically, it can be easy to get a profitable company/app started with a couple amateurs building the product. It can get very expensive when you need professionals. A lot of places don't have the margins to support a robust team of professionals.

These companies don't always lack the foresight, sometimes it's impossible to get some breathing room to see the big picture or to get project funding instead of just upkeep funding. It can be expensive migrating something to AWS/Azure/etc and not just money but also manpower.


What kind of setup did you have for 5 servers at $50/mo on AWS? Interested to know - our EC2 instances that are about 1/4 as powerful as a laptop cost $60+/mo


Certainly nothing powerful :-) I can get away with t3.nano and t3.micro for what I'm doing at the moment. But the beauty of cloud, is that I can scale up when I eventually need it.

5x t3.nano will be ~$25/mo; 5x t3.micro will be ~$50/mo.

All of my AMIs are EBS-optimized and require a minimum of 8GB for the root drive (although they only use ~1.6GB; not bothering to hack around this to save a buck). So that'll be 40GB of EBS block storage, plus I want ~20GB spread across 3 of the machines.

So EBS should be ~$6/mo.

I only need volumes on those last 3; the others are good to go with their base AMI or user-data init script. So I only need snapshot backups of ~20GB. Being priced incrementally and having minimal changes, I'll only be charged ~$1/mo for that, plus another $1/mo for the off-site copy.

So, currently experimenting with the t3.nano, the cost is ~$36/mo. One of these servers will be used as a personal VPN, and I expect ~75GB/mo coming from my laptop, so bandwidth charges of $9/mo.

Total $45/mo - For what I have planned now, at least.
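
Tallying those line items in a quick sketch (figures as stated above, not re-priced against current AWS rates; it lands within rounding of the ~$45 quoted):

    def monthly_total(instances_usd):
        ebs, snapshots, vpn_egress = 6.0, 2.0, 9.0   # as estimated above
        return instances_usd + ebs + snapshots + vpn_egress

    print(f"5x t3.nano:  ~${monthly_total(25):.0f}/mo")   # ~$42
    print(f"5x t3.micro: ~${monthly_total(50):.0f}/mo")   # ~$67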


That's exactly the reason I gave up on AWS, I need an accountant to do the math every month :D

Now I rent a 4GB Linux box for $5/mo with no Docker or anything, and I'm happy that it just works.


I also hate this complexity with a passion. I love cloud, but pricing can be a real nightmare.

I don't use AWS specifically, but when I need to know the price of some cloud service or group of services, I spin up the service (or services) in a brand-new project and let it run for 24 hours under a working environment similar to production to see the impact. Then, after checking the results (the breakdown of each service's price for that day), I just close the project entirely, no leftovers.

So I tend to successfully avoid those strange, terribly organized, cloud-specific, service-specific calculators, where I can easily forget one aspect of the service that might cost a lot of money absolutely randomly.

Obviously it is a bad strategy if things are expected to reach $200/month and/or you do 'price evaluation' frequently, but otherwise it is stupid easy. I've barely spent $50 a year doing this (small company and sporadic system changes).

But the best part is that the final daily price of your system is as precise as it can possibly be and that is worth something.
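
For the curious, the probe flow looks something like this on GCP (the project name is a placeholder; the same pattern works on any provider with per-project billing):

    gcloud projects create cost-probe-may2020   # fresh project = clean billing line
    # ...deploy the service(s) under test, drive ~24h of realistic load...
    # ...read that day's per-service cost breakdown in the billing console...
    gcloud projects delete cost-probe-may2020   # close it down, no leftovers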


And if you went with reserved instances, what does it come out to? I guess that it is fair to assume that you will need at least this level of compute power for at least a year, right?


No idea. Generally reserved instances will give you 30-75% off though, I believe.

This is for a hobby project, so no way I'm committing to that, haha.

I'm also building the architecture so that I can easily completely destroy it and restore it from backups very quickly.

So, depending on what I do with it, I may destroy it regularly and just bring it online when I want it.


Cool - I should give the micro instances another shot - I wonder how big of a Postgres database they will be able to handle.


If you have a small hobby project you could use AWS Lightsail[0] which has starting prices of $3.50/mo

https://aws.amazon.com/lightsail/pricing/?opdp1=pricing


Thanks! Aware of Lightsail - But I'm experienced with AWS so have no trouble with, and prefer, the full feature set. I would've gone with Digital Ocean, otherwise.


Out of curiosity, what are you missing at Digital Ocean? It seems it has all the basics (VMs, managed DBs, storage, load balancers), and for a better price than the bigger clouds.

I find it quite easy to mix services from multiple providers. For example, add SQS or S3 from Amazon, and Elasticsearch from elastic.co, if your cloud doesn't have those products. Latency mostly only kills the connection from a web app to its database.

The biggest drawback is multiple invoices, but I think it's manageable.


Consider bare metal at e.g. hetzner.de - they handle everything hardware-related, but the cost is still way lower than AWS.


>Being able to spin up my entire infrastructure with Terraform, build out images with Packer, setup rules for off-site backups, ensure everything is secure to the level I want it, etc. - It takes me next to no time at all.

Interesting. For me, that equation is the opposite. Learning cloud technologies takes a long time. It depends a lot on what skillset you already have.
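
For context, the workflow the quote describes boils down to a handful of commands once the configs are written (the template filename is a placeholder):

    packer build server.json   # bake the machine image
    terraform init             # fetch providers and modules
    terraform plan             # preview the infrastructure changes
    terraform apply            # create or update everything
    terraform destroy          # tear it all down again

The learning cost is in writing the configs those commands consume, which is exactly the skillset question raised here.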


> Backups are still offsite, so if my place burns I'm just out on uptime

There are plenty of exceptions, but I generally find uptime is given way too much consideration for the value it gives. Most companies up to the national level generally only require about 12 hours a day of uptime from most of their software, more than that is a nice to have, but the company won't collapse if you turn off the servers overnight because no one is working anyway. Even a whole day outage of a critical system can just mean that work has to be delayed for a day and everyone has to work a bit harder the next day. Then as you move down the scale of importance outages can be much longer before people will even notice.

I've seen plenty of "5 nines" style requirements put forth but few that truly need it and even fewer that have considered the costs imposed by it.


Yeah, I find the constant focus on ~100% uptime a bit crazy.

I used to work for one of Australia's largest bookmakers and we would regularly switch off the datacenter overnight to do some work on the servers. It didn't seem to impact the growth of that business one bit. There were times of course where uptime equalled cash; it just wasn't all the time.

For my personal hobby projects, I would actually rather have downtime than a surprise bill if something gets popular.


> I reduced costs by nearly 90% by throwing several of my old personal machines at the problem and hosting things myself.

This is what's pulling me back toward semi-static websites. Do most of the write traffic on premises and push the results to the cloud.

Only the part of your application that needs to be always-on should go there. If you treat the entire thing that way, it'll cost you. I've seen a couple of read-mostly applications that are so divorced from this fact that the paths that should be the cheapest are some of the most expensive, while the ones that happen at the speed of human interaction (e.g. editing things) are fast. They are in effect backward and inside-out.


IMHO the optionality the cloud gives you is the main value at the low end. You can play around with different architectures, mixing and matching managed services with no commitment. When you need to scale you can always switch to on-prem, with the benefit of having a working system to base your hardware requirements on. Of course this migration has costs too, but if this is your strategy you can minimize cloud-specific dependencies and leverage automation that works in both environments (e.g. Ansible).


I pay for one virtual private server instance. I used to colocate several machines, but a VPS is just as good for smaller setups.


University research group here.

Simply, _cost_

Our compute servers crunch numbers and data at > 80% util.

Our servers are optimized for the work we have.

They run 24/7 picking jobs from queue. Cloud burst is often irrelevant here.

They deal with Terabytes or even Petabytes of moving data. I’d cry paying for bandwidth costs if charged €/GB.

Sysadmin(yours truly) would be needed even if it were to be run in the cloud.

We run our machines beyond 4 years if they are still good at purpose.

We control the infra and data. So, a little more peace and self-reliance.

No surprise bills because some bot pounded on an S3 dataset.

Our heavy users are connected to the machines at a single hop :-) No need to go across WAN for work.


In Germany it's pretty common for universities to have some servers of their own.

1. Their use case is kinda different. The servers mostly run heavy CS research related stuff. E.g. they might have heavy CPU load and heavy traffic between their servers, but they less often have heavy traffic to the "normal internet" (and when they do have heavy traffic to the outside, it's usually to other research institutes, which often have dedicated links).

2. They might run specifically optimized CPU- or GPU-heavy compute tasks that go on for weeks at a time. This is really expensive in the cloud, which is mostly focused on things like web services.

3. When the research groups aren't running such tasks, they want to allow their juniors to run their research tasks "for free", which wouldn't work with a payment model like the cloud's.

4. They don't want to rely on some external company.

Also, I'm not sure there are even (affordable) cloud systems with comparable specs (like 4+TB of RAM; I'm not kidding, this is a requirement for some kinds of tasks, or else they take way too long or require additional complexity from special data structures that handle partially-offline data correctly, which can be very costly in dev time).


UCSD here. The CS department building has a large room off in the back filled with cages where we rack up our own machines. There's an internal process by which you can request VM allocations, which ultimately also run on top of a set of on-prem servers. So we have both bare metal and a kind of mini managed cloud to work with, without having to go through purchasing approval just to run a fuzzer or do a web crawl.


It's not just CS. The computational chemistry and materials science crystallography folks can have jobs that run for days or weeks too.


I'm at a center for computational biology - our genomics guys have been known to use 90% of our university's HPC capacity ;-) My own work (ecological modelling) is not as heavy, but when I run a full experiment, that takes a 32 core machine about two weeks to complete.


AWS has systems with up to 24TiB of RAM (u-24tb1.metal) [0], but the pricing is "call us". For almost 4TiB of RAM (x1e.32xlarge), it's about $28/hr [1]

[0] https://aws.amazon.com/about-aws/whats-new/2019/10/now-avail... [1] https://aws.amazon.com/ec2/pricing/on-demand/
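
To put that hourly rate in research-workload terms (these machines run 24/7):

    echo "28 * 24 * 30" | bc   # ~$20160/mo, before storage and egress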


For a researcher, that means running an experiment changes from "running the experiment" to six months of meetings with higher management establishing business cases, budget justifications, getting competitive quotes from different providers, etc. (in other words it just doesn't happen).

The choice is not between "on premises" and "cloud", it's between "on premises" and "our glossy faculty brochure says you can do this but really you can't".


This struck a nerve with me. I was a SWE at a company that moved to the cloud last year, after having a couple racks at a local colo previously. I was able to experiment and iterate so much faster when we owned hardware, despite all the cloud's elasticity. After we migrated, everything was request, wait, explain, request, wait...

I know some organizations handle this really well, but when mine didn't, it sucked to be a developer there.


It’s extremely common for academic research groups to have their own hardware. Often, it doesn’t make sense from an overhead/maintenance point of view. If you’re a single lab large enough to need dedicated compute, you often don’t have the budget for a good admin. If you’re part of a larger group (like it sounds like you are) it is easier and you can afford some of the extra overhead.

But, I’ve found that there are two reasons why there is still a lot of on-premise academic compute: 1) legal, 2) cap-ex.

For the first, there are still many agreements for data sharing that require strict security compliance. It is certainly possible for this to work on the cloud, but it is more difficult than setting up a hardened cluster that’s not exposed to the internet. IT is normally better setup to audit and approve these on-premises systems than cloud systems.

For the second, it is difficult to estimate and write in all of the cloud costs for a grant budget. Trying to manage op-ex can be more of a hassle than just budgeting X amount for “servers” (as a cap-ex). Often, you just care about getting the job done, but less about how long it takes. So, a set capital expense has lower risk than underestimating your cloud needs and not being able to finish your project. (Also, as a bonus, once the equipment is paid for, you get to keep it to use on the next project, or unfunded research).


When I was in academia I was stuck using the HPC facilities at my institution. I was lucky to get about 30 cores, and my jobs took many months to get through the queue and run (overall I used about a decade of CPU time).

I'm quite sure that the 2 months of my time spent waiting for results (and the impact on colleagues from my occupying their HPC cluster) was worth more than the $9k or so it would have cost to rent enough VMs to get my results in about 4 hours instead.

That's not counting the hours of machine time, and my own time, wasted building enough support for the awful queueing and execution environment it used, which was almost impossible to mock up locally.


Can you share the architecture and stack?

What form do these jobs have?

How do you manage workloads?

How do you manage resources? Do users have quota for compute and storage?

Do you use GPUs? If so, how do you deal with malfunction?

Is this a distributed processing? Are the machines heterogeneous? What do you use for that cluster?

What if a job requires dependencies? Do you create a compute environment on the fly or do all jobs have the same dependencies and these don't change much?

How do you do data governance? Is the data read only? Do you have an API to fetch the data from the job code?


I wrote about it here:

https://aravindh.net/post/sysadmin/

> What form do these jobs have?

Mostly batch jobs written as bash scripts. Occasionally, some users run singularity containers. But, all through SLURM.
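
For readers who haven't touched SLURM: a batch job is an ordinary shell script with scheduler directives in comments, submitted with sbatch. A minimal sketch (job name, resources, and script are all invented):

    #!/bin/bash
    #SBATCH --job-name=align-sample   # hypothetical genomics job
    #SBATCH --cpus-per-task=16        # cores requested
    #SBATCH --mem=64G                 # memory requested
    #SBATCH --time=48:00:00           # wall-clock limit
    srun ./run_alignment.sh sample.fastq

SLURM holds the job in the queue until the requested cores and memory free up, which is how the >80% utilization mentioned upthread happens.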

> How do you manage workloads?

SLURM

> How do you manage resources?

As a sysadmin, my inventory is via Ansible. All activity on servers happens via Ansible only.

> Do users have quota for compute and storage?

Yes. Users can typically use 72 cores and 512GiB of memory at a time; the rest is queued until resources are released. Disk quota applies only to home directories: 400GiB (enforced by ZFS refquota).
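
The ZFS side of that is a one-liner per user (the dataset name below is a placeholder):

    zfs set refquota=400G tank/home/alice   # cap at 400GiB, excluding snapshots
    zfs get refquota tank/home/alice        # verify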

> Do you use GPUs? If so, how do you deal with malfunction?

No; weirdly, our workloads (genetics and genomics) don't fit GPUs very well, as they are sparse matrix walks with wide-precision floats. But we plan to try them for some other stuff soon.

> Is this a distributed processing?

No. Jobs run one node at a time.

> Are the machines heterogeneous?

Yes, Intel-based servers all the way from Haswell to Cascade Lake.

> What do you use for that cluster?

SLURM

> What if a job requires dependencies?

Taken care of by SLURM.

> Do you create a compute environment on the fly or do all jobs have the same dependencies and these don't change much?

I guess you mean the software libraries and tools that jobs use? If so, our central software repo is NFS-mounted on all compute nodes. Users can install things they need if admin privileges aren't required; if they are, it's either Singularity or an email to me.

> How do you do data governance?

This is the painful and human oriented task. We lock down data transport to outside world, educate the users about data policies and then spend a lot of time looking at data flows with hope.

> Is the data read only? Do you have an API to fetch the data from the job code?

Sorry, I could not understand this question. Do you mean metadata about a job?


I work in a kinda similar environment, if bigger.

To manage runtime dependencies we use cvmfs, a read-only distributed filesystem, kinda like a very specific NFS for software distribution.

It has never been a problem; cvmfs is quite stable and widely used in our field.


Thank you for the reply and the post. We're running JupyterLab for some students, and when several of them are training models, they consume GPU memory.

Now, we thought of several ways of solving this. We have a branch that uses Kubernetes but this is not the one that is deployed.

We know about SLURM and this is something we will support for a coarse granularity. i.e: notebook level jobs. This is a common scenario and we'll need it for our AppBook (we allow users to turn a notebook into an application with one click and we automatically generate the form fields for the features, and the API endpoint for the model).

However, I'm interested in finer granularity computation; as in: I want a job for the cell. This involves looking into Jupyter kernels to circumvent the front-end<-->kernel disconnection (we'll have to do that for another use case in low internet bandwidth scenario).

> Sorry, I could not understand this question. Do you mean metadata about a job?

How do your users upload data and manage data access? Is the data mutable and versioned? What version of the code ran on which version of the data, who changed that data, etc.?

How do users access the data from a Jupyter notebook? Is the data in different object storages? Do you proxy the requests and handle access, things like that.


Yeah, cloud dev-ops and HPC are so similar and still oh-so different (I run a smaller HPC cluster for genomics). When reading these questions, someone with an HPC background would probably already know the answers (or just accept SLURM as an answer for everything).

With respect to the original questions -- I've investigated running a full SLURM cluster on the cloud and it just never seems worth the effort. For larger clusters than mine, maybe, but then when you start to hit the levels you're talking about, I just don't see the point in moving to the cloud. It would be hard to hit that sweet spot where the costs would make sense. Amazon even has a series on scaling a SLURM cluster with AWS EC2 provisioning, but it just seemed like more work than was justified for us.


See reply above[^0]. I'm interested in Jupyter cell level job granularity for "MLOps".

[^0]: https://news.ycombinator.com/item?id=23129417


The article says you're in Aarhus, so I assume you work at Aarhus university? I'm a student from Denmark myself, and I'm curious, which departments mainly need the compute power?


Great article, thanks for sharing :)


I used to work in an environment not dissimilar from what's described above. The uni has some hardware details here: https://in.nau.edu/hpc/details/. Many of my answers would be similar, although the presence of physics, astronomy, and a few other areas motivates GPUs these days apparently. I was doing genomics workloads in the 900-1200GB RAM range with 30+ cores.

There's a pretty interesting NSF-wide project for managing clusters in a more commoditized way as part of https://www.xsede.org. You might describe it as a "private heterogeneous almost-cloud" in its goals? That might be saying a bit much.

Density, latency, storage throughput, etc. were in favor of DIY (plus the pricey professionals to run it) rather than cloud offerings. When I was there (2016) I did some basic math for being able to use a cloud provider for some of our lighter workloads when the local cluster was loaded down. Astronomical without a contract, which is quite the thing to set up, etc.

Worth noting that while they do alright, NAU (first link) is hardly a top-tier university with bleeding edge technical requirements.


Fascinating blog post of yours, thanks for sharing it!


Interesting. How many machines do you have, roughly? Are you the only sysadmin for this? Also, do you have any public facing services hosted from that infrastructure?


I am the only one administering this cluster.

We have approximately 1000 cores across 45 servers.

A few public facing services - web servers, small APIs, a web front end tool for a big genetics database. Nothing big.

Public services are cordoned off from the compute cluster.


We had a similar setup at our University as well. Execs decided we needed to be a part of school-wide clusters though, and are now trying to retire even that computing hardware and push us to the cloud without covering any of the costs.


Meta-comment:

cost: https://news.ycombinator.com/item?id=23098576

cost: https://news.ycombinator.com/item?id=23097812

cost: https://news.ycombinator.com/item?id=23098658

abilities / guarantees: https://news.ycombinator.com/item?id=23097213

cost: https://news.ycombinator.com/item?id=23090325

cost: https://news.ycombinator.com/item?id=23097737

threat model: https://news.ycombinator.com/item?id=23098612

cost: https://news.ycombinator.com/item?id=23097896

cost: https://news.ycombinator.com/item?id=23098297

cost: https://news.ycombinator.com/item?id=23097215

That's just the in-order top comments I'm seeing right now. (please do read and upvote them / others too, they're widely varying in their details and are interesting)

The answer's the same as it has always been. Cloud is more expensive, unless you're small enough to not pay for a sysadmin, or need to swing between truly extreme scale differences. And a few exceptions for other reasons.


There is also another answer... I work for a CDN, so we can't really use the cloud when in many ways we ARE the cloud.

Although we do often make jokes about "what if we just move the CDN to AWS?"


It is a pity East Dakota won't make jokes like these. Can you imagine Cloudflare running on AWS? What happens when someone tries to denial-of-service them while on AWS?

On a different note, Netflix still runs “on the cloud”, right? I mean what does it really mean? Dropbox can still have most of its stuff on aws and do the expensive part on premises if cost is a concern?

The truly bizarre stuff happens at hybrid cloud.


Netflix runs its own bandwidth/CDN. Sometimes it actually has a PoP/box INSIDE your ISP: https://openconnect.netflix.com/.


My understanding is that the cloud handles the website, the user services such as authentication, and the heartbeat (not sure of the proper technical term, but the thing that tracks where I am in a particular episode). That, and internal apps like the project tracker, not to mention dev/test.

At least in my imagination. At my work, I'm not even worth throwing an SSD into my work computer. My manager is powerless to help, as the company has some kind of deal to only buy from HP? No idea what kind of glue procurement is sniffing at this company...


We have around 20 servers in a colo center down the street.

At this number of servers we can still host websites that have millions of users (but not tens of millions). They are not exotic servers either. In fact, by now they are, on average, around 11 years old, and they cost anywhere from $2k to $8k at the time of purchase. Some are as old as 19 years. Hell, when we bought some of them - with 32GB of memory each - AWS had no concept of "high memory" instances and you had to completely pay out your ass for a 32GB server, despite RAM being fairly cheap at the time.

We have no dedicated hardware person. Between myself and the CTO, we average maybe a day per month thinking about or managing the hardware. If we need something special setup that we have no experience in, we have a person we know that we contract, and he walks us through how and why he set it up as he did. We've used him twice in the last 13 years.

The last time one of us had to visit the colocation center was months ago. The last time one of us had to go there in an emergency was years ago. It's a 5 minute drive from each of our homes.

So, why exactly should we use the cloud? We have servers we already paid for. We rent 3 cabinets - I don't recall the exact cost, but I think it's around $1k per month. We spend practically no time managing them. In our time being hosted in a colo center - the past 19 years - we've had a total of 3 outages that were the fault of our colo center. They all lasted on the order of minutes.


I think people who have no experience managing servers dramatically overestimate how much time it takes to manage servers. Depending on your team, it can definitely be easier to manage your own hardware than to manage your cloud infrastructure.


There has also been an incredible propaganda campaign by the cloud providers to make this seem incredibly hard.


Agreed, it's just become part of the group think of our times.


The only hard part is finding a provider who won't screw you over. But then again, that's the same as when you buy a car.

You buy rack space, they give you a network cable and IP details. You configure the OS with that IP and you're online.


In my experience, it's not the time required but that a lot of development teams don't have a sysadmin or ops skillset.


I live in a software engineering world professionally but my background is in traditional "neckbeard" Linux system administration. This ends up making me "DevOps" but honestly a lot of what I've ended up doing in my career is basic sysadmin for organizations that get a remarkably long ways before realizing they need it - things like telephony and video surveillance become really unreasonably expensive when you end up relying on a cloud service because you don't have the skillset to manage them in-house.

This is purely my opinion, but I think that 1) there is a strange shortage of IT professionals (people who are not software engineers but instead understand systems) in much of the industry today, and 2) a lot of tech companies, even those that are currently well functioning, might be able to save a lot of money if they hired someone with a conventional IT background. This is a little self-serving of course, but it really does astound me when I see the bills that some companies are paying cloud services to do something that is traditionally done in-house by an IT department. And not everything can readily be outsourced to some "aaS" provider, so on top of that you end up with things like software companies with multi-million budgets running an office network that consists of a consumer WiFi router someone picked up at Fry's - not realizing that they are losing a lot of time to dealing with how poorly that ends up working.

I think part of the problem rests in academia - at least in my area a lot of universities seem to have really backed off on IT programs in favor of CS. I went through an undergraduate program that involved project management, decision analysis, and finance courses because these were considered by the college (I would say accurately) critical skills for the IT field. But that program had an incredible two students and was widely considered inferior to the CS program with hundreds.

Another part of the problem though seems to rest in industry. The salary differential between "DevOps Engineer" and "IT Analyst" is incredible when in practice they end up doing mostly the same thing in a lot of small orgs. So I end up walking sort of an odd line of "I have a long background in IaC but I also know about conference room equipment." And I'm not saying that everything with a Cisco/Tandberg badge isn't overpriced, but Zoom rooms can end up costing just as much and seem to be less reliable - not surprising for a platform which, by practical necessity of the lack of IT support in many orgs, is built on the Silicon Valley time-tested architecture of "five apple consumer products taped together."


From my experience, large enterprises sabotage the effectiveness of internal IT with bureaucracy and politics in a misguided attempt to eliminate all possibility of mistakes being made.

It's usually done with the "let's pretend it is ITIL" process.

Let me give two examples where if I had been the client then I would absolutely have sprinted for the cloud if I could, or at the very least start talking it up as much better.

1) System outage, time to fix 5 hours and 3 minutes. The 5 hours was me sitting in front of my computer with screens open showing the problem and waiting for various managers/decision-makers to fly by and take a look as they were ping-ponging around the office panicking about what would be impacted by the fix. Everything that was going to get impacted was already impacted by the system not working, and I had to explain that to them multiple times. Towards the end of the day, I eventually got the go-ahead to do the 3 minutes of work to fix the system. This system being down had prevented another team from doing any work for the entire afternoon.

2) Two full days of politics and paperwork to get approval to do 30 minutes of work, all while the client was impatiently asking "is it done yet" every few hours.


This is accurate in my experience. Also keeping things up and being there in case shit hits the fan is a full-time job. You can't write features and also manage servers equally well. Unless you have no life I guess.


Totally. I've had personal colo'd servers for 20 years at this point. But I'm tired of knowing that at any point I might have to wake up and haul my ass down to San Jose to swear at some piece of failing gear. I'm excitedly moving it all into the cloud.


There is a middle ground. There are smaller companies where you can rent a bare-metal server, usually with an unmetered connection (priced by port bandwidth rather than transfer). There is 24x7 on-call support to replace any failing part when you call them. They can build custom servers and also give long-term or bulk discounts.

I have a good experience with [1] (a smaller local company) and Hetzner [2], a bigger provider. Compare the prices with cloud. Especially if you need something RAM intensive.

[1]: https://www.superhosting.net/dedicated-servers

[2]: https://www.hetzner.com/dedicated-rootserver


How far are you from your colocation? Is moving it closer an option?


Colocation in SF proper is very spendy, which is how we ended up in San Jose. But waking up to go anywhere at 3 am to swear at gear is no longer on my list of fun activities. And there's no colo that will move the gear along with me when I go on vacation.


> there's no colo that will move the gear along with me when I go on vacation

This is actually the main one for me. I've managed our own servers for a decade with almost zero downtime, and very little time spent at the colo. But you cannot safely go out of town without having someone else around who is familiar enough with your setup to deal with an outage.

So moving stuff to AWS now almost entirely for that reason.


Learn it. Just like a new framework.

It's really not that hard.


We have about 170 servers in a colo about 30 minutes from my home. I can go months without going to the colo. Usually when I go out there it is to replace a disk.

We have a bunch of hardware coming up on end of life. I looked at moving us to cloud. The new hardware will cost about what 3 years of AWS would cost. Considering that the hardware I am replacing is over 6 years old, we are better off sticking with on-prem.


Yep. A lot of issues come from legacy products and configurations. You can do a greenfield infrastructure with OpenStack, VMware, or oVirt/RHEV and it's a pleasure to work with.


Do you have any resources you could link that go through buying and setting up servers?


Counterpoint: The OP is saying they spend 1 day a month managing servers. It's ridiculously little. The servers might as well be unmanaged.

That's not enough time to keep OS or software up to date, or to monitor hardware usage, or to replace failing hardware and deal with spare parts.


This is just not true. A properly set up system requires very little manual work. Updates, monitoring, and alerting are all automated.

I help manage ~50 on prem servers plus NAS, switches, etc. Over the past three years, I have had to replace one hard drive and one power supply. And neither of those brought anything down because of redundancy. We get alerts when something goes wrong, but that is a rare occurrence.

This is my original point: servers won't just explode the minute you look away. It is not a full time job to "manage" them, unless you are at a real datacenter scale where a 1/1000 event happens twice a day. You set them up and then they continue to do what they do for a very long time.
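
As one concrete example of the "set up once" automation: on a Debian-family box, unattended security patching is two commands, and monitoring/alerting is similar one-time work with whatever agent you prefer:

    sudo apt-get install -y unattended-upgrades
    sudo dpkg-reconfigure -plow unattended-upgrades   # enables the periodic upgrade job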


Yes. Why? Cost, availability, flexibility, bandwidth. For a lot of companies, on-prem servers are the best solution for efficiency and cost.

One great example. We were paying $45k/yr for a hosted MS Dynamics GP solution. For $26k we brought it in-house with only a $4k/yr maintenance fee. We bought a rackmount Dell, put VMware on it, and have an app VM and a DB VM. My team can handle basic maintenance. In the past 11 months we haven't had to touch that server once. We have an automated backup that pulls VMs out daily and sends them off to Backblaze. Even if we need to call our GP partner for some specialized problem, it's not $45k/yr in consulting costs.

We had a bunch of Azure servers for Active Directory and a few other things. When I came in 2 years ago I set up new on-prem DC VMs and killed our absurd Azure monthly bill; we were saving money by month three. A meteor could take out Atlanta and the DCs at our satellite offices would handle the load just fine until we restored from backups, and we'd STILL save money. We've had MORE uptime and reliability since then too.

If I have a server go down, we have staff to get on it immediately, no toll free number to dial, no web chat to a level 1 person in India, etc.

Our EMR is hosted, because that's big enough that I want to pay someone to be in control of it, and someone to blame. However, there have been many times when I've been frustrated with how they handle problems, and jumping from one EMR to another is not easy. And in the end they're all bad anyway. Sometimes I DO wish we were self-hosted.

The Cloud is just someone else's computer. If they're running those machines more cheaply than you are, they're cutting out some cost. The question is, do you need what they're cutting?


> The Cloud is just someone else's computer.

Well, not if it's a private cloud, and only partially if it's a hybrid cloud. Cloud is dynamic provisioning; public cloud is cloud that also runs on rented hardware.

> If they're running those machines more cheaply than you are, they're cutting out some cost.

Economies of scale are a real thing.


> The Cloud is just someone else's computer. If they're running those machines more cheaply than you are, they're cutting out some cost. The question is, do you need what they're cutting?

They're cutting overhead and getting better deals on hardware than you could ever get.

Their efficiency is their profit margin.


> Their efficiency is their profit margin.

Last time I checked, AWS had a profit margin in the 40%-50% ballpark.

Sorry, but the semiconductor industry doesn't operate with any kind of markup that would allow such profit margins from "getting better deals on hardware". The only one able to make that kind of profit used to be Intel on high-end server CPUs, and even they are now pressured by AMD and custom ARM silicon options. Anything else needed for a server, RAM or flash chips or whatever, is usually selling on thin single-digit margins.

Cloud provider profit margin is perfectly logical and explainable through lock-in effects keeping their customers paying big markups to stay in AWS infrastructure. Be it software that was built against AWS proprietary services, be it having the necessary engineering skills to manage AWS infrastructure in the team but lacking the skills to manage on-prem hardware, be it the enterprise sales teams of cloud operators schmoozing CTOs of big corporations and making them jump on a "going into cloud" strategy as some kind of magic bullet to future-proof their corporations' IT, be it the psychological effect that makes "using the cloud" apparently a mandatory thing to be "cool" in today's Silicon Valley culture, and therefore by extension the whole world's IT engineering culture.

The most ironic of all is this weird effect that drives people to rationalize these things, writing comments like yours, because nobody likes to admit they've painted themselves into a corner of lock-in effects. And of course there's the irony of all this being history repeating itself: anyone still remember when IBM dominated the IT industry?


> that would allow such profit margins

The percentages don't quite reach that amount of discount, but they are much, much higher than I (at least) expected.


This person is not paying the overhead of developing and running the 500th AWS service with a confusing name that they will never need. Just saying 'economies of scale' is a banality; everyone past school age knows about them.


> Their efficiency is their profit margin.

Then why does outbound bandwidth cost so much more with AWS than smaller providers?


Mostly, headspace. If I run my own server, I just need to apply my existing Ubuntu sysadmin knowledge. If I use AWS, I have to learn a whole load of AWS-specific domain knowledge, starting with their utterly baffling product names. My time is more valuable than that.

Also, sheer cost. Literally everyone I know in my particular part of the industry uses Hetzner boxes. For what I do, it’s orders of magnitude cheaper than AWS.


I agree. I remember when I learned to use Google Cloud for ML training; it took so much time to learn how to set everything up and work around all their gotchas. I then tried getting started with AWS, but their poor interface and unfamiliar grounds made me stick to using plain VPSs.


We're a small business and hosting costs are killing us. Spending $300+ on a small VM that I can only describe as measly really hurts.


I have recently put together a K8s cluster out of old and bargain-basement hardware for hosting personal projects. Equivalent performance from a big cloud would cost me $400+.


Similar situation here but on a bandwidth level. A 1Gbps leased line costs me ~400 bucks a month and is unmetered even if I keep it saturated for the entire month. I don’t even want to imagine the cost of that on a cloud provider.


What the heck are you running? Ouch.


When you factor in storage, backups, and all the little support costs, things start to get pricey. Just the DB can go from 500GB to a TB. Then all of that data came from somewhere and needs to be stored, at least for a while.

Never mind the load balancing. And then we need another server for a different client, and another, and a demo server, and test & staging environments. Ugh. I wish this were on-premmable. We could probably hire an employee for the amount we could save. He'd probably be pulling his hair out supporting it tho. lol

My little on-prem HP dev server's got a hundred gigs of RAM and 8TB of fast storage. We run a bunch of VMs on there. We paid a couple of thousand dollars for it. And I don't even want to think of what it would cost to ship those VMs to the cloud, never mind how poorly they would perform.


That’s how you get old, when your time is more valuable than a massive shift in technology.


Nah, we already did mainframes in the 1970s. Renting CPU time only makes sense if you don’t need CPU time or you like wasting money.


Whether some new technology is worth learning and using depends on your circumstances: would you gain more than it would cost? Understanding this isn't about being old, it's about being wise.


It's a common HN fallacy that everyone here is a developer whose main life aim is to remain employable as a developer.

Lots of us are here to ship stuff; to provide viable products that customers will like.

Moving to AWS from Hetzner would not help me ship stuff. Moving to AWS from Hetzner would not result in a better product.


Try running any service with an average egress exceeding 10 Mbit/s then tell me cloud still makes sense. By the time you reach 1 Gbit/s the very idea of it is enough to elicit a primitive biological defensive response.

We don't do on-prem but we do make heavy use of colo. The thought of cloud growth and DC space consolidation some day pushing out traditional flat rate providers absolutely terrifies me.

At some point those cloud premiums will trickle down through the supply chain, and eventually it could become hard to find reasonably priced colo space because the big guys with huge cash-flush pockets are buying up any available space with a significant premium attached. I don't know if this is ever likely, but growth of cloud could conceivably put pressure on available physical colo space.

Similar deal with Internet peering. There may be a critical point after which cloud vendors, through their sheer size will be able to change how these agreements are structured for everyone.


Netflix runs on the cloud and does 30% of all internet traffic.

That being said, 99% of that traffic is served from servers in colos now, but 10 years ago it was all served from CDN providers like Akamai, which is just a specialized cloud.


This is kind of a big caveat (“Netflix is in the cloud but almost none of the work is done there”), and something I have to mention to non tech decision makers when they say “but Netflix!”. I even have a slide for presentations just for this (“You Are Not Netflix”).


99% of the work is done on the cloud. What comes off of those colo servers is literally just bits streaming from disk to network. There is no transformation or anything. No authentication, no user accounts, no database. Nothing.

Just static files served efficiently.


> What comes off of those colo servers is literally just bits streaming from disk to network

So... the core of their business?


Not at all. It was so “not core” that it was outsourced.

The core of their business is recommendations, encoding, and authentication. All of those are done 100% on the cloud.


Sure but probably 99% of the

> 30% of all internet traffic.

is outside of the cloud.


Anyone know the bill Netflix has for running on the cloud?


For models where static files need to be served efficiently, using cloud compute and aggressive CDNs works great. For any model where you've got volatile data that can't be cached (gaming, VoIP, etc.) you're out of luck.


For sure. There are absolutely use cases where the cloud doesn't make sense.

Most businesses out there aren't those kinds of businesses.


I'm having trouble understanding what point is being made. Netflix's entire business is shipping bulk data; I imagine the cost of transcoding and media management is a rounding error in that grand scheme.

Netflix appears to support what I was saying rather than negate it


The netflix CDN specifically exists because of cloud bandwidth charges.


That’s not true. It exists because Akamai and friends couldn’t serve the data fast enough.


That's very significant, nevertheless. For workloads that are not diverse in terms of software, but simply burn CPU, RAM, and network, cloud is not cost-effective.


Do you really think Netflix gets the standard billing prices? I very much doubt it.


I don't think most of the replies realize that you might be pretty familiar with Netflix's infrastructure.


Why stay on premise?

Cost. On-prem is roughly on-par in an average case, in my experience, but we've got many cases where we've optimized against hardware configurations that are significantly cheaper to create on-prem. And sunk costs are real. It's much easier to get approval for instances that don't add to the bottom line. But for that matter, we try to get our on-prem at close to 100% utilization, which keeps costs well below cloud. If I've got bursty loads, those can go to the cloud.

Lock-in. I don't trust any of the big cloud providers not to jack my rates up. I don't trust my engineers not to make use of proprietary APIs that get me stuck there.

Related to cost, but also its own issue: data transfer. Both latency and throughput. Yeah, it's buzzwordy, but the edge is a thing. I have many clients where getting processing in the same location where the data is being generated saves ungodly amounts of money in bandwidth, or where it wouldn't even be feasible to transfer the data off-site. Financial sector clients also tend to appreciate shaving off milliseconds.

Also, regulatory compliance. And, let's be honest, corporate and actual politics.

Inertia.

Trust.

Risk.

Interoperability with existing systems.

Few decisions about where to stick your compute and storage are trivial; few times is one answer always right. But there are many, many factors to consider, and they may not be the obvious ones that make the decision for you.


Cost and Latency.

My team and I run the servers for a number of very big videogames. For a high-CPU workload, if you look around at static on-prem hosting and actually do some real performance benchmarking, you will find that cloud machines - though convenient - generally cost at least 2x as much per unit of performance. Not only that, but cloud will absolutely gouge you on egress bandwidth - leading to a cost multiplier that's closer to 4x, depending on the balance between compute and outbound bandwidth.

That's not to say we don't use the cloud - in fact we use it extensively.

Since you have to pay for static capacity 24/7, even when your regional players are asleep and the machines are idle, there are some gains to be had by using the right blend of static/elastic: don't plan to cover peaks with 100% static capacity, and spin up the elastic machines when your static capacity is fully consumed. This holds true for anything that results in more usage: a busy weekend, an in-game event, a new piece of downloadable content, etc. It's also a great way to deal with not knowing exactly how many players are going to show up on day 1.
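
Conceptually the blend is just a control loop; a cartoon sketch (every helper name and number here is invented for illustration):

    STATIC_CAPACITY=50000         # players the static fleet can hold
    PER_INSTANCE=500              # players per elastic instance
    players=$(get_player_count)   # hypothetical metrics query
    if [ "$players" -gt "$STATIC_CAPACITY" ]; then
        # round the shortfall up to whole instances
        needed=$(( (players - STATIC_CAPACITY + PER_INSTANCE - 1) / PER_INSTANCE ))
        scale_cloud_fleet "$needed"   # hypothetical wrapper around the provider's API
    fi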

Regarding latency, we have machines in many smaller datacenters around the world. We can generally get players far closer to one of our machines than to AWS/GCP/Azure, resulting in better in-game ping, which is super important to us. This will change over time as more and more cloud DCs spring up, but for now we're pretty happy with the blend.


AI compute is so much cheaper on-prem that it's not even in question.

And there are clients that demand it.

And researchers, in general, like to do totally wacky things, and it's often easier/cheaper to let us if you have physical access.


+1 on this. Get a nice server with some GPUs and you'd save a lot more than paying the super expensive costs on cloud.


+1, that's one of the reasons for us too.

GPU machines are clearly too expensive in most cloud providers compared to on-prem.


Yep. This is where we are at.


I'm in rural Iowa and you really can't bank on a solid internet connection. One of my clients decided to IaaS-ify all their servers and it works great, except when it's windy out. They're on fixed wireless and the remote mast has some sway to it. 3-4 times a year they get struck by lightning and have a total work stoppage until all their outside gear can be replaced. Even their VDI is remote, so all the thin clients just disconnect and they are done for the day.

Also, my clients aren't software development firms. They are banks and factories. They buy a software based on features and we figure out how to make it work, and most of the vendors in this space are doing on-prem non-saas products. A few do all their stuff in IAAS or colo but a lot of these places are single-rack operations and they really don't care as long as it all works.

A lot of people in small/midsize banks feel like they are being left out. They go to conferences and hear about all the cool stuff in the industry, but the established players are not bringing that to them. If you can stomach the regulatory overhead, someone with drive could replace finastra/fiserv/jackhenry. Or get purchased by them and get turned into yet another forever-maintenance-mode graveyard app.


Founder of a growing startup:

Started with a cluster of Raspberry Pis and expanded onto an old desktop. Primarily did this for cost (the Raspberry Pis alone were more powerful than a $35/mo GCP instance). Everything was fine until I needed GPUs and more traffic handling than those Raspberries could manage, so I expanded by including cloud instances in my Docker Swarm cluster (tidbit: using Traefik and WireGuard).
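
For anyone wanting to replicate the half-and-half setup: the cloud node just joins the home swarm over the tunnel (the token and IPs below are placeholders):

    wg-quick up wg0                                         # bring up the WireGuard tunnel
    docker swarm join --token SWMTKN-1-xxxx 10.0.0.1:2377   # join the swarm over the tunnel
    # Traefik then runs as a swarm service and routes to containers on any node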

So, half on-prem, half in the cloud. Honestly I'm just scared GCP might one day cancel my account and I'll lose all my data unless I meet their demands (it has happened in the past), so the on-prem half stores most of the data.


> Honestly just scared GCP might one day cancel my account and I'll lose all my data unless I meet their demands

This hasn't been emphasized enough in these discussions.

Once you're in the cloud, you're dealing with computers as a "utility". Or are you?

When you buy electricity from your local utility, you don't just lose it in the middle of some afternoon for some unknown reason. There's a clear and established procedure, supervised by a Public Utilites Commission. You aren't forced to resort to sending emails and receiving some idiotic canned response from a bot. "Sorry we have reviewed your case and our decision is final. By the way, we don't tell you what you did wrong. Fuck you. Ha. Ha. Ha."

I don't know if Amazon or Azure are better than Google. But I wouldn't trust Google with my worst enemy's data.


> You aren't forced to resort to sending emails and receiving some idiotic canned response from a bot. "Sorry we have reviewed your case and our decision is final. By the way, we don't tell you what you did wrong. Fuck you. Ha. Ha. Ha."

Incidentally, this is exactly what happened to me on my very first attempt with Digital Ocean this year. Just an autoresponder that kept sending the same email, with no way to reach a human. And this was just after creating an account and going through a payment authorization process. No money was lost, but I'm not touching a system where humans refuse to get involved.

Edit: To add a comparison on this aspect, Hetzner wanted some verification for a new account and handled it quite well with humans involved (and responsive).


At $35/month, though, GCP would only have to save you half an hour of maintenance for it to be worth it.


Well, given that I am using Docker it doesn't really matter much... but the bigger issue is: GCP in the past has completely blocked access from accounts when they detect random things.

Unless I meet their demands, my entire infra is gone/down for days, which I can't deal with.


You write this as if it's unrealistic. Who spends half an hour or more every month managing each tiny server? The closest server to me right now has probably taken about that long in the past year, swapping out a failed drive that unfortunately wasn't in a hot-swap bay. And it's much more powerful than a $35/month cloud instance.


A half-hour a month seems like a lot of maintenance for a server.


I use a $5/mo DigitalOcean VPS droplet instead of AWS or other "cloud" service. I only have to host an analytics dashboard ( https://usertrack.net/ ), I don't need scaling and this way I know exactly how much I will pay. The resources are more than enough for my needs and I don't think it could be much cheaper even on the most optimized pay-per-minute of use cloud platforms.

I also have some other APIs hosted in the same way (e.g. a website thumbnail generation API); given the very low traffic I have and no chance of burst traffic, I think a VPS or dedicated server is a perfect fit for the use case.


Whenever I need to host something small and I’m trying to decide between DO and AWS I always ask myself. Would I rather be surprised by the bill or my website crashing from too much traffic? I almost always pick DO because I don’t want to mess something up and lose a few hundred dollars.


Wholeheartedly agree. I think AWS is moving in the right direction with Lightsail[0], which is a service very similar to DO droplets and includes transfer. Nice if you want to use AWS for like one or two other services, but I tend to still go with DO for small things.

[0]: https://aws.amazon.com/lightsail/


That sounds interesting. By "moving in the right direction" do you mean that it's still in beta or not released yet? Or that it's just the first step of many to come?


Lightsail works well now. It's a little over three years old. In the first year or so after it was released, they were notorious for being slow in most regards compared to their peers (it launched using rebranded instances from AWS, and used spinning disks, going up against SSDs their competitors were all using). They've largely caught up on performance with DigitalOcean, Linode, Vultr and similar.

That said, I've stuck with DigitalOcean even though Lightsail tests fine. I've had a great experience over the years with DO and see no reason to leave.


I use DO now because I like their interface and that they keep adding new features. Plus in my case I can easily spin-up new servers with their prebuilt images marketplace and cloud-init support. I tried using AWS before, but last time I did their interface was awful and pricing hard to understand/predict.


off topic: do you really track users' mouse movements and personal data (IP & such) on your side without asking for permission? Kinda strange these days, isn't it?


I want to start by saying that that is my own analytics platform.

> personal data (IP & such)

I agree, storing the IP at all is a bit strange. I am still working on the privacy policy and an option to not store the IP at all. All tracking is session-based (no cookie), and no personal information is stored (only the IP, which is currently only partially displayed, and which is AFAIK not considered personal information).

> track the users mouse movements

The site has no inputs, so it only tracks mouse movements/scrolling. Do you consider this to be a privacy concern, especially if the visit is anonymized (cannot be linked to you as a person)? Seeing a replay of the visit is really useful for understanding what visitors do on the site and how to make their experience better.

> Kinda strange these days, isn't it?

I agree, but I still think it's a lot better than other platforms: it complies with GDPR regulations, no cookies are involved, no data is sent to third parties, and no personal information is stored (only the previously mentioned IP). I am also continuously improving it and adding more privacy-focused features.


I like DigitalOcean... but they are a cloud service too.


Most people I talk to or see online discussing their Digital Ocean usage just use their VPS.

Yes they offer managed databases and kubernetes, but these are newer offerings. I of course don't know how well used they are, but I never see or hear anybody talking about them.

I use them for all my stuff when I'm not on company time, and the reason is that I can just say "spin me up a Debian box," which I cannot easily do on the things folks typically describe as cloud providers.


I think the terminology is confusing. If I quote Wikipedia, which should never be done: "Cloud computing is the on-demand availability of computer system resources, especially data storage and computing power, without direct active management by the user".

I personally refer to servers as "cloud" only if they have auto-scaling and usually to services such as AWS or GCP instead of VPSs such as DO, Linode, Vultr, etc.


Now that we're in the habit of quoting Wikipedia... "DigitalOcean provides developers cloud services that help to deploy and scale applications that run simultaneously on multiple computers."


On-prem makes your cost control proactive, rather than reactive. Nobody gets a new server without first having a purchase order approved - and the burden to get that approval falls on the person who wants the server.

In the cloud, at least the way it's generally used, cost control is reactive: You get a bill from AWS every month, and if you're lucky you'll be able to attribute the costs to different projects.

This is both a strength and a weakness: on-premise assets will end up at much higher utilisation, because people will be keen to share servers and dodge the bureaucracy and costs of adding more. But if you consider isolation a virtue, you might prefer having 100 CPUs spread across 100 SQL DBs instead of 50 CPUs across two mega-databases.


Lots of great insights here, which fully accord with my experience, even in the small end of town.

About a year ago, I was in a meeting with my new CEO (who had acquired my company). My side of the business had kept hardware in-house, his was in AWS. We had broadly similar businesses in the same industry and with the same kind of customers.

My side of the business needed to upgrade our 5+ year old hardware. The quote came to $100K; the CEO freaked out. I asked him how much he spent on AWS?

The answer was that they spent $30K per month on AWS.

The kicker is that we managed 10x as many customers as they did, our devops team was half the size, and we were rolling out continuous deployment while they were still struggling to automate upgrades. Our deployment environment is also far less complicated than theirs, because there isn't a complex infrastructure stack sitting in front of our deployment stack.

There was literally no dimension on which AWS was better than our on-prem deployment, and as far as I was able to tell before I quit, the only reason they used AWS was because everyone else was doing it.


With all the job hopping that goes on in tech, there is a lot of Resume Driven Development. People want to use AWS because it will help them get their next job.

I'm finally in a job that I'm happy with and can see myself staying in until retirement. I have noticed that has changed my technology recommendations. For example, we recently started looking at configuration management tools. Ansible is the obvious choice from a resume perspective, as it is very popular. I ended up recommending PowerShell DSC. Why? Because our environment is mostly Windows, the team is familiar with PowerShell, and for our use case it is much faster. PowerShell DSC is not as popular, so it won't help me get another job. When it comes time to expand the team, I can hire someone who understands configuration management tools or PowerShell and get them up to speed in a day or two.


>There was literally no dimension on which AWS was better than our on-prem deployment

The big pitch is opex vs. capex. This is one of the leading "value propositions" being made to the C-suite. If you're a business that's worried about capex, then the opex model will look very appealing... until someone looks at how much you'll (generally) be screwed by the opex model in the mid to long term. Most 'decision makers' in large corporations are short-term oriented, and the opex model ends up winning.


We hear from our customers mostly what has been said here: cost and mental overhead. There is a bit of a paradox - companies that plan to grow aggressively are wary of AWS bills chopping their runway in half - they're very aware of _why_ cloud providers give out a year for free to most startups - they recoup that loss very fast once the cash faucet opens up.

What really gets me is that most cloud providers promise scalability but offer no guard-rails - take diagnosing performance issues in RDS, for example. The goal for most cloud providers is to ride the line between your time cost and their service charges. Sure, you can reduce RDS spend, but you'll have to spend a week to do it - so bust out the calculator or just sign the checks. No one will stop you from creating a single point of failure - but they'd happily charge consulting fees to fix it. There is a conflict of interest - they profit from poor design.

In my opinion, the internet is missing a platform that encourages developers to build things in a reproducible way. Develop and host at home until you get your first customers, then move to a hosting provider down the line. Today, this most appeals to AI/ML startups - they're painfully aware of their idle GPUs in their gaming desktops and their insane bill from Major Cloud Provider. It also appeals to engineers who just want to host a blog or a wedding website, etc.

This is a tooling problem that I'm convinced can be solved. We need a ubiquitous, open-source, cloud-like platform that developers can use to get started on day 1, hosting from home if desired. That software platform should not have to change when the company needs increased reliability or better air conditioning for their servers. Whether it's a WordPress blog, a Minecraft server, or a petabyte SQL database - the vendor should be a secondary choice to making things.


I've found that Kubernetes mostly solves this problem. I say mostly because for AI/ML workloads that require GPUs, we still rely on running things on bare metal locally, and deploying with GKE's magic annotations and Deep Learning images. But for anything else, I haven't had an issue going all in on k8s at the beginning, even with very small teams.


Yep! My startup is https://kubesail.com, so I agree :)

As for ML on Kube, I agree, there have been and still are some rough edges. The kernel drivers alone make a lot of out-of-the-box Kubernetes solutions unusable. That said, we've had a lot of success helping people move entirely onto Kube - the mental gain from ditching the bash scripts or Ansible playbooks (etc.) alone is pretty freeing.


> This is a tooling problem that I'm convinced can be solved. We need a ubiquitous, open-source, cloud-like platform that developers can use to get started on day 1, hosting from home if desired. That software platform should not have to change when the company needs increased reliability or better air conditioning for their servers. Whether it's a WordPress blog, a Minecraft server, or a petabyte SQL database - the vendor should be a secondary choice to making things.

This is the intent of Kubernetes. Not that I particularly like it.


K8s might eventually get there - technically it fits the bill - but it's got so many moving parts that it's rarely worthwhile in a domestic setup


I don't disagree with you (certainly I'm bullish on Kubernetes, hence the startup) but I have to say many people felt similarly about running web servers in general back in the day. Init script management, maintaining RAID arrays, etc. was a project for nerds, not a mainstream activity. Today though, we have replicating databases, we have cloud storage available for backups, we have intelligent disk-replication systems... In my view, it's just a matter of time until Kubernetes can be run by beginners at home. In the late 90s I probably re-installed Linux on my home server at least a hundred times after messing up some component I didn't understand. Kube is no different. Eventually (I think we're incredibly close), you just install it and forget about it! `microk8s`, among others, is really close to one-click these days.


a problem with these 'easy' kubernetes deployments, quickstart guides etc is that they have people churning out insecure, nowhere-near-production-grade clusters that are often using a $cloud-provider managed k8s control plane and cloud-specific features & integrations for storage, load-balancing, databases, and more. those prevent easy replication of the environment to another cloud provider, let alone a home/lab environment - and it's just as difficult to port a non-cloud k8s setup over to a cloud provider, no matter how many sidecars are tacked on.

this problem extends even further down, because the cloud infrastructure substrate that k8s runs on needs to be configured too, and even when using a tool that supports multiple cloud providers, such as terraform, you end up with lots of cloud-specific code


My main problem with docker/Kubernetes for small projects is automated patching.

With Ubuntu, I can set up automated upgrades/restarts pretty easily, but with docker/k8s I end up having to write my own script to handle it - or at least that was the case last time I looked.
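
For what it's worth, the "own script" doesn't have to be much. A minimal sketch of what I mean in Python - assuming a docker-compose host, tolerance for brief restarts, and a cron entry to trigger it; the path is made up:

    # Naive auto-update for a docker-compose host: pull newer images,
    # recreate only the containers whose images changed, prune old layers.
    import subprocess

    COMPOSE_DIR = "/opt/myapp"  # hypothetical compose project directory

    def run(*cmd):
        subprocess.run(cmd, cwd=COMPOSE_DIR, check=True)

    run("docker-compose", "pull")
    run("docker-compose", "up", "-d")
    run("docker", "image", "prune", "-f")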


Personally, I'd rather have capex than opex.

My observations from working with, and in the "cloud":

The "cloud" does benefit from it's scale in many ways. It has more engineers to improve, fix, watch, and page. It has more resources to handle spikes, whales, and demand. Almost everything is scale tested and the actual physical limits are known. It is damn right impressive to see what kind of traffic the cloud can handle.

Everything in the "cloud" is abstracted which increases complexity. Knowledgeable engineers are few and far between. As an engineer you assume something will break, and with every deployment you hope that you have the right metrics in place and alarms on the right metrics.

The "cloud" is best suited for whales. From special pricing to resource provisioning, they get the best. The rest is trickled down.

Most services are cost-centers. Very few can actually pay for the team and the cost of its dependencies.

It's insane how much VC money is spent building whatever the latest trend of application architecture is. Very few actually hit their utilization projections.


Yes. Three major reasons:

- Cost. It's vastly cheaper to run your own infra (like, 10-100x -- really!). The reason to run in cloud is not to save money, it's to shift from capex to opex and artificially couple client acquisition to expenditure in a way that juices your sheets for VCs.

- Principle. You can't do business in the cloud without paying people who also work to assemble lists of citizens to hand over to fascist governments.

- Control. Cloud providers will happily turn your systems off if asked by the government, a higher-up VP, or a sufficiently large partner.

EDIT: I should add. Cloud is great for something -- moving very fast with minimal staffing. That said, unless you get large enough to renegotiate, you will get wedged into a cost dead end where your costs would be vastly reduced by going in-house, but you cannot afford to do so in the short term. Particularly for the HN audience, take care to notice who your accelerator is directing you to use for cloud services -- they are typically co-invested.


Regarding the shift from CapEx to OpEx: on-prem servers can also be leased, keeping their costs in OpEx.


I’ve been studying like a fiend to get AWS certs and thoroughly understand the cloud value proposition, especially for ephemeral workloads and compliance needs. I’m all for cloud solutions that make sense, and I love when serverless/usage-only systems can be deployed. That said, I recently started work on a friend’s system that he has had running in colo for a long time. It’s absolutely insane how long his systems have been up. There are processes that have been alive since 2015, with some hosts having uptime longer than that. He’s got a nice HA configuration but hasn’t had any incidents that have triggered failover. He recently built a rack for his home with 384GB RAM and gobs of CPU across 3 nodes - rack, nice switch and UPS included - for just shy of $2500 (he is quite the bargain hunter...). I did some quick math and found a similarly equipped cluster (of just VMs, not dedicated hosts) has a 1.1-month break-even with on-demand costs, no bandwidth considered. Sure, maybe a 1-year reservation could make it a 2-3 month break-even instead, but why? Those machines can easily give him 3-5 years of performance without paying another dime.
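
To show my work on that break-even, under the assumption that his 3 nodes map to three 128GB memory-optimized VMs at roughly 2020 on-demand rates (the instance choice and hourly price are my guesses, not his exact quote):

    hw_cost = 2_500                    # rack, switch, UPS, 3 nodes
    hourly = 1.00                      # assumed $/hr per 128GB on-demand VM
    cloud_monthly = 3 * hourly * 730   # ~$2,190/mo for the trio
    print(f"break-even: {hw_cost / cloud_monthly:.1f} months")  # ~1.1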

If you can feasibly run workloads on-premise or colo, and have a warm failover to AWS, you could probably have the best of all worlds.


If he has processes that have been up since 2015 how is he patching? That's one of my biggest gripes with on-prem, it's easy to leave something that works alone... until it gets popped by a 5 year old vuln.

In cloud I'm constantly looking at what we have because I have good billing tools in place to see what we're paying for.


I always find it important to separate "cloud" into 2 categories:

1. IaaS - Which I mainly define as the raw programmable resources provided by "hypercloud" providers (AWS, GCP, Azure). Yes, it seems that using an IaaS provider with a VPC can provide many benefits over traditional on-prem data centers (racking & stacking, dual power supply, physical security, elasticity, programmability, locations etc).

2. SaaS - I lump all of the other applications by the hundreds of thousands of vendors into this category. I find it hard to trust these vendors the same way that I trust IaaS providers and am much more cautious of using these applications (vs OSS or "on-prem software" versions of these apps). They just don't have the same level of security controls in place as the largest IaaS providers can & do (plus the data is structured in a way that is more easily analyzed, consumed by prying eyes).


What about first-party SaaS? Those can also be big features that bring people to some cloud providers. Not all SaaS requires you to trust your data/availability to some random vendor. Of course those first-party SaaS aren't typically suitable for lift-and-shift by their very nature, and they can still have some rough edges, but IMO you can expect them to be almost as reliable as IaaS


First-party SaaS meaning things like RDS, DBaaS, queues, LBs etc? Most of that I would sort of put into a IaaS controlled PaaS, rather than true IaaS SaaS. Yes, these are generally higher on the trust spectrum as they don't involve additional vendors accessing/managing/storing data.


A major one I'm thinking of is BigQuery, also of course all the various db/queue solutions outside of your typical S3 clone as you mentioned. That would make sense viewing them as platforms though


I was referring to the IaaS in this question.

As for the SaaS, I guess your mileage may vary. I trust some of them really make a point of securing your data :)


I work for a large video games publisher, as you might expect we use a lot of windows.

Windows server licenses on AWS and GCP are hundreds of times more expensive at our scale. Incidentally we actually do have some cloud infra and we like it, but the licensing cost is half the total price of the instance itself.

In fact, you might not know this, but games are relatively low margin, and we have accidentally risked the company's financial safety by moving into the cloud.


TBF windows licensing in general is a shit show, to the point where just handling that is a specialized ability potentially warranting a full time position.


I'm a solo founder who's bootstrapped a saas that's in one state. I'd started out with the cloud then moved to a private cloud in a colocated data center. Saved more than 80% in monthly costs. Got faster speeds, better reliability and a ton of extra compute and network capacity. I just bought used servers from eBay that are retired from big corps. Nothing significant has really changed in the last five years on compute so I'll happily take their depreciated assets :)

__Modern__ servers are really awesome and I totally recommend them. You can do a ton remotely.


Many of the stories here are from large companies, where the costs are quite a different beast. I want to offer an opposite view - from a small company (20-30 people), which is usually the kind of company best suited for the cloud.

We ran a number of modelling jobs, basically CPU intensive tasks that would run for minutes to hours. Investing in on-prem computers (mostly workstations, some servers), we got very solid performance, very predictable costs and no ops issues. Renting beefy machines in the cloud is very expensive and unless you get crafty (spot and/or intelligent deployment), it will be prohibitive for many. Looking at AMD's offering these days, you can get sustained on-prem perf for a few dollars.

Three details of note: 1) We didn't need bursty perf (or only very infrequently) - had that been a need, the cloud would make a lot more sense, at least in a hybrid deployment. 2) We didn't do much networking (I'm in a different company now where we work with a lot of storage on S3, and on-prem wouldn't be feasible for us). 3) We didn't need to work remotely much; it was all at the office.

Obviously, it was a very specific scenario, but given how small the company was, we couldn't afford people to manage the whole cloud deployment/security/scaling etc. and beefy workstations was a much simpler and more affordable endeavour.


The idea of the cloud is to only pay for what you use. Your on-premise server is idle 99% of the time so why are you paying for a full server?

If that's not true, it turns out it's quite expensive to run things in the cloud. If your workload is crunching numbers 24/7 at 100% cpu, it's better to buy the cpu than to rent it.
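
To put rough numbers on that (both figures are assumptions, roughly 2020 list prices for an 8-vCPU instance vs. a comparable owned box):

    rent_monthly = 0.34 * 730   # ~$248/mo for an 8-vCPU instance, 24/7
    buy_cost = 2_000            # comparable server bought outright
    print(f"break-even: {buy_cost / rent_monthly:.0f} months")  # ~8
    # After that the owned box is free compute, minus power and admin time.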


> Your on-premise server is idle 99%

That's almost never true for any organisation big enough to start using virtualisation. Typical cluster utilisation levels that I have seen are 20-60%.

It starts to make sense to buy a small 5-node VMware cluster for orgs as small as 100 users, at which point the cloud becomes questionable.

I mean, sure, if you have less than 100 staff and it doesn't make sense to buy a cluster, use the cloud.


Cloud servers tend to be more reliable as well if you don't run your own datacenters. We have lost our internet connection or power 3 times in the last year at the office. It's not the end of the world, since we can fall back to 4G for our own usage, but if our servers were hosted locally this would be a huge issue.


Don't forget that between cloud and servers in the company there are still VPS and rented dedicated hardware in a data center.

So you:

1. Don't manage hardware.

2. But manage a server (OS+software stack).

3. Have reliable internet, power and physical security from the data center you are renting your hardware from (if you trust them fully!).

4. Have fixed cost but also fixed resources. Tends to be cheaper for many tasks. Especially CPU/GPU heavy ones.


I consider VPSs to be cloud servers. Is this not common?


I mentioned in another comment that I use VPSs and not cloud services. I think of cloud as auto-scaling infrastructure with dynamic pricing. I think of a VPS as just sharing a dedicated machine with others, so each one gets a few cores and shares other resources. The implementation of VPSs nowadays is probably more similar to cloud services, where your own space might be moved around to another physical machine without any downtime.


So you consider cloud servers to be what most people call serverless (S3/serverless functions/etc)?


I do hate the term "serverless" as it makes no sense, but I think of cloud as a system that automatically spins up/down VPSs based on your current usage. This means the infrastructure/software also allows for automatic load-balancing between those VPSs. So I think of cloud as the VPS servers that host the actual data, plus the layer on top that does all the scaling, provisioning, load-balancing, etc.


Disagree; you're still at the mercy of whatever last mile is __between__ the cloud and the office.

Also any bandwidth limitations; in the US that's still a big deal.

Public facing stuff should probably be colo or rented server.


In my case they are public/customer facing services. Our internal stuff like CI runs on premise.


We spend ~$50k/mo on serverless infrastructure on AWS.

It hurts sometimes, given we were fully colocated about 4 years back, and I know how much hardware that could buy us every month.

However, with serverless infra we can pivot quickly.

Since we're still in the beta stage, with a few large, early access partnerships, and an unfinished roadmap, we don't know where the bottlenecks will be.

For example, we depended heavily on CloudSearch until it sucked for our use case, so we shifted to Elasticsearch and ran both clusters simultaneously until we were fully off of CS. If we were to do that on-prem, we'd have had to order a lot more hardware (or squeeze the new ES cluster VMs onto already heavily utilized nodes).

With AWS, a few minutes to launch a new ES cluster, dev time to migrate the data, followed by a few clicks to kill the CloudSearch cluster.

Cloud = lower upfront, higher long term, but no ceiling. On-prem = higher upfront, lower long term, but ceiling.


Your costs are going to skyrocket in a nonlinear fashion as you exit the beta stage.

That’s the biggest counterpoint to using the cloud as an initial setup.


If "Cloud = lower upfront, higher long term, but no ceiling. On-prem = higher upfront, lower long term, but ceiling" is true, then how come the revenue of cloud companies keeps going up?

That would mean the rate of incoming users who are just starting off and find cloud worthwhile is higher than the exit rate of mature users who find on-prem more worthwhile than cloud


If you're spending more than 50k/month on AWS where is the money to move to on-prem? When they got you, they got you.


The (startup) Oxide podcast has good history/stories about on-prem servers, from veterans of pioneering companies. They are fans of open-source firmware and Rust, and are working to make OCP-based servers usable for on-prem. In one podcast, they observed that cloud is initially cheaper, but can quickly become expensive with growth. There is a time window where you can still switch from cloud to on-prem, but if that window is missed, you're left with high switching costs and high cloud fees.

https://oxide.computer/podcast/


Their co-founder Bryan Cantrill gave a talk at Stanford on what they are trying to do: essentially, offer on-prem servers comparable to what “hyperscalers” like Google and Facebook put in their data centers — highly efficient and customizable (in low-level software), iirc.

https://youtu.be/vvZA9n3e5pc


Manufacturing. The cost to a factory if the internet is down is too great. Each facility has its own highly redundant virtualization infrastructure hosting 50 to 100 VMs.


I was in manufacturing IT before moving to big tech. Our big campuses in the US & Europe had 40-80mbps internet circuits. The remote facilities in developing countries often only had 10mbps MPLS connections to a regional hub. To be 100% honest, we had 10x the outages caused by crappy local infrastructure than anything having to do with a SaaS service or IaaS/PaaS provider. Seriously, things like bad storms, a snake (cobra!) sneaking into the server room and frying itself and a machine it was snuggling against, utility workers accidentally severing cables, generators failing during power outages, labor strikes, and so much more. Moving to the cloud -- or even just hosting everything centrally -- was much more stable than maintaining a fleet of distributed machines.


Manufacturing here. This is also the case with us. Also, the cost to run our on-prem setup in the cloud would be almost 10x what we pay right now. Such an insane amount of CAD data - there is 0% chance we can get an internet connection that comes close to what on-prem can handle for our massive CAD data.


Manufacturing is the one place I still use on-prem, mostly because I work in a field that has compliance reasons for doing so, plus the 'internet's out, we can't print' problem.


Not sure what you are making but that’s way more advanced than any factory I’ve been to!


I work in gamedev. Build servers, version control, etc. are almost always on-premise, even if a lot of other stuff has been transitioned to the cloud. There's a few reasons:

1) Bandwidth. I routinely saturate my plebian developer gigabit NIC links for half an hour, an hour, longer - and the servers slurp down even worse. In an AAA studio I am but one of hundreds of such workers. Getting a general purpose internet connection that handles that kind of bandwidth to your heavily customized office is often just not really possible. If you're lucky your office is at least in the same metro area as a relevant datacenter. If you're really lucky you can maybe build a custom fiber or microwave link without prohibitive cost. But with those kinds of geographical limitations, you're not so much relying on the general internet as you are expanding your LAN to include a specific datacenter / zone of "the cloud" at that point.

2) Security. These servers are often completely disconnected from the internet, on a completely separate network, to help isolate them and reduce data exfiltration when some idiot installs malware-laden warez, despite clear corporate policy threatening to fire you if you so much as even think about installing bootleg software. Exceptions - where the servers do have internet access - are often recent, regrettable, and being reconsidered - because of, or perhaps despite, draconian whitelisting policies and other attempts at implementing defense in depth.

3) Customizability. Gamedev means devkits with strict NDAs and physical security requirements, and a motley assortment of phone hardware, that you want accessible to your build servers for automatic unit/integration testing. Oddball OS/driver/hardware may also be useful for such testing. Sure, if you can track down the right parties, you might be able to have your lawyers convince their lawyers to let you move said hardware into a datacenter, expand the IP whitelists, etc... but at that point all you've really done is made it harder to borrow a specific popular-but-discontinued phone model from the build farm for local debugging when it's the only one reproducing a specific crash when you want to debug and lack proper remote debug tooling.

...there are some inroads on the phone farms (AWS Device Farm, Xamarin Test Cloud) but I'm unaware of farms of varied desktop hardware or devkits. Maybe they exist and just need better marketing?

I have some surplus "old" server hardware from one such gamedev job. Multiple 8gbit links on all of them. The "new" replacement hardware is often still noticeably bottlenecked for many operations.


There's a renewed interest in on-prem bare metal recently with a lot of different offerings helping to make various parts of the stack easier to manage.

Awesome bare metal is a new repo created by Alex Ellis that tracks a lot of the projects: https://github.com/alexellis/awesome-baremetal

Also we (Packet) just open sourced Tinkerbell, our bare metal provisioning engine: https://www.packet.com/blog/open-sourcing-tinkerbell/


1. 500TB of storage for 3-6mo of CCTV footage.

2. Bought a handful of $700 24-core Xeons on eBay 2 years ago for 24/7 data crunching. Equivalent cloud cost was over $3000/mo. On-prem paid off within a month! (Rough math below.)

3. Nutanix is nice. Awesome performance for the price and almost no maintenance. Got 300+ VDI desktops and 50+ VMs with 1ms latency.
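
Rough math on point 2, assuming "a handful" means four units and the $3000/mo covers the same sustained crunching:

    servers, unit_cost = 4, 700
    cloud_monthly = 3_000
    print(f"payback: {servers * unit_cost / cloud_monthly:.1f} months")  # ~0.9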


Nutanix is a nightmare when it's over-utilized or you have random node failures. We have 12 nodes and I think lost about 4 man-weeks last year to supporting it.


> $700 24 core Xeons on eBay 2 years ago

Can you clarify? I don't think such a product exists as a single chip at that price point. The Threadripper 3960X costs $1400, and that released less than a year ago.

Edit: Looking up Intel chips on Wikipedia, I think you might be using 12-core/24-thread chips...

https://en.wikipedia.org/wiki/Skylake_(microarchitecture)#Xe...


I picked up three Dell R900's for an average of $200 each, with 4x Xeon E7450 and 128GB of ECC RAM. No hyperthreading (2011!), it's 24 real cores.

They're noisy and use lots of power, but you can't argue with the value for money.


I have one similar to this. I don't think it is an R900, but it's an older 1U rack mount. I forget if it's a 12 or 24 core xeon, but it was dirt cheap, came with 72 gigs of RAM, and sounds like a jet engine turning it on. I recently built a Ryzen box with 128 gigs of RAM and it's much quieter...


How high are your power costs?


Manufacturing company with insane deal with local utility. No idea but server room cost is probably under 2% of machinery cost so hard to quantify.


CEO of an AI startup & former AWS employee here.

The cloud sucks for training AI models. It's just insanely overpriced, in a way that no "Total Cost of Ownership" analysis is going to make look good.

Every decent AI startup––including OpenAI––has made significant investments in on-premise GPU clusters for training models. You can buy consumer-grade NVIDIA hardware for a fraction of the price that AWS pays for data center-grade GPUs.

For us in particular, the payback on a $36k on-prem GPU cluster is about 3-4 months. Everything after that point saves us ~$10k / month. It's not even close.
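
For anyone modelling this for their own workloads, the payback arithmetic is just (numbers from the post; the 5-year horizon is my own extrapolation):

    cluster_cost = 36_000
    monthly_savings = 10_000   # cloud bill avoided minus on-prem running costs
    print(f"payback: {cluster_cost / monthly_savings:.1f} months")   # 3.6
    print(f"5yr net: ${monthly_savings * 60 - cluster_cost:,}")      # $564,000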

When I was at AWS, I tried to point this fact out to the leadership––to no avail. It simply seemed like a problem they didn't care about.

My only question is why isn't there a p2p virtualization layer that lets people with this on-prem GPU hardware rent out their spare capacity?


Are TPUs too application-specific to replace GPUs? It seems cloud TPUs could be competitive with GPUs in terms of $ per target epoch, provided you can do data parallelism for your workloads.

Also, IBM offers bare metal pricing which is somewhat cheaper than virtualized instances attached to GPUs (and faster too).

I think GPU virtualization is not quite there yet because Nvidia does not give access to core GPU functionality needed for efficient virtualization - you're stuck with using their closed-source libraries and drivers.


Bandwidth is the big reason to stay on-prem. Good peering contracts can more than make up for any cloud advantages for bandwidth intensive uses.

Now the hard part is turning those cost advantages into operational improvements instead of deficiencies.


Because we don't trust a foreign cloud provider with our clients' data. Why is that so hard to understand?

All the best cloud providers are from the US, and as a European company with clients in European government and healthcare, we are often not morally or legally allowed to use a foreign provider.

The sad thing is that this is an ongoing battle between people on a municipal level who believe they can save money in clouds, and morally wiser people who are constantly having to put the brakes on those migration projects.


What about the existing/upcoming technologies that let you use the cloud without trusting it?


Doesn't matter, because all US companies are subject to US laws and agencies. Even if the data is in Ireland, they are obliged to cooperate, and before you know it all our patient records are leaked in the States.

And speaking of Ireland, I have a memory of a Microsoft EULA saying that even if the data is stored in Ireland they can't guarantee that it won't be transferred to the US.


Small company here (30 total people, 8 in IT/software).

- unwillingness to seed control of the critical parts of our software infrastructure to a third party.

- given our small team size and our technical debt load we are not currently able to re-architect to make our software cloud-ready/resilient.

- true cost estimates feel daunting to calculate, whereas on-prem costs are fairly easy to calculate.


I agree with the last 2 points. Estimates are hard to get right, and rearchitecting an existing app is probably not worth it.

What about your first point though? Do you not trust a 3rd party to maintain infrastructure properly? In what way?


> unwillingness to seed control

Typo alert: you mean “cede control”, not “seed control”.


I'm an application architect at an enterprise-type org. We have a few SaaS applications, but all the big dogs, including custom dev, run in-house on a dual data centre VMware environment. It's cheaper for us to spin up more VMs in-house, so there's no real cloud driver for things that just live on VMs. On the other hand, our ops team are still working on network segmentation and self-service, but I regularly get a standard VM provisioned in less than 20 mins. If we had to buy new data centres it might be different.

But the real reason we're not deeper in the cloud is that our business-types insist on turn-of-the-century, server-based software from the usual big vendors, and all the things that integrate with them need to use 20th century integration patterns, so for us migrating to the cloud (in stages at least) would have the drawbacks of all options without the benefits. Things will only really change where we have cloud-native stuff that we can sneak in under the radar for stand-alone greenfield projects, or where we convince the business types that they can replace the Oracles and PeopleSofts with cloud-first alternatives.


Last company I was at was more or less as you describe.

Now.. at a company in a different industry, there's a 5+ year plan to move 100% to cloud. Nascent efforts are about 18 mos old already, no apps are live yet.

Fortunately, they've been using a container approach for their on-prem stuff for a while, so some stuff can move over pretty easily, a lot of things will get a touch-up or more interesting upgrade along the path to the cloud environment.

Not even talking about de-commissioning the DCs yet, but those will get defunded as things go on.


Security... not against hackers, but against the provider's government.

I worked for some French and European companies, with IP, sensitive information, and US business competitors. Under US law, US companies may have to let the US government spy on their customers (even non-US customers, even in non-US locations), so this can be a problem for strategic sectors, like defense for example.

In that case, sensitive information is required to be hosted in-country, by a company of that country, under that country's law.

Of course, it's not against "cloud" in general... only against US cloud providers (and Chinese, and...)


I work on two projects, neither of which use cloud.

For my day job, it is privacy and legal constraints. I work for the government and all manner of things need to be signed off on to move to cloud. We could probably make it work, but in government, the hassle of doing so is so large that it is not going to happen for a long time.

In my contract project, it is a massive competitive advantage. I won't go into too many details, but customers in this particular area are very pleased that we do not use a cloud provider and instead host it somewhat on-premise. I don't see a large privacy advantage over using the cloud, but the people buying the service do simply because they are paranoid about the data and every single one of them could personally get in a lot of trouble for losing the data.

Not my project, but intensive computing requirements can be much more cheaply filled by on-premise equipment (especially if you don't pay for electricity), so my university does most of its AI and crypto research on-premise.


My company moved from all on-prem to all in AWS. Having used both, I'd much rather go back to on-prem. I did architecture, capacity planning, system tuning, deployments, etc. I knew everything about all of those systems, and treated them as sacred. The next generation came in, decided not to learn anything about systems, and brought in the 'systems as cattle' attitude and everything that comes with it.

I try to remain objective, there are some pros to AWS, but I still much prefer my on prem setup. It was way cheaper, and deployments were way faster.


- Uptime

The number and frequency of outages in Azure are crazy. They happen non-stop all year round. You get meaningless RCAs but it never seems to get better, and if it did, you'd have no way of knowing.

Compare this with doing stuff internally - you can hire staff, or train staff, and get better. In the long run, outsourcing and trusting other companies to invest in "getting better" doesn't end very well. The fact that they moved their overall metrics from 99.9 to 99.91 may not help your use case.

- Reliability

Their UIs change every day, and there's no end-to-end documentation on how things work. There's no way to keep up.

- Support

Azure's own support staff are atrocious. You have to repeatedly bang your head against the wall for days to get anyone who even knows the basic stuff from their own documentation.

But it's also difficult to find your own people to do the setup. Sure, lots of people can do it, but because it's new they have little experience and end up not knowing much, unable to answer questions, and building garbage on the cloud platform. There's no cloud seniority, because it hasn't been around for long enough.

- Security

Cloud providers have or can get access and sometimes use it.

- Management

I've seen too many last minute "we made a change and now everything will be broken unless you immediately do something" notifications to be happy about.

- Cost

It's ridiculously expensive above a certain scale, and that scale is not very big. I don't know if it's because people overbuild, or because you're being nickel-and-dimed, or if you're just paying so many times above normal for enterprise hardware and redundancy. It's still expensive.

Yes, owning (and licensing) your own is expensive too.

For smaller projects and tiny companies, totally fine! It's even great!

- Maturity

People can't manage cloud tools properly. This doesn't help with costs above.

PS: I don't think any other cloud service is better.


Not exactly on-premise, but we rent two big dedicated servers (OVH) and install VMware ESXi on them. Going to the cloud would cost more, the price would be unpredictable, and it would only solve a scale problem we won't have. And customers love to know their data is hosted in France by a French company, not by Google or Amazon :)


Network storage. MacBooks don't have a lot of space and artists like to make HUGE psd files.

Bonus points for small stuff like RADIUS for wifi. Vendors charging $5/user for junk like that gets absolutely awful with a high number of staff.

With a staff of 100, a single box with a bunch of hard drives is two months worth of cloud and SaaS.

TCO needs to come down by like at least 100x before I consider going server-less.


We own infra because we need to own and control it and also because it's just vastly cheaper at the scale we use.

Besides, we do have things like our own S3, k8s and other cloud-ish utilities running so we do not miss out that much, I guess.


This conversation sounds a lot like the debate around globalization and outsourcing manufacturing to me. It might be a stretch but there is something here.

There is room for both Cloud and On-Prem to exist. This endless drive by industry to push everyone to cloud infrastructure and SaaS, in my humble opinion, will look exactly like the whole supply chain coming from the east during a pandemic.

The economics of it look great in a lot of use cases, but putting our whole company at the mercy of a few providers sounds terrible to me. Even more so when I see posts on HN about folks getting locked out of their accounts with little notice.

It does not take much to bring our modern cloud to a grinding halt. For example, a mistake by a mostly unheard-of ISP led to a massive outage not less than a year ago (1).

It was amazing to see the interconnections turn into cascading issues. 1 ISP goofs, 1-2 major providers have issues, and the trickle-down effect was such that even services that thought they were immune from cloud issues realized they rely on a 3rd party that relies on a different 3rd party that uses Cloudflare or AWS.

So, even though I think the cloud is (usually) secure, stable, resilient, etc... I still advocate for its use in moderation and for 2 main use cases.

1 - elastic demands. Those non-critical systems that add some value or make work easier. Things we could do without for several days and not hurt the business much.

2 - DR / Backup / redundancy. We have a robust 2 data center / DR fail over system. Adding cloud components to that seems reasonable to me.

(1)https://slate.com/technology/2019/06/verizon-dqe-outage-inte...

Edit: Spelling and clarity

Edit2: New reasons to stay on prem are happening all the time. https://www.bleepingcomputer.com/news/security/microsofts-gi...


I worked for a famous large tech company that makes both hardware and software - stuff that runs in customer datacenters.

There are plenty of companies that run their infrastructure to keep their data secure and accessible.

It's not the type of companies that blog about their infra or are popular on HN. Banks and financial institutions, telcos, airlines, energy production, civil infrastructure.

Critical infrastructure needs to survive events larger than a datacenter outage. FAANGs don't protect customers from legal threats, large political changes, terrorist attacks, war.


One reason is legacy systems. Some companies are too tied to old custom-made systems, built on old software and hardware, that are virtually impossible to convert to cloud. You'd be surprised, but there are big corporations still using AS400 and not planning to switch anytime soon. As you may have heard in recent news, the US unemployment system is still built on COBOL... in 2020...

Another reason is cost. I love AWS! It's fantastic to be able to create and launch servers, or a whole farm of servers, in a matter of minutes! And the ability to convert physical servers to virtual and upload them to the cloud is breathtaking! But my monthly bill started at $300 and grew to $18K per month in less than 3 years - and that was for just a few virtual servers with Windows OS and SQL. My company realized that we can have a state-of-the-art datacenter with VMware and SAN on premises for a fraction of that price. Put a second one on the other coast (one of our other offices) and you have your own cloud with better ping and six-figure savings a year.

Lastly, I would name vendor lock-in. With vSphere it's very easy to move your virtual servers between AWS, Azure and Google (assuming you can afford all 3, plus the licensing cost of VMware), but have you ever tried to "download" your server back on-premise? It's virtually impossible, or made so hard by the cloud players trying to keep you up there in their clouds.

With all that said, I read that Netflix (I believe it's Netflix) saves hundreds of millions of dollars per year by using Amazon instead of its own servers. I also read somewhere that Dropbox moved away from AWS...


The way I think about it is this: not using the cloud is like building your own code editor or IDE after assembling your own laptops and desktops. It may make you happier and it’s a great hobby, but if you’re trying to run a business you need to do a cost-benefit analysis.

We currently have double-digit petabytes of data stored in our own data centres, but we’re moving it to S3 because we have far better things for our engineers to do than replace drives all day, and engineering plus hardware is more expensive than S3 Deep Archive - but that only became true when Deep Archive came out.
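
For a sense of scale: Deep Archive lists at about $0.00099/GB-month, so the storage-only bill for the low end of "double-digit petabytes" pencils out like this (retrieval and request fees excluded, and customers at this scale negotiate below list):

    pb = 10
    gb = pb * 1_000_000
    print(f"~${gb * 0.00099:,.0f}/mo")   # ~$9,900/mo at list price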

We put out hundreds of petabytes of bandwidth, and AWS is horribly expensive at first glance, but if you’re at that scale you negotiate private pricing that brings it within spitting distance of using a colo or Linode/Hetzner/OVH - the distance is small enough that the advantages of AWS outweigh it, and it allows us to run our business at known and predictable margins.

Besides variability (most of our servers are shut down nights, weekends and when not required), opex vs capex, and spikes in load (100X to 1000X baseline when tickets open), there’s also the advantage of not needing ops engineers and being able to handle infrastructure with code. If you have a lot of ops people and don’t need any of the advantages, and you have lots of money lying around that you can use on capex, and you have a predictable load pattern, and you’ve done a clear cost-benefit analysis to determine that building your own is cheaper, you should totally do that. Doesn’t matter what others are doing.


You're not building everything from scratch on-prem. There are excellent tools for deploying and managing infrastructure - Terraform, Ansible, Puppet - that, funny enough, are used to deploy to the cloud as well. Add a self-hosted Kubernetes cluster to that and your on-prem is not that different from a cloud.

As for ops people - you might not need an engineer to replace failed hard drives, but you'll need a DevOps person to manage CloudFormation templates and such, and they cost more.


maybe it doesn't need an SV-salary engineer to swap hard drives. Just pick (hehe) three good ones of the AMZ-warehouse people, train them, and you'll have very happy employees who are around full-time and paid well (1/3 of your engineer's salary). They'll enjoy working predictable shifts in a nice office, and in time you can even train them to do 2nd-level support.


We're looking at moving from AWS to having some machine space rented in two data centers. The reason is purely cost.

There are still some computers on site due to equipment being tied to it, telephony stuff, etc.

My last company was looking at "moving to the cloud", with the idea that its data centers were too expensive, but found out that the cloud solutions would be even more expensive, despite possible discounts due to their size. They still invested in it because some Australian customers wanted data to be located there.


A job or two ago:

Everything on-site for a couple reasons (50 servers). Mainly because we're a manufacturing company and machines on the shop floor need to talk to the servers. This brings up issues of security (do you really want to put a 15-year-old CNC machine 'on the internet'?). Also, if our internet connection has issues, we still need to build parts.

The other big part of it is the mindset of management and the existing system, which was built to run locally. Does Amazon offer cloud-hosted Access and Visual Basic workers?


I haven't personally done detailed cost analysis lately, but if you have systems that regularly operate at 80+% of capacity, I can't see how the operating costs of any cloud operator can be cheaper than operating it yourself. Their whole pricing model is based on being able to over-provision compute and move workloads around within a datacenter. As much as people talk about failing hard drives and other components at scale, failure rates are still low enough that you could operate dozens of systems at full capacity for 3+ years with maybe a handful of legit hardware failures. To rent that same compute from any cloud provider would cost significantly more. The cheapest "Dedicated Host" on AWS will cost you almost $12k over 3 years if you pay for it on-demand, and it's equivalent in specs to something you can buy for ~$2k.
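
Spelling out that Dedicated Host example (pure hardware comparison; power, space and admin time would narrow the gap somewhat):

    rent_3yr, buy = 12_000, 2_000
    print(f"rented: ${rent_3yr / 36:,.0f}/mo")   # ~$333/mo on-demand
    print(f"owned:  ${buy / 36:,.0f}/mo")        # ~$56/mo over the same 3 years
    print(f"premium: {rent_3yr / buy:.0f}x")     # 6x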

> am I missing something?

I'd want more background behind what you mean by "at least for business". What kind of business? Obviously IaaS providers like Digital Ocean and Linode are the type of business that would not use other clouds. Dropbox and Backblaze as well would probably never use something like S3. And there are legitimate use cases outside of tech where specific teams need low-latency compute, or where it's otherwise cost- and time-prohibitive to shuttle terabytes of data to the cloud and back (3D rendering, TV news rooms, etc). If you're talking about general business systems that can be represented by a website or app with a CRUD API, then most of that probably doesn't require on-prem. But that's not the only reason businesses buy servers.


As most have mentioned already: cost

We started out with Emvi [1] on Kubernetes at Google Cloud, as it was the "fancy thing to use". I like Kubernetes, but we paid about 250€/month just to run some web servers and two REST APIs, which is way too much considering that we're still working on the product and pivoting right now, so we don't have a lot of traffic.

We then moved to a different cloud provider (Hetzner) and hosted Kubernetes on VMs. Our costs went down to about 50€ just because of that. And after I got tired of managing Kubernetes and all the complexity that comes along with it, we now just use docker-compose on a single (more powerful) VM, which reduced our cost even further, to about 20€/month, and _increased_ performance, as we have less networking overhead.

My recommendation is to start out as simple as possible. Probably just a single server, but keep scaling in mind while developing the system. We can still easily scale Emvi onto different hardware and move it around as we like. We still use Google Cloud for backups (together with Hetzner's own backup system).

[1] https://emvi.com/


Anything with massive storage and massive compute that doesn't need low latency is a great fit. I still host ~300TiB and ~250 cores at home because the cloud cost would be astronomical. Edit: This is for personal stuff related to internet-wide scan data and domain tracking. See https://github.com/hdm/inetdata


1. Our customers run our software on their own machines for security and data-control reasons. As soon as something's running on someone else's hardware, the data is out of your control. Unless you're going to accept the (often massive) cost of homomorphic encryption, AND have a workload amenable to that, it's a simple fact.

2. Everything we do in-house is small enough that the cost of running it on our own machines is far less than the cost of working out how to manage it on a cloud service AND dealing with the possibility of that cloud service being unavailable. Simply running a program on a hosted or local server is far, far simpler than anything I've seen in the cloud domain, and can easily achieve three nines with next to no effort.

Most things which 'really need' cloud hosting seem to be irrelevant bullshit like Facebook (who run their own infrastructure) or vendor-run workflows layered over distributed systems which don't really need a vendor to function (like GitHub/Git or GMail/email).

I'm trying to think of a counterexample which I'd actually miss if it were to collapse, but failing.


I actually am moving my startup's servers from AWS to a home server.

Reasoning:

* We know how much compute is needed.

* We know how much the new servers can compute.

* We have the ability to load balance to AWS or Digital Ocean or another service as needed.

* This move provides a 10x speed improvement to our services AND reduces costs by 70%.

For reference, had to call the ISP (AT&T) and they agreed to let me host my current service. It’s relatively low bandwidth, but has high compute requirements.


We operate an X509 PKI with a trust anchor and CA. It's not impossible to run the Hardware Security Module (HSM) in the cloud, but it's off the main path. It's more plausible to run it in a DC, but that invites security threats you don't have if you run it inside your own perimeter. Of course you also then invite other risks, but it's a balancing act, and it has to be formally declared in your Certification Practice Statement (CPS).

We also run some registry data which we consider mission critical as a repository. We could run the live state off-prem, but we'd always have to be multi-site to ensure data integrity. We're not a bank, but like a bank or a land and titles administration office, registry implies stewardship in trust. That imposes constraints on "where" and "why".

Take central registry and the HSM/related out of the equation, if I was building from scratch I'd build to pub/sub, event-sourcing, async and in-the-cloud for everything I could.

private cloud. If you don't control your own data and logic, why are you in the loop?


We are a research group at a company with 2000 or so employees. We have a few GPU machines to train models, which are utilized nearly around the clock.

AWS and co’s GPU-enabled servers are exceedingly expensive. Most of the GPU models on those machines are also very old. We pay maybe 1/3 or less to maintain these machines and train models in-house vs paying AWS.

Mind you, we use AWS for plenty of stuff...


I work in R&D for a telecomms/optics company.

All servers are on premises. Not allowed to have a laptop. No access to emails/data outside of the office. No USB drives, printing documents, etc.

Reason? Protect IP. From who? Mostly Huawei.

Good and bad: When I walk out the door... I switch off. The bad is that working from home isn't really an option. Although they have accommodated somewhat for this pandemic.


We sell hosting to a large number of different customers who, for whatever reason, mostly legal, are required to keep data within the borders of Denmark. There are no Google, Amazon, Azure or DigitalOcean data centers in Denmark, so cloud isn't an option for them.

Regarding cost, well, it depends. We try to help customers move to cloud hosting if it's cheaper for them. It almost always will be if they take advantage of the features provided by the cloud providers. If you just view, for instance, AWS as VMware in the cloud, then we can normally host the virtual machines for you cheaper and provide better service.

You have to realize that many companies aren't developing software that's ready for cloud deployment. You can move it to an EC2 instance, but that's not taking advantage of the feature set Amazon provides, it will be expensive, and support may not be what you expect. You can't just call up Amazon and demand that they fix your specific issue.


> The more I learn, the more I believe cloud is the only competitive solution today, even for sensitive industries like banking or medical.

Then learn A LOT more and start with mainframes and their reliability.


I'm one of two IT persons at a food processor, in a small town. Despite living in an area where a majority of IT & programmer folks work at the hospital or an insurance company, my boss has run our own networks and servers since the days of Novell, and we continue to run Windows servers on-premise instead of in the cloud. It does lead to interesting situations, like finding out a Dell server shrieking next to you is running max fan speed because the iDRAC is not connected.

Neither of us have any experience with the cloud, whereas we have a lot of Microsoft experience. We still rely on OEM licenses of Office, because Office 365 would be 3x or more expensive. We have a range of Office 2019, 2016, 2013 OEM, and we get audited by them nearly every year.

We use LastPass, Dropbox and Github, but only the basic features, and LastPass was an addition last year after someone got into our network through a weak username/password.

In our main location, we have three ESX boxes, running several virtual servers, and then we have a physical server for our domain controller, file sharing and DHCP, DNS in other locations. We also switched to a physical server for our new ERP application server, which hasn't yet been rolled out.

Projects like upgrading our ERP version can take months, but we have a local consulting team, with a specialist in our particular ERP solution, as well as a Server and Network specialist, and we also have a very close relationship with our ISP, who provides WAN troubleshooting.

Our IT budget is small relative to our company revenue, so most cloud proposals would raise our costs manyfold. We continue to use more services like Github and Lastpass, and we both have multiple hats.

I'm a developer, in-house app support, Email support, HR systems support, ERP support, PC setup, and I run our Data Synchronization operation while my boss runs EDI. I do a lot of PowerShell and Task Scheduler, but I've gotten familiar with bash through git bash.


Unreliable internet.

A retail company may decide that the best place to put up a new branch is coincidentally (though there might be a correlation) at the edge of what the available ISPs currently cover. They might have to make a deal to get an ISP to extend their area to where the store is going to be. However, because of lack of competing ISP options on the part of the retailer, and the lack of clients in the retailer's area on the part of the ISP, that service is probably not going to be all that reliable.

Also, that retail company may experience a big rise in sales after a natural disaster occurs, when communications (phone/cell/internet) are mostly down for the area. One tends not to think about stuff like that until it happens at least once.

It's very important for the ERP/POS systems to be as operational as possible even when the internet is down.


Here is a stupid example: Excel VLOOKUPs work on a network drive, but not on a cloud service like Dropbox or OneDrive. The absolute path can't resolve if the file is used across multiple Excel users. If the users store the file locally, each will have a different path on their computers. Excel stores actual paths. [1]

There is one way around it: Mounting the cloud server as a network drive (some providers do this by default, but OneDrive is not one of them, neither is Dropbox).

I don't know of a way of mounting OneDrive as a virtual drive; I would be interested to know.

It sounds stupid, but the above was a real life scenario.

[1] Only if the files are closed. Excel can change the path if you have the file open, but it can't change it to multiple options across different PCs. But as I have mentioned before, Excel doesn't seem to document all of its more subtle features.


PhD student/researcher. Most of the compute I do is HPC/scientific-computing style and runs on University or US National Lab machines. We thought about cloud, but the interconnect (MPI-style code) is very important for us and it's not very clear to me what's available in the cloud.


On the smaller end of the scale, I have a $12K/mo spend with Azure. I decided to go back to Coloc.

For under $50K, I have 4 machines with an aggregate 1TB RAM, 48 cores, 1 pricy GPU, 16TB of fast SSD, 40TB HDD, and InfiniBand @ 56Gb/sec. Rent on the cabinet is less than $1K/mo. It's going to cost me about $20K in labor to migrate.

So the nominal break-even point is six months but the real kicker is that this is effectively x10-30 the raw power of what I was getting on the cloud. I can offer a quantitatively different set of calculations.
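
In numbers, with the migration labor folded into the up-front cost:

    upfront = 50_000 + 20_000          # hardware + migration labor
    monthly_savings = 12_000 - 1_000   # Azure bill minus cabinet rent
    print(f"break-even: {upfront / monthly_savings:.1f} months")  # ~6.4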

It also simplifies a bunch of stuff: 1. No APIs to read blob data -- just good old files on a ZFS share. 2. No need to optimize memory consumption. 3. No need for docker/k8s/etc to spin up under load. Just have a cluster sitting there.

There are downsides but Coloc beats the cloud for certain problems.


I'm going to buck the trend and say cloud is great. We do cloud, on-prem and colo (16 racks in two different DCs).

Procurement is a nightmare, especially when your vendor is having problems with yields (thanks Intel!), and the ability to scale up and down without going through a hardware procurement process saves us millions of dollars a year.

We avoid the lock-in by running on basic services on multiple cloud providers and building on top of those agnostically.

Spend is in the millions per month between the cloud providers, but the discounts are steep. We've essentially had to build our own global CDN, and the costs are better than paying the CDN services and better than running our own hardware & staffing all those locales.

It's a no brainer. We'll continue to operate mixed infrastructure for quite some time as certain things make sense in certain places.


Self-driving company here. We're doing on-prem because we have a clear projection of the amount of data we'll need to ingest, store and process using GPUs.

The advantages of running things in a cloud are clear - and as an infrastructure team we have challenges around managing physical assets at scale - but given the cost of cloud providers, it's clear we would eventually have to pull data into a datacenter to survive.

Co-location costs are fixed, and it's actually easy to make a phenomenal deal nowadays given the pressure these companies are under.

The real trick of it all is that regardless of running on-prem or in the cloud, we need to run as if everything is cloud native. We run Kubernetes, Docker, and as much as possible automate things to the point that running one of something is the same as running a million of it.


There is not one clear-cut answer on this. It depends on what your company values, i.e. cost vs. agility. If you are using the cloud for what it was meant for - availability, scalability, elasticity - and those are the things that your org values, it's the right fit for you. If on the other hand you value cost, then it clearly isn't.

One other point I'll make: the true value of cloud isn't in IaaS; renting VMs from anyone is relatively expensive compared to the cost of buying a server and maintaining it yourself for a number of years. The true value of the cloud is when you can architect your solution to utilize the various services the cloud providers offer (RDS/DynamoDB, CDN, Lambda, API Gateway, etc.) so that you can scale quickly when you need to.
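For a sense of what that looks like, here is a minimal, hypothetical Python Lambda handler behind API Gateway's proxy integration (the function and field names are just illustrative); the point is that the scaling knob lives with the provider, not with you:

    # handler.py -- a minimal AWS Lambda handler (hypothetical example).
    # API Gateway's proxy integration delivers the HTTP request as `event`;
    # AWS runs as many concurrent copies of this as traffic demands.
    import json

    def handler(event, context):
        name = (event.get("queryStringParameters") or {}).get("name", "world")
        return {
            "statusCode": 200,
            "headers": {"Content-Type": "application/json"},
            "body": json.dumps({"message": f"hello {name}"}),
        }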


One of my favorite things (at least for personal projects) about using the cloud is so-called "platform as a service" systems like Heroku, where I don't have to get down in the weeds, I just push code and the process starts (or restarts).

Is there something like that I could use on my own hardware? I just want to do a fresh Linux install, install this one package, and start pushing code from elsewhere, no other configuration or setup necessary. If it can accept multiple repos, one server process each, all the better. I know things like Docker and Kubernetes exist but what I want is absolute minimal setup and maintenance.

Does such a thing exist?


You're looking for Dokku

Same git-push deploys, Heroku-compatible buildpacks or Dockerfiles, all on your own hardware, MIT license.
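For the curious, the workflow really is Heroku-shaped (per the Dokku docs; `yourserver` and `myapp` are placeholders): a one-time `dokku apps:create myapp` on the server and `git remote add dokku dokku@yourserver:myapp` locally, and from then on every `git push dokku main` builds and deploys.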


This looks perfect, thank you! I knew there was no way I could've been the first person to think of this


After a big cloud-first initiative, several managers left, leaving implementation to Linux sysadmins now in charge of cloud. They treated cloud as some colo facility and dumped all apps into one big project/account; cloud costs quickly spun out of control and there were lots of problems with apps not being segregated from each other. Cloud was declared 'too expensive' and 'too insecure', things migrated back on-prem, and the team now actively seeks to build and staff colo facilities with less than 10ms latency somewhere outside coastal California (Reno, Vegas, PHX), which just isn't gonna happen because physics.


International Traffic in Arms Regulations (ITAR) Compliance - much easier to keep it on site, off-site compliance is costly. Also, cost over time. Better control of performance requirements for certain applications.


We're in the process of moving a greenfield project from AWS to a more traditionally hosted solution.

AWS turned out to be 5-10 times more expensive; what's worse, our developers are spending more than half their time working around braindead AWS design decisions and bugs.

A disaster any way you look at it.

There are good reasons to choose AWS, but they're never technical. (Maybe you don't want to deal with cross-departmental communications, or you can't hire people into a sysadmin role for some reason, or maybe you want to hide hosting in operational expenses instead of capital, etc.)


For hobby projects, I own a moderately outdated 1U "pizzabox" installed in a downmarket colocation facility in a nearby major city. Considering the monthly colocation rate and the hardware cost amortized over two years (I will probably not replace it that soon but it's what I've used for planning), this works out to appreciably less than it would cost to run a similar workload on a cloud provider. It costs about the same or possibly a bit less than running the same workload on a downmarket dedi or VPS provider, but it feels more reliable (at least downtime is usually my fault and so under my control) and the specs on the machine are higher than it's easy to get from downmarket VPS operations.

Because my application involves live video transcoding I'm fairly demanding on CPU time, which is something that's hard to get (reliably) from a downmarket VPS operation (even DO or what have you) and costly from a cloud provider. On the other hand, dual 8 core Xeons don't cost very much when they're almost a decade old and they more than handle the job.

There are a few fairly reputable vendors for used servers out there, e.g. Unix Surplus, and they're probably cheaper than you think. I wouldn't trust used equipment with a business-critical workload but honestly it's more reliable than an EC2 instance in terms of lifetime-before-unscheduled-termination, and since I spend my workday doing "cloud-scale" or whatever I have minimal interest in doing it in my off-time, where I prefer to stick to an "old fashioned" approach of keeping my pets fed and groomed.

And, honestly, new equipment is probably cheaper than you think. Dealing with a Dell account rep is a monumental pain but the prices actually aren't that crazy. Last time I purchased over $100k in equipment (in a professional context, my hobbies haven't gotten that far yet) I was able to get a lot for it - and that's well less than the burdened cost of one engineer.


Bandwidth costs.

Most dedicated servers come with unmetered bandwidth so not only is it cheap to serve large files but your bandwidth costs won't suddenly explode because of heavy usage or a DDoS attack.


I agree with you OP.

Our company provides on-premise ERP systems to small (we’re talking at most 20 person companies) wholesale distributors.

Pre-COVID, I was pushing for a cloud solution to our product and pivoting our company towards that model. We were at a hybrid approach when COVID hit.

What ends up happening with an on-premise/hybrid cloud model is we end up doing a lot of the sysadmin/IT support work for our customers just to get our ERP working. This includes getting ahold of static IP addresses (and absolving responsibility), configuring servers/OSes, and several other things in the same vein that are wholly irrelevant to the actual ERP work like inventory management and accounting.

Long story short, these customers of ours end up expecting us to maintain their on-premise server without actually paying for help or being knowledgeable about how it all works. We keep pitching them the cloud but they're not willing to pay us a recurring fee, even though it actually saves the headaches of answering the question "whose responsibility is it to make sure this server keeps running?"

I think a lot of these answers here are dealing with large-scale products and services where the amount of data and the capital costs are so massive it makes sense to start hiring your own admins solely to maintain servers. For these small mom-and-pop shops who are looking for automation, the cloud is still the way to go.


Deja vu. I think you totally hit the nail on the head with that last paragraph. On-premise ERP systems probably only make real sense for (non-small) companies that wish to avoid relying on the internet (because e.g. their business strategy requires that freedom) and can hire long-term sysadmins/programmers that can provide support to those systems.


Have you had experience selling on-premise systems? I’m really curious how other companies handle the sysadmin and IT support issue.


Finance space, under 100 people. Most servers are either latency-sensitive or 24/7 fully-loaded HPC. Neither case fits the cloud model. We do use cloud for build/test, though.


Many organizations do have private clouds.

If by cloud you mean a public cloud like Google, Amazon, or Microsoft, then forget about it; not with these companies piping data directly to U.S. intelligence.


What's the difference between private cloud and on prem?

Does it just mean you let everyone spin up VMs etc rather than requiring them to go through IT?


Mostly. Cloud is just bin-packing compute and data storage. There are many use cases where it's cheaper to host the hardware your cloud runs on yourself instead of using a public cloud provider.


> What's the difference between private cloud and on prem?

Cloud and on-premises are not mutually exclusive, as many "bits I push to the cloud" end up on the premises of a ponytailed guy named John. We focus on what people mean[^1], not what they say. Some even make a living saying "chemical-free"[^2] or "digital marketing" publicly.

> Does it just mean you let everyone spin up VMs etc rather than requiring them to go through IT?

This is the on-demand, self-service aspect. What the users get as a result of their demand depends on the need you're trying to fulfill, and it defines the service model (SaaS, PaaS, IaaS, etc.). Do you want them to be able to interact with an application, deploy/run applications, or get a VM? Here's a resource that summarizes it[^3] and the Wikipedia page gives an overview[^4].

[^1]: https://en.wikipedia.org/wiki/DWIM

[^2]: https://en.wikipedia.org/wiki/Chemical_free

[^3]: https://www.redhat.com/en/topics/cloud-computing/iaas-vs-paa...

[^4]: https://en.wikipedia.org/wiki/Cloud_computing


A private cloud is potentially on prem.

But besides that, it's about how the infrastructure is managed: in the classical way, or with something that allows easy setup of new "containers" and similar, like Kubernetes. This doesn't necessarily mean VMs; it can be more lightweight containers, e.g. based on Linux namespaces. Nor does it mean you can sidestep the IT ;=) But normally it means you can much more easily spin up new services or change scaling between them.


I meant public cloud indeed.


Here is how we went about it w/ CSPs (AWS, Azure, GC, Oracle). Thoughts welcome.

Getting Started --> Definitely go w/ CSPs. No need to worry about infra.

Pre Product Market Fit + Steady Growth --> On Premise, because CSPs might be expensive until you find a consistently profitable business.

Pre Product Market Fit + HyperGrowth --> CSPs, since you won't be able to keep up [we never got to this stage]

Product Market Fit w/ Sustainable Good Margins --> CSPs, pay to remove the headache [we never got to this stage]

Side Note: w/ GPUs, CSPs rarely make sense


You've been managing servers for quite some time but you've never considered the security implications of hosting all of your sensitive data on someone else's computer? You say you're not trolling but I genuinely don't see how those two facts are compatible, except if you work in an industry where the data just doesn't matter. If that were the case though, you shouldn't feel compelled to judge what banking or medical institutions are doing.


We use a public cloud, and believe me: we do consider the security implications. We veeery much do.


Quebec power and internet pricing is really competitive. For residential service I pay $0.06/kWh + $80/mo for 1Gbps fiber with 1ms ping to 8.8.8.8 (USD).

As a result, I run a power-hungry Dell R610 with 24 cores and 48GB of RAM with 20+ services on it for many different aspects of my company. All the critical stuff runs on DigitalOcean / Vultr, but the 20+ non-critical services like demo apps, CI/CD, cron workers, archiving, etc. run for <$200/yr in my closet.
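(Back-of-envelope, assuming the R610 draws something like 200W at the wall: 200W around the clock is roughly 1,750kWh/yr, which at $0.06/kWh comes to about $105/yr in power -- consistent with the <$200/yr figure.)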


This is also why there are a lot of data centers in the Greater Montreal area, FWIW.


We are a small startup and I am personally annoyed by cloud pricing.

I have 3 smallish VMs for a build server + managed SQL. It costs $500/mo. It doesn't make sense. Having my own VMs on ESXi makes everything very different: most of the time these VMs do nothing, but you want them to be performant from time to time, and there are plenty of resources for that because all the other VMs are also mostly idle.

In the cloud they are billed as if they were 100% loaded all the time.

I am not really satisfied with the latencies and the insane price for egress traffic. I just can't do daily backups since it could cost a whopping $500/mo just for the traffic. This is just insane; I can't see how it could scale for a B2C market. For B2B it might work really well though, since revenue per customer is much higher.

We are not moving to our own DC; we just keep the realtime stuff in the cloud, and anything that is not essential is being moved somewhere else. Bonus: you need off-site backups anyway in case the cloud vendor just bans you and deletes all your data.

Startups might move fast and iterate, but if you don't have your own servers you're always reducing your usage because the bill could grow fast, effectively reducing your delivery capacity.


Background: I consult extensively in the SaaS space and ask people this question all the time in the course of strategy reassessments, transactional diligence, etc.

1. Physical control over data is still a premium for many professional investors. As a hedge fund CIO told me recently when I asked her why she was so anti-cloud migration, "I want our data to live on our servers in our building behind our security guard and I want our lawyer, not AWS's, to read the subpoena if the SEC ever comes for us."

2. There are a lot of niche ERP- and CRM-adjacent platforms out there -- e.g., medical imaging software -- where the best providers are still on-prem focused, so customers in that space are waiting for the software to catch up before they switch.

3. A lot of people still fundamentally don't trust the security of the cloud. And I'd say this distrust isn't of the tinfoil hat, "I don't believe SSL really works" variety that existed a decade ago. Instead it's, "we'd have to transition to a completely different SysAdmin environment and we'd probably fuck up an integration and inadvertently cause some kind of horrendous breach".


Quite frankly, "cloud" is a convenience and elasticity service at a steep premium, with downsides.

Contrary to popular belief, it does not in the slightest save you a sysadmin (most just end up unknowingly giving the task to their developers). And contrary to popular belief, the perf/price ratio is atrocious compared to just buying servers.

For some of the workloads I had done the math for, the yearly cost of something approximating the same performance in AWS would have covered renting a colo and buying a new beefy server every year, with money to spare...


For our case, the need is very specific as we are working with mobile apps.

Building iOS apps requires macOS, and even though there are some well-known "Mac hosting" services, none of them are actual cloud services similar to DigitalOcean, Azure, AWS, etc.

So it is much less expensive, and actually easier to scale and configure, to host the Macs on-prem.

(Off the record: If it is for internal use only, you can even stick in a few hackintoshes for high performance.)


A company which:

1. Does "security"-critical stuff (as in, a breach affects the security of people, not just data)

2. Besides certain kinds of breaches, has lowish requirements for performance and reliability (short outages of a few minutes are not a big problem; even outages of half a day or so can be coped with)

3. Slightly paranoid founders, with a good amount of mistrust of any cloud company.

4. Founders and a tech lead who have experience in some areas but thoroughly underestimate the (time) cost of managing servers themselves, and how _kinda_ hard that is to do securely by yourself (wrt. long-term DDoS and similar).

So was it a good reason? Probably not. But we still went with it.

As a side note, while we did not use the cloud, _we didn't physically manage servers either_. Instead we had dedicated hardware in a compute center in Germany which the founders did trust. So no physical management, securing, etc. needed, and some DDoS and network protection by default. Still, we probably could have had it easier without losing anything.

On the other hand, if you have dedicated server hardware in some trusted "localish" compute center, it's not _that_ bad to manage either.


> The more I learn, the more I believe cloud is the only competitive solution today, even for sensitive industries like banking or medical. I honestly fail to see any good reason not to use the cloud anymore, at least for business. Cost-wise, security-wise, whatever-wise. [emphasis mine]

Most people here seem to point out cost and utilization. I would like to offer another perspective: security.

I worked in both of these industries: finance ("banking", not crypto or PG) and medical (within a major hospital network). The security requirement, from both practical and legal perspectives, cannot be overstated. In many situations, the data cannot leave an on-prem air-gapped server network, let alone use a cloud service.

It cost us more to have on-prem servers, as we needed dedicated real estate and an engineering team to maintain them. Moreover, the initial capital expenditure is high -- designing and implementing a proper server room/data center with climate control, power wiring, and a compliant fire-extinguishing system is not trivial.


It's the people cost, not the hardware.

Hardware is super cheap:

- A 40-slot rack, with gigabit fiber, dual power, and a handful of public IP addresses, costs on average 10000€/y.

- A reconditioned server on eBay with 16 cores and 96GB of RAM costs 500€ (never seen them break in 3 years).

- A brand new Dell PowerEdge with a 32-core AMD EPYC and 64GB of RAM will cost 3000€.

- Storage is super cheap: 500GB of SSD costs 80€ (consumer stuff is super fine as long as you plan wisely between redundancy and careful load) and rotational disks are even cheaper. Never seen a rotational disk break.

Once bought, all of this is yours forever, not for a single month. You can pack very remarkable densities into a rack and have MUCH more infrastructure estate at your disposal than you would ever afford on AWS.
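(To put numbers on the figures above: amortizing that 3000€ EPYC box over three years is roughly 85€/month, plus its share of the ~830€/month rack.)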

The flip side of the coin is that you need operations expertise. If it's always you, then OK (although you won't be doing much more than babysitting the datacenter). Otherwise, if you need to hire a dedicated person, people are the most expensive resource of all, and that should definitely be added to the cost of operations.


I did a startup with a co-founder in 1998, before cloud was a thing. We hosted at above.net first, then he.net following above.net's bankruptcy. Both were very good and we never had colo-related problems with either, though he.net was significantly cheaper.

We started with 2 white-box PCs as servers, 2 mirrored RAID1 drives in each. We added a 3rd PC we built ourselves: total nightmare. The motherboard had a bug where, when using both IDE channels, it overwrote the wrong drive. We nearly lost our entire business. Putting both drives on the same IDE channel fixed it, but that's dangerous for RAID1.

A few years in, we needed an upgrade and bought 5 identical SuperMicro 2U servers with hardware RAID1 for around $10K. Those things were beasts: rock solid, fast, and gave us plenty of capacity. We split our services across machines with DNS and the 5 machines were in a local LAN to talk to each other for access to the database server. The machines' serial ports were wired together in a daisy-chain so we could get direct console access to any machine that failed, and we had watchdog cards installed on each so that if one ever locked up, it automagically rebooted itself. When I left in 2005, we were handling hundreds of requests/s, every page dynamically generated from a search engine or database.

Of course it took effort to set all this up. But the nice thing is, you control and understand _everything_. Some big company can't just do things to you while you have no idea what is happening and they're not talking. And if things do go south, you can very quickly figure out why, because you built it.

The biggest mistake we made was in the first few years, where we used those crappy white-box PCs. Sure, we saved a couple thousand dollars, but we had the money and it was a terrible trade. Night and day difference between those and real servers.


For business logic at a fairly large school, the free and open tools that make the cloud so productive get used here a lot. We get to leverage commodity hardware and network infra very effectively on-premises[1].

You have to have a good recovery plan for when equipment X's power supply fails, but when deployment is fully automated it's very easy to handle swapping bare metal, and easy to drill (practice) during off hours.

This makes it much easier to meet regulatory compliance: either statutory or regulations your org has created internally (e.g. financial controls in-org, working with vulnerable people or children, working with controlled substances, working with sensitive intellectual property.)

Simply being able to say you can pull the plug on something and do forensic analysis of the storage on a device is an important thing to say to stakeholders (carers, carers' families, pupils' parents).

I’m so grateful to be living in the modern age when “cloud” software exists[2], but I don’t have to be in the cloud to use it.

The downside: you need trained staff, and it's completely inappropriate if you need serious bandwidth or power consumption, or have to support a round-the-clock business (which we do not because, out here on the long tail, we work in a single city, so we still have things like evenings, weekends, and holidays for maintenance!)

— [1] Premise vs premises is one of those "oh isn't the English language awful" distinctions. A premise is the logical foundation for some other claim ("the premise for my laziness is that because the sky is grey it will probably rain, so I'm not going to paint the house"), whereas the premise_s_ means physical real estate ("this is my freshly painted house: I welcome you onto the premises").

[2] Ansible, Ubiquiti, ARM SBCs like the Raspberry Pi, Docker, LXC, IPv6 (making global routing far more tractable; IPv4 for the public and as an endpoint to get on the VPN).


Cost.

We recently moved one rack to a different DC in the same city and used DigitalOcean droplets to avoid downtime. Services running on Linux were migrated without high availability (e.g. no pgsql replication, no redis cluster, a single elasticsearch node...) and we just turned off Windows VMs completely due to licensing issues and no need to have them running at night.

The price of this setup was almost 4x higher than what we pay for colo. Our servers are usually <5 years old Supermicro, we run on OpenStack and Ceph (S3, rbd), and we provide VPNaaS to our clients.

AWS/GCP/Azure was out of the question due to cost. We considered moving the Windows servers to Azure, with the same result: the cost of running Windows Server (AD, Terminal, Tomcat) + MS SQL was many times higher than the price of colo per month. It is bizarre that for the Azure expenses you could buy the server running those VMs approximately every 3 months (Xeon Platinum, 512GB RAM).


I have a friend who is an IT manager for a large chain hotel in the Caribbean.

I keep asking him about why they still use on premises equipment and it boils down to:

* Cost for training / transitioning + sunk cost fallacy

* Perceived security risk (right or wrong)

* IT is mostly invisible and currently "works" with the current arrangement, so why change?


There is no cloud solution that lets you ship and forget a system.

Come back in 15 years? It still works. Is that possible with cloud even over short periods, like 2 years? No.

Will it ever be possible? No.

That's the primary reason for me. I can use cloud only for stuff that's nice but not mandatory for the service to work, like a status page.

Plus, the work is more enjoyable than using somebody else's stuff.


We're (Kentik) a SaaS company and one key to having a great public margin is buying and hosting servers. In our case, we use Equinix in the US and EU to host, with Juniper gear for routing/switching, and the customary transit, peering at the edge.

One secondary factor is that we've only monotonically increased, and it's way cheaper to keep 10%-15% overprovisioned than to be on burst price with 50%+ constant load.

But the simplest math is: we have > 100 storage servers that are 2U, 26 x 2TB flash, 256GB RAM, 36 cores. They cost $18k once, which we finance at pretty low interest over 36 months (and they really last longer than that). Factor in $200-400/mo to host each, depending (I think it's more like $200, but it doesn't matter for the cloud math).

That same server would be many thousands of dollars per month on any cloud we've seen. Probably $4-6k/mo, depending on the type of EBS-ish storage attached. Or with the dedicated-server 'alternative' they are moving to offer (and Oracle sorta launched with).
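(Worked out from the figures above: $18k financed over 36 months is roughly $500/mo, plus $200-400/mo to host, so call it $700-900/mo all-in -- versus the $4-6k/mo cloud figure, about a 5-8x difference.)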

It'd be cheaper but still > 2x as expensive on Packet, IBM dedicated, OVH, Hetzner, Leaseweb (OVH and Hetzner the cheapest probably).

Three other factors for us:

1) Bandwidth would be outrageous on cloud but probably not as outrageously high as just the servers, given that our outbound is just our SaaS portal/API usage

2) We'd still need a cabinet with router/switch infra to peer with a couple dozen customers that run networks (other SaaS/digital natives and SPs that want to send infrastructure telemetry via direct network interconnect).

3) We've had 5-6 ops folks for 3 of the 6 years, 3-4 for the couple years before that. As we go forward, as we double we'll probably +1. It is my belief that we'd need more people in ops, or at least eng+ops mix, if we used public cloud. But in any case, the amount of time we spend adding and debugging our infra is really really really small, and the benefit of knowing how servers and switching stuff fails is huge to debugging (or, not having to debug).

All that said - we do run private SaaS clusters, and 100% of them are on bare metal, even though we could run on cloud. Once we do the TCO, no one yet has wanted to go cloud for an always-on data-intensive footprint like ours.

Good luck with your journey, whichever way you go!

And happy to discuss more, email in profile


Cloud falls over if you have a lot of data egress. I don't work on those types of workloads (mainly in PCI and a little bit of HIPAA) so I stick to cloud for the sheer convenience factor (it's easier to fix a security group than having to drive to the office and plug in somewhere like I had to earlier today). Dealing with hardware has become a very specific skill set. I have it, but I don't enjoy it, so I don't look for it.

I still have to build physical networks occasionally (ex: we are building a small manufacturing facility in a very specific niche that's required to be onsite for compliance reasons) but the scale is so small that I can get away with a lot of open source components (pfsense netgates are great) and not have to use things that are obnoxious to deal with (if I never have to deploy cisco anything ever again I won't be upset).


I think when the economy is hit hard, a lot of companies are going to have to look at what they can do to make sure they remain profitable, since investor appetite has changed. That means some companies will have to look at what is being spent on cloud providers. Renting RAM by the hour makes sense if you are optimistic about future revenue growth, but if the market changes and you have to worry about sustaining profits while just keeping your current customer base, then it makes a lot less sense. The cloud vs. on-prem argument also really includes all the things in between (colo, managed servers, VPSes), plus better tooling to manage your own VM and container clusters on these, which I think will now get increased attention and competition as people consider alternatives to the big cloud providers in order to bring down costs.


I picked up a few refurbished Dell servers off eBay for super cheap a while back, and usually use these with Cloudflare Argo tunnels to host various services. However, since these are just sitting on the floor next to my desk, I usually rely on cloud for any applications with high uptime requirements.

Recently though, I've been working on some distributed-systems projects which would allow these servers to be put in different physical locations (and power grids) and still continue to operate as a cohesive whole. This type of technology definitely increases my confidence in their ability to reliably host services. I wouldn't want to be reliant on the cloud for large-scale services though; from my understanding you can get some crazy cost savings by colocating physical servers (especially for large data storage requirements).


AWS is one of the most expensive options, and far from perfect; I can't understand why people consider it the default choice... To compare: a dedicated server on Hetzner (Core i7, 32GB RAM, 20TB of network traffic) is cheaper than a medium VM on AWS. If the product is growing, cloud costs can quickly become the biggest expense for the company. Then it makes sense to spend some time making things run in a more cost-effective way.

I think if you choose cloud hosting that costs about the same as renting a dedicated server plus setting up virtualization yourself, then it's a fair choice (you can check on https://www.vpsbenchmarks.com/ or similar).

Another sweet configuration is dedicated servers with Kubernetes: good user experience for developers, easy to set up and maintain, easy to scale up/down.


My employer is a mid-sized university. Cost is the main issue.

Our environment is a mixture of in-house developed apps and COTS. Until recently, our major COTS vendors didn't have cloud solutions. Now they have cloud solutions, but they're far too costly for us to afford. So we need to keep them in-house and continue to employ the staff to support them.

Our in-house apps integrate the COTS systems. Our newer apps are mostly in the cloud. But the older ones are in technologies that need to stay where the database server is, which is in our server room for the reason stated in the last paragraph. Rewriting the apps isn't on our radar due to new work coming in.

Historically, outsource vs. in-source seems to ebb and flow. The clear path is usually muddied when new technologies come out to reduce cost on one side or the other.


My SO’s brother works in the studio video recording industry, and is a very IT-savvy guy. We had a long discussion last holiday season about the state of cloud adoption in that industry. He told me (this is obviously secondhand) that most of the movie industry is not only off the cloud, but exclusively working in the realm of colocating humans and data (footage).

This is for many reasons. The one that comes back to me now is that the file sizes are HUGE, because resolution is very high, so bandwidth is a major concern. Editors and colorists need rapid feedback on their work, which demands beefy workstations connected directly with high bandwidth connections to the source files. Doing something like this over a long distance network (even if the storage was free) would be prohibitively expensive, and sometimes literally impossible.

So the workloads are basically the antithesis of what cloud is optimized for: "random sequential reads of typically short length, big append-only writes". The big production houses (LucasArts famously) are also incredibly secretive about their source material, and like to use physical access as a proxy for digital access.

It leads to some seemingly strange (to me, a cloud SWE guy) decisions. He pretty much exclusively purchases top-of-the-line equipment (hard drives/SSDs), and keeps minimal if any backups for most projects because there simply isn't any room. It's a recipe for disastrous data loss, and apparently it's something that happens quite often to this day. It's just prohibitively expensive to do version control for movie development.

I don’t know to what extent cloud technologies can solve for this domain. I asked him if Netflix was innovating in this area, since they’re so famously invested in AWS, but he said that they mostly contracted out the production stuff, and only managed the distribution, which makes sense. The contractors don’t touch the cloud at all, for the most part.

Again most of this is secondhand, I’d be curious to hear more details or reports from other people in the movie industry.


We moved all AWS servers to a colo. Saving 80% of the cost.


There are very few niches where the cloud makes sense: Namely, where you are either too small to benefit from a single server and a single entry-level IT guy (think three or four person companies with low need for technical competency), or where you are expecting rapid growth and can't really rationally size out your own hardware for the job (in this case, the cloud is useful initially, leave it later once your scale is more stable).

In every other case, you are paying for the same hardware you could buy yourself, plus the cloud provider's IT staff, plus your own IT staff (which you likely need anyway to figure out how to deal with the cloud provider), and then the cloud provider's profit margin, which is sizeable.


Surprised no one mentioned a third option: cheap VPS providers like DigitalOcean or Vultr. They've become real contenders to the big clouds recently, providing managed databases, storage, and k8s. And their bandwidth costs are close to what you'd get in a colo.


My company cannot run our infrastructure in the cloud because we do performance/availability monitoring from ~700 PoPs around the world.

Not running our infrastructure in the cloud is part of our value proposition.

Our customers depend on us to detect and alert them when their services go down. We _have_ to be operational when the cloud providers are not, otherwise we aren’t providing our customer with a valuable service.

Another reason we don’t run in the cloud is because we store a substantial amount of data that is ever increasing. It’s cheaper to run our own SAN in a data center than to store and query it in the cloud.

The final reason is that our workloads aren't elastic. Our CPUs are never idle. In that type of use case, it's cheaper to own the hardware.


Many of the customers of my previous employer had hardware on premises, including the customer that I was handling.

It had both compute and storage (NetApp), across two twin sites in two different datacenters. The infra in each site consisted basically of six compute servers (24c/48t, 128GB RAM) and NetApp storage (two NetApp heads per site + disk shelves).

Such hardware has paid for itself across its seven or eight years of life, and having one of the sites right in the building meant negligible latency.

The workload was mostly fixed, and the user base was relatively small (~1000 concurrent users, using the services almost 24/7).

It really checks all of the boxes: it does all it is supposed to do, and cheaply.


What do you mean by "servers"? Anything that isn't a client machine, or just customer-facing infra?

We have a pool of 15 build servers for our CI. They run at basically 100% CPU during office hours and transfer terabytes of data every day. They have no real requirements for backup, reliability, etc., but they need to be fast. If I run a pricing calculator for hosting those in the cloud, the number is ridiculous. We are moving source and CI to the cloud, but we'll probably keep all the build machines on-prem for the foreseeable future.

For customer facing servers the calculation is completely different. More traffic means more business. Reliability, Scalability and backup is important and so on.


I would like to add that even for small teams/projects, cost is the reason. We have a small business project with only one server, a few hundred customers, and varying traffic (database sync between thick clients, a web portal, ...). We were considering cloud, but with the features we needed (a few different databases, a few APIs, ...) it would be circa $1000/month for reasonable response times (it could be cheaper, but terribly slow). With our own on-premise server, the price was recouped after just a few months, and then there's just the minimal cost of connectivity, energy, and occasional maintenance. It just didn't make any sense for us to choose cloud.


I worked in Switzerland, and a reason to use on-premise here is security.

Many retail banks, asset management companies, and high-security companies refuse to use any public cloud.

They want to have a strict and traceable list of people who have physical access to their hardware.

This is in order to control any risk of a data leak [1].

In practice they generally use on-premise installations. They rent space in a computer center and own a private cage there, monitored with multiple cameras. Meaning they know exactly who touches their hardware and can enforce security clearances for them.

[1]: https://en.m.wikipedia.org/wiki/Swiss_Leaks


My last three employments have had me and my team build three different platforms with three different “providers”. Many lessons learned!

Chronological order:

1. E-commerce, low volume (1000-5000RPM), very high value conversions, highly localized trade.

We built an on-prem stack using HashiCorp tools here. This place had on-prem stuff already in place, the usual vendor-driven crap: expensive hypervisor, expensive SPOF-y SAN, unreliable network. Anyway, my platform team (4-5 guys) built a silo on commodity hardware to run the new version of this site. This is a few years back, but the power you get from cheap hardware these days is astounding. With 6 basic servers, in two DCs, stuffed with off-the-shelf SSDs, we could run the site and the dev teams no problem. Much less downtime compared to the expensive hyperconverged blade crap we started on, at basically no cost. There's a simplicity that wins out using actual network cables and 1U boxes... LXC is awesome btw! Using "legacy" VMware, EMC, HP etc. for non-essential on-prem? Cloud is tempting!

2. Very high volume (billions of requests per day), global network. AWS. Team tasked with improving on-demand scalability. We implemented Kubernetes on AWS and it really showed what it's about! After 6-7 months of struggle with k8s < 1.12, things turned around when it hit 1.12-1.13-ish and we got it to act how we wanted. Sort of, at least. Cloud is just a no-brainer for this type of round-the-clock, elastic workload. You'd need many millions up front to even begin building something matching "the cloud" here. Lots of work spent tweaking cost though. At this scale, cloud cost management is what you do.

3. Upstart dev shop. No RPM outside dev (yet). Azure. About 30 devs building cool stuff. Azure sucks as IaaS; they want you to PaaS, that's for sure. The cloud decision had been made already when I joined. Do you need cloud for this? No. Are there benefits? Some. Do they outweigh the cost? Hardly. In the end it will depend on how and where your product drives revenue. We pay for a small local dev datacenter quarterly, which I find annoying.

Just some quick thoughts off the top of my head (on the phone so excuse everything).

Happy to discuss further!


Our software runs in core (mobile) network systems, but that's at our customers. We ourselves have a rack in our office that runs things like Git repos, project management, virtual machines for development / testing, build servers, and instances for trainings.

We're concerned about corporate espionage and infiltration, so we can't trust our servers being out of our sight. Most people don't have the code on their physical machines either; I'm a cocky breath of fresh air in that regard, in that I prefer my stuff to run locally instead of on the (slow, underpowered) VMs. I trust Apple's encryption a lot.


We have mostly switched to, or are in the midst of switching to, the cloud.

Services that we will continue to run on-premises (as an exception to that rule) are some machine learning training clusters (where we need a constant, high-level amount of GPU and cloud provider pricing for GPU machines is very far off the mark of what you can build/run yourself) and some file shares for our production facilities where very large raster files are created, manipulated, sent to production equipment, and then shortly afterwards deleted.

Most everything else is going to the cloud (including most of our ML deployed model use cases).


This assumption doesn't consider threat models where the vendor could be part of your problem. E.g. if you're based in country A and work on super-secret new tech for an emerging industry, then hosting in country B may not be an option.

Imagine a company in Europe that decides to host its files on Alibaba Cloud in the US.

Imagine the US Department of State hosting its files with Google.

Imagine an energy company working on new reactor tech, ...

Imagine a Certificate Authority which has an offline store of root certificates which need to come online to sync twice a day.

Imagine cases where you need a hardware HSM.

Then there is also cost, as others have pointed out. The AWS cost structure is so complex that whole business models[1] have sprung up to help you optimize the price tags or reduce the risk of huge bills. That's right: you need a commercial agreement with another partner that has nothing to do with your cloud, just to work around aggressive pricing. The guy who started this ~2 years ago has grown to 40+ people (organically), is based in Switzerland, and is still hiring even in this recession. It should give you an idea of how broken the cloud is.

Lastly there is also the lock-in. All the hours you have to sit down and learn how AWS IAM works are wasted once you decide to move to another cloud. The cost of learning the third-party API is incurred by you, not the cloud vendor. For people who think lock-in isn't much of a problem: remember your whole hiring strategy will be aligned to whatever cloud vendor you're using (look at job descriptions that already filter based on AWS or GCP experience). Lock-in is so bad that for a business it is close to the rule of real estate (location, location, location), only it's to the advantage of the cloud vendor, not you as the customer.

[1] optimyze.cloud

[2] "I have just tried to pull the official EC2 price list via JSON API, and the JSON file is 1.3gb" https://twitter.com/halvarflake/status/1258161778770542594


If you have a lot of data but not a lot of users, it's prohibitively expensive to pay monthly hosting and network egress fees when you can buy cheap hard drives, use ZFS, and whatever server protocol you desire.


I notice all the banks I'm working for are moving to the cloud. A few years ago they all had their own data centers, sometimes really big, well-designed custom data centers. But they're all moving to the cloud now.

I've personally been wondering whether that's wise, because financial data and the handling of many banking processes are a bank's core business. It makes sense that a bank should be in control of that. And it needs to obey tons of strict banking data regulations. But apparently modern cloud services are able to provide all of that.


Cloud costs way, way more than an on-prem solution for my company.

We need random access to about 50TB of files, and quite a decent number of VMs.

For storage, on-prem vs cloud: buying was cheaper after 3(!) months.

For VMs (some of which could be containerized): 1 year.

It was cheaper to buy a decent second-hand server, slap in SSDs, and just install a decent hypervisor. Those costs also include the server room, power usage, admins, etc.

We do use cloud backups for the most important stuff.

Cloud is cheaper if your business is something user-based, as in you might need to scale it, hard.

If you aren't doing anything like that it is absurdly expensive.


Can you recommend a dedicated hosting provider that:

1. Has US, EU, and Asia regions.

2. Lets you rent servers per month.

3. Has decent pricing. Not premium, but it doesn't have to be low cost. I expect excellent egress pricing and 1/2-1/4 the cost for CPU compared to the cloud.

4. Has a reliable network. GCP's premium network is very good. How do dedicated providers and VPS providers (Linode and DO) compare?

5. Has easy-to-use management and dashboards. I've experienced really bad dashboards and hard-to-use Java tools for installing and managing dedicated servers.


- Cost, as well evidenced in other comments here. The hyperscalers are orders of magnitude more expensive than dedicated hosting or colocation providers.

- Lock-in: all the hyperscalers want to sell you value-add services that make it hard or impossible to move away.

- Concentration risk: hyperscale providers are a well-understood target for malign actors. It's true they are better protected than most.

- Complexity: if you think about how little time the hyperscalers have been operating in comparison with corporate IT, they have created huge technical debt in the race to match features.


Security (we're a bank)


How is the cloud less secure than your on-prem servers? I would argue that it's easier to keep track of all the threats with the tools available from big cloud providers.


Normally it's not. But if certain things go wrong, you might have much less control to fix things.

Like, consider how many cases of "help, big company randomly blocked our account" popped up on Hacker News and Twitter in recent years.

Also consider news like: "Amazon spies on sellers to get a competitive advantage when selling their own products".

Or, e.g., companies massively changing their service terms or APIs on short notice, deprecating old ones, etc. This might be fine for customers whose products are always changing anyway. But for banks this isn't attractive at all; they only want changes if those are really needed.


Forget the bank for a moment, and imagine if you were in direct competition with Amazon and concerned about insider risks.

Now imagine you're aware that Amazon's retail arm is quite happy to launch competing products based on sales data from marketplace sellers, so it's not like they have strict firewalls protecting competitors who pay Amazon for services.

Now imagine Amazon offers you a bunch of security options that sound suspiciously like snake-oil. Like an encrypted storage service where they store the encryption key and transparently decrypt files when you request them.

Now imagine you know people with hypervisor access can do what they like to your VM and leave no evidence whatsoever. And your on-prem experience is that there will definitely be some people with hypervisor access.

Now imagine you've been through various audit and certification processes yourself, and you don't find yourself reassured by the knowledge Amazon has too.

Now imagine if you were a boss at AWS and you found an insider had violated policies and accessed things they shouldn't have. Would it be better for your career to publicise that widely, or reprimand the guy then hush things up?

Now imagine you've been an ambitious cowboy at certain times in your career, and although you've never accessed a competitor's data to give an edge to something you were working on, you can see how it would be a temptation.

I can understand how a person who believed all that would believe putting their business in their competitors hands would be a naive decision.


> I can understand how a person who believed all that would believe putting their business in their competitors hands would be a naive decision.

That sounds like it's hard to believe all this. But some things you state actually have some proof that they do happen...


The cloud provider has physical access to the machines. They may have software access to the machines.

I'm sure they're contracted not to do anything bad, but there is risk of an insider incident at the cloud provider leading to an incident the customer can't prevent.


Encryption?


Doesn't help if the provider snapshots the ram containing the keys.


The server is in someone else's building. They have HIPAA-compliant stuff for healthcare, but at some point there is just a risk you have to take: the servers and data are not physically under your control, and you have to trust the providers.


The big question is - how much do you trust these providers.


I would guess that approximately zero banks own data centers that are operated by their own employees. There might be a few exceptions to this, but the reality is that most banks don't view technology as part of their core business. So this largely gets outsourced to IT consulting firms like Infosys, IBM, Wipro, etc.

The big question is - how much do you trust these providers, and do you think they are more competent at security than Amazon/Google/Microsoft?


Or, formulated differently:

1. Trust a provider whose whole existence relies on trust, and which you can audit or at least cross-check the audit and security processes of (as a bank you are normally not a small customer).

2. Trust a provider which might, first, be a potential competitor in some business fields; second, is so big that it can easily afford to lose you; and third, for the same reason, doesn't allow you any insight into its internal processes; etc.

Plus, many of the banks having their own hardware also have their own IT team. So it's often about trusting your own people. I mean, either you keep your IT, or you outsource _and_ go into the cloud. Keeping local servers while outsourcing IT at the same time seems kinda not very clever tbh.


Operations is one thing, but security policies setting is another. The cloud provider can for example actively lie to you about having some security measures implemented, while it's not going to happen with your Wipro-like consulting firm, as these people just don't call the shots around the bank and are there typically only to execute procedures written on the bank side.


Systems in the cloud are always going to be less secure than local systems. With cloud you have the same issues with in-house access and data leaking, because you have the same people with access to the same data, PLUS with cloud you also have additional issues with the various third party/parties (over whom you have little control).


Isn't that a 2010s issue? We have both challenger and old-school banks running anywhere from 100% in cloud (Monzo: Kubernetes, microservices, linkerd) to hybrid workloads.


But there are literally hundreds (or more) of banks around the world that use AWS for compute. They meet all the security requirements you could want (27001, SOC 2, and many others) around how they secure and limit access to servers by staff, and have a variety of WORM audit tools and a very comprehensive IAM to help you show these requirements are being met.


I don't know about the parent commenter's workplace, but I don't think any non-US bank should use AWS (or GCP or Azure).

It's way too risky that changes in local and/or US regulations will require you to move your whole infrastructure to a new provider in, like, a month or two (which is completely impossible).

Btw, the same applies to any bank relying on the infrastructure of a foreign company.

Like e.g. just think about the whole legal mess around CLOUD act.


I'm European and was surprised to hear that legal issues are indeed a part of the decision, not just technical details.

AFAIK data that passes through the US may be legally intercepted by the US authorities.


I think this is changing. Rapidly. A few years ago no one batted an eye at a financial institution using AWS but now banks seem eager to keep as much data as possible in-house. I’ve recently had two prospective clients (I now work in compliance SaaS) ask whether I use a cloud service provider as part of their vendor due diligence.

Many banks seem to be aiming for the minimum viable external data processing.


We do a hybrid approach which I think makes sense for a lot of companies. Our mission critical stuff runs in the cloud, but anything that has to do with staging environments and development we do on-premises. It's pretty easy to host yourself, if you have a couple of decent engineers looking after it (depending on scope, of course!).

Redundant power, redundant internet connections, and a few racks of Dell servers and gigabit switches. Why did I mention Dell? They just don't seem to die. We used HP for a few years but had a few bad experiences.


Cost is the sole reason for us as well. We have ~600-700 dedicated servers around the globe, and generate a lot of egress traffic (~20PB/mo). We last ran the figures a year or so ago, and it'd cost us around 13-15x in network costs alone.

A common thread of a lot of the replies to this post is network traffic costs. If one of the cloud providers can figure out a way to dramatically (and I mean at least 10x) reduce their network transfer pricing, then I think we'll see a second wave of companies adopting their services.


It is so, so, so much cheaper. We moved to AWS and tried setting up our servers with specs comparable to what our physical servers had. We just about died after the first month: our bill for one month was higher than multiple years of running our physical servers. We had to dial them way back to get the run rate down to a reasonable number. Now we have serious buyer's remorse. Performance is terrible, and the cost is still more per month than we ever had with our physical servers, by a large amount.


We do not use the cloud. We operate (24/7) facilities in remote locations where we do not have super reliable internet connection (we do have redundant links including three different fibres on distinct paths plus radio links but still). For this reason alone nothing critical can be in the cloud. In our experience, however, cloud offerings are not that cheap compared to purchasing and operating your own machines. Besides, one still needs sysadmins even when operating infrastructure in the cloud.


On-prem can make sense where your computing needs are constant and predictable. You can arrange to have exactly what you need, and the total cost of buying and operating it may be less than it would be to get comparable resources in the cloud.

If your computing needs vary over time, then provisioning on-prem for peak load will mean that some of your resources will be idle at non-peak times. It may be cheaper to use cloud resources in cases like these, since you only need to pay for the extra capacity when you need it.


We are a software development company. Most of our compute/storage needs are for build/test cycles. We recently bought ~$100K worth of additional hardware to migrate some of that work off AWS. The storage/virtualization is done using Ceph and OpenNebula. Including colocation/electricity/networking costs, the investment will pay for itself in ~9 months. If I include deployment costs and the work to migrate the jobs off AWS, it will pay for itself in 11 months.


PII and draconian security policies. We are not a tech company, so we can't fine-tune or have nuanced policies; we just have to build a wall around everything. In our web password recovery process, we can't tell people if their login was correct or not, because that might help a brute-force attacker infer they got that part right. Even though we have it rate-limited anyway. I don't know why we can't just tell people the login was found or not found like banks etc. do.


What industry? It's good practice not to share information like that in any context: attackers that have a bank of email addresses and are trying to figure out which of a few reused passwords might work on a given website would have a harder time if they don't even know whether the email has an account with that website, or if an alternate email is used instead, etc.
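A minimal sketch of that enumeration-resistant pattern (Flask; the endpoint name and the lookup/mailer helpers are hypothetical stand-ins): the response is identical whether or not the account exists.

    # Same response either way, so the endpoint leaks nothing
    # about which emails are registered.
    from flask import Flask, request, jsonify

    app = Flask(__name__)

    def lookup_user(email):
        return None  # hypothetical stand-in for a real user lookup

    def send_reset_email(user):
        pass         # hypothetical stand-in for a real mailer

    @app.route("/password-reset", methods=["POST"])
    def password_reset():
        user = lookup_user(request.form.get("email", ""))
        if user:
            send_reset_email(user)
        # Identical message and status code in both branches.
        return jsonify(message="If that address has an account, "
                               "a reset link is on its way."), 200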


Not a company but a personal project:

https://github.com/krisk84/retinanet_for_redaction_with_deep...

I haven't analyzed the TCO yet but the bandwidth costs alone from hosting my curated model of 100GB in the cloud (Azure blob) have greatly exceeded my total operation spend from downloading the entire dataset and running training. By an order of magnitude.


Because Elon doesn't believe in the cloud. We were one of the few teams that got AWS access, and we were told not to rely on it too much because it's temporary...


Just talking about my side project here: a local server + ngrok is easier to use and cheaper than anything in the cloud.

In general, I would say any noncritical system I would host on-prem.


Cost. We're a small company, a five-person team, and we need a development environment. All our stuff is built around VMs and Docker, so scores of little test nodes get run, and a beta environment in the cloud was costly (>$300/mo base, sometimes 3x that). For $1000 we put a box in the office that we all VPN to, which runs all the necessary VMs and Docker containers. The variable, sometimes expensive cost of test/beta in the cloud was replaced with a low fixed cost.


The smartest approach is to be able to run anywhere, which is increasingly practical due to VMs, Docker etc.

(At least funded) startups should start with the cloud as speed to completion is key, but can later optimize for cost.

Elasticity of the cloud is also great, dealing with peak demands dynamically without having to purchase hardware.

I'd suggest larger companies use at least two cloud vendors to add resilience (when MS Teams went down, so did Slack; I was told they both use MSFT's cloud).


I'm the CTO of a small company making payroll software. We don't have on-premise servers, but we are currently moving away from Azure to renting VPSs from a local provider.

The cost benefits are huge, and since our app is mostly a normal web app we don't need that many fancy cloud things. And I don't see us needing them in the future.

I really dont understand why a company doing similar things would wanna go the could route. Its so damn expensive and its not always easy to use and setup.


Sure, for tiny Intel servers it makes sense to rent VMs. It won't start to hurt until 1.5 years later, at which point the project needs to become profitable anyway.

I run a couple of on-premise Xeon Gold machines with 96 GB RAM and 40+ cores each. Their total purchase cost was one month's cost of renting them in the cloud. Also, you will never get the full benefit of the servers you use unless they are dedicated instances with no virtualization layer.


It depends on what you have to do. If your stack is a series of microservices/monoliths connected to the typical OLTP DB, then you might very well sit entirely in the cloud. Things change when you need heavy lifting, like big Elasticsearch or ClickHouse clusters, or any other operation that requires heavy CPU and high RAM capacity. In that case, using providers like Hetzner can cost you 1/10 of the bill compared to AWS.


Manufacturing plants across the world controlling production lines: no way to go in the cloud for that. We put the reporting data in the cloud, no problem there.


I work for a US Stock Exchange, and some of the technologies that we rely on are not permitted in the cloud. The performance metrics we need are usually only achievable on highly tuned bare-metal deployments, so cloud is usually not an option. I guess it really depends on your workload, but I think there is a very healthy amount of production being deployed and worked on in businesses' own datacenters/private clouds.


"Server" is such a broad term. In our case, aside from cost as others already mentioned, the distance and latency is very important. The servers must be located as close to client devices as possible and reasonable, and they are synced with clients using PTP to microseconds (depends on the actual distance and topology). Cloud is a no go, and we a using bare metal K8S for graceful handling of failures and updates.


Most of my equipment has physical interfaces, video and audio in and out.

Some equipment is very latency sensitive -- I'm talking microseconds, not milliseconds.

More generic tasks need easy access to that specialist equipment (much of which doesn't quite grasp the concept of security)

Given that we therefore have to run a secure network across hundreds of sites on multiple continents, adding a couple of machines running xen adds very little to the overhead.


Amazon/Azure/GCP - they're _businesses_. They charge you 50-70 points of margin in order to run your computers. If you're R&D-limited, that's not important to you, but if you're a more mature company that's cost-limited then it matters a lot. If it's a core part of your business, it'll never be cheaper to pay another company to profit from you. Period.


Well, for one thing: jobs. It takes a lot of IT manpower to manage an on-premise solution, especially when you run everything on-premise. Just imagine if the CTO were to switch the company to cloud-based solutions: it would save the company millions of dollars, but it would also mean cutting a lot of jobs. Gov departments that use on-premise do so for security reasons and to maintain existing jobs.


We are going to the cloud, but you have to be careful. With on-prem the limit of the cost is the server. Someone writes inefficient code and it just doesn’t work. In the cloud there are 1000 ways to overspend and the vendors purposefully don’t make it easy to track or keep things under control.

It’s kind of like outsourcing. If you don’t know what you are doing, cost goes up and quality goes down.


Yes, some of the leadership thinks that they can build a better cloud than MS or AWS. It is pretty hilarious to watch how spectacularly they fail.

https://forrestbrazeal.com/2020/01/05/code-wise-cloud-foolis...


Cloud is only good if you don't care about costs and plan to scale without looking back.

For example, building Ceph-based software-defined storage with croit.io for S3 comes in at 1/10 to 1/5 of the AWS price in TCO. The same goes for any other product in the cloud.

If you only need the resources for a short time, up to 6 months, go to the cloud. If you plan to have them longer than 6 months, go to colocation.


Cloud will continue to dominate, even if it's more expensive. Why? Because the best companies are able to focus on what makes them special, and outsource the rest.

Cost and security are important, but they may not be most important. In a business, the scarcest commodity is FOCUS. By outsourcing anything that isn't core to your product, you can excel at what differentiates you.


On top of what others have said, for companies outside the US the CLOUD Act has been a big one at most previous companies I worked for.

Using AWS / Azure / Google Cloud (even using datacenters in your own country) implies that the US government can access your data at will.

As soon as you handle sensitive information, especially related to non-US governments, this becomes a blocking factor.


In some industries, cloud is not an option. For example, certain privacy laws like HIPAA preclude uploading data to third parties, in which case you need things to be on-prem. There are also a lot of places in the world where internet access is limited. Sometimes you need to solve problems beyond the simple "web SaaS in cloud" use case.


Our own servers are cheaper than the cloud. We ran the numbers two years ago: cloud was up to 4x the cost for HPC. The best cost-wise option is used servers.


It all depends. Outside of SaaS, if you have a mature data center operating model and truly understand your costs, there won't be a strong cost-savings story for many types of workflows.

If you suck and don’t understand costs, or don’t automate, or spend a lot of time eating steak with your Cisco team, you’ll save money... at first.


Legacy seems to be missing from the comments. Before the advent of Cloud IaaS (Infrastructure as a Service) a very large ecosystem of On-Premise hardware and software flourished. The question needs to be considered in the context of greenfield vs brownfield systems as the trade-offs involved differ drastically.


Cloud servers tend to hide their limitations behind payment tiers, which makes it hard to really know how far you can push things. Also there are various terms, conditions, cache rules, and change management strategies that are hidden when dealing with someone else's constantly changing box of magic.


We install into remote locations -- you can't access the cloud if your connectivity is down, so local resources are a hard must.

Though we have adopted something close to an "Edge Computing" solution... I guess it comes down to "Why not both?" :)

I think it also depends on your definition of "server".


Just want to say I love threads like these. I hope one day where I work we can have two or three obscenely beefy servers and be done with it. I'm planning for something similar, probably Q2 2021, as our expenses grow too large on a managed hosting platform like Heroku/Render/Aptible.


There's a false belief in my company that all the data must be on premise because of privacy concerns.

We've never used cloud services and we do not want to.

Some are saying it's a matter of costs, but you know what? For a dual-node server (hot standby) we were quoted €120K, plus €50K just in configuration fees.


There are some things you just can't do in a VM.

The company I work for actually develops and hosts an AWS clone for the Linux Foundation, but with very specific requirements. They have special needs that require bare-metal machines and "real" networking between them, across 6+ NICs per server.


Not many services are available in gov cloud regions, so we're stuck with on-prem for nearly everything.


zSeries machines are a huge deal to move to the cloud; it's just not worth the risk for most organizations.


My 2 cents: the 'big' clouds (AWS, GCP, Azure) and the big-brand clouds (Oracle, IBM, etc.) are attractive for BigCustomerCo because:

1. Replace capital expenditure of in-house infrastructure + staff with OpEx that can be dialled down

2. Get to benefit from the economies of scale that the cloud vendors get (those Intel CPUs get a lot cheaper when purchased in the 1000s)

3. Get to leverage big shiny new tech like Big Data and AI that's 'advertised' as 'out-of-the-box'

My only real concern is that the big cloud players are all fighting for dominance and market share. What happens in 5-10 years' time when they start raising prices? It's a different kind of lock-in: customers won't have the expertise in-house to migrate stuff back.


Yes. Cost of GPUs, if you want them available regularly and reliably and you don't need 2000 of them. We run small/mid-sized operations on demand; the latency of spinning up instances is not competitive, and the cost of keeping them on standby is outrageous.


There is extra complexity in managing cloud-based solutions: logging in, setting up SSH keys. OK, it's all automatable, but if I want a basic server set up quickly for a simple task, it's often a lot easier to run it on an internal server.


I think it depends on what the alternative is and what hardware you are buying. If you go for top-of-the-line HPE or Dell or Cisco hyperconverged stuff that allows you to be, sort of, your own cloud, you will end up with the same large bill.


If your product doesn't run connected to the internet, it is difficult to make the case for cloud development. You need developers who understand hardware, and the abstraction layer the cloud provides is a handicap instead of an enabler.


Hybrid cloud is the moneymaker solution, but there are no out-of-the-box solutions for it.


I work for a German company. They run their own DC (unfortunately, I'm not privy to precise numbers I could share, but we must be in the 1000s of hardware servers).

Why? Because our (German) clients don't trust US cloud providers.


Gigantic Elasticsearch cluster -- according to Elastic, the largest in Europe (as of 2 years ago) -- used in production. It broke again and again on AWS; we needed more shards than AWS supports. So we moved back to bare metal.


When cloud computing started, alternative software deployment was complicated; after a decade, much has improved. So ease of management is also one of the factors. Not everyone needs state-of-the-art data and AI.


We use on-prem GPU nodes for training deep learning models. Our group estimated the cost vs. cloud and it was significantly cheaper to go on-prem. I can't speak to security-wise and whatever-wise though :)


We colocate and as a small tech business, I can't imagine doing anything else. We don't spend more on payroll due to colocating and AWS/etc would easily double our annual non-payroll expenses.


One good reason for your list: diversification. You don't want all the banking systems of a country running on AWS. It's an unacceptable single-point-of-failure risk for a country.


Like many said, cost.

But also: legalities. Most cloud providers have very unclear rules about what exactly happens should you be in breach. For this reason, our business prefers to keep most of the control.


How do you manage installation and upgrades of dedicated servers? What do you use for block storage and object storage? Kubernetes and Ceph seem to be hard to set up and maintain.


To control the costs: after the original purchase, the MRC is a fixed cost for hosting and for network access. Also, for total control -- the network, IO and CPU performance don't change randomly. With better predictability, our IT team can give more precise SLAs.

We're not 100% on-prem, but AWS, GCloud and Azure are the worst examples of 3rd-party hosting -- unpredictable and with complicated billing. We're considering alternatives to the big 3 for "cloud hosting".


There are several levels, not just on-prem vs cloud. You can for example co-locate, rent dedicated servers, rent a single VPS, or put your website up on a web hosting provider...


I work on a small tech team in a very large financial institution. I would give an arm and a leg to not have to deal with the soul-sucking bureaucracy that is our IT department.


This thread has been illuminating, a lot more non-cloud people than I thought! Drinking the cloud kool-aid had me thinking cloud was the only realistic way to go.


My client's customers are geolocated (continental US) and their personal data is sensitive. So their server is in their own firewalled server closet.


Client data confidentiality. I know it's a weak argument, but if the contract requires that we store data in-house, there's no choice.


It's not secure if you have air-gap requirements, or if you have issues with employees of another company technically having access to all of your data.


I normally host on prem and also rent dedicated servers elsewhere as standby. Way cheaper than the cloud and full control the way I want.


We run one of the largest Hadoop clusters on the planet. On-prem is very cost-efficient if you're running jobs flat out 24/7.


Confidentiality. Protecting sensitive data (preventing a hostile party from obtaining or even modifying it) seems impossible in the cloud.


TCO.


Which, all things considered, seems lower on the cloud. Could you give more details on this answer?


AWS EC2 m5d.2xlarge (8 vCPU / 32 GB / 300GB nVME SSD) for a 3 year term is $4,466.

That's enough for 3 HP ML10s equivalently configured plus the power to run them for their entire lifetime, which is well over 3 years.

You are paying through the nose for the fancy-pants deployment and administrative features that the cloud gives you, and for a lot of use cases, that isn't needed.
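
The arithmetic behind that comparison, spelled out in Python (the server price and power draw below are my assumptions; only the EC2 figure comes from the pricing above):

    # 3-year cost: one reserved m5d.2xlarge vs. three small tower servers.
    ec2_3yr = 4_466                          # m5d.2xlarge, 3-year term (above)
    server_price = 1_000                     # per ML10-class box (assumed)
    watts, kwh_price = 90, 0.12              # draw and electricity rate (assumed)

    power_3yr = watts / 1000 * 24 * 365 * 3 * kwh_price   # ~$284 per box
    onprem_3yr = 3 * (server_price + power_3yr)           # ~$3,850 for all three
    print(onprem_3yr, "vs", ec2_3yr)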


You're forgetting the human part in this cost. Installing, securing and maintaining the machines costs as well.


I'm not; for non-production compute workloads, the cost is negligible.

Even for production workloads, the 4x premium over base hardware cost is absurd.


Totally disagree with that. Humans are expensive. Cabling, configuring, training, ordering and replacing hardware -- that all takes time. Over a period of 3 years, there's a high chance you'll end up with, depending on your luck, at least a few dozen days eaten up by those tasks. Assuming you're not paying peanuts for your contractors, this will easily add up to $10k or more.


If you disagree, give us your example TCO estimates for all of that vs cloud. I'm sure HN readers would be happy to dissect it.


Unfortunately I don't have exact numbers, having never been responsible for the budget of those. This is only an estimation considering the pricing I see from the cloud providers, and the rates I know from my fellow contractors.

Besides, I'm not exactly trying to win a war here. I'm simply trying to learn under which circumstances it is better to go for on-prem vs. public cloud.


I too think there is no one winner here. There are some cases where going with IaaS makes a ton of sense. Off the top of my mind:

1. Scale up/down -- you can turn servers off during the off-season or at night and spin them up only when needed.

2. Your business is not big enough to hire a person for Ops.

3. You don't have the space for a server room / are unwilling to take on the burden of dealing with a colocation provider, and are willing to pay through the nose for that choice.

4. Your devs are okay with doing Ops (managing server resources on AWS is still Ops -- you still need a sense of firewalls, DNS, backups, updates, change management, etc.).

5. Ownership vs. operation: a business is not sure it will sustain that level of resource needs long enough to justify buying hardware. (Buying reserved instances on Amazon comes to mind as an opportunity that would have been better served by colocated servers.)

6. Geography -- at times you need geographical spread (a CDN, or multiple office/customer locations).

IMO, unless you can find one such justification for the extra costs, you should consider self-hosted or at least colocated servers. At least this is what I tell my friends in the business who seek advice on infrastructure.


If you already have an IT team which handles 100+ Linux & Windows desktop systems, 40+ laptops, the Wi-Fi, network switches and a firewall for all of this, as well as a few conference rooms and a few servers for local storage/backups(1) and similar -- then adding a small number of servers doesn't change the IT workload much. ;=)

(1): Available even if the internet is down.

----

Also, most scientific computing tasks are frequently _much_ cheaper on-premise (if you have space for it) -- especially given that some require setups with multiple terabytes of RAM.


Your admins need to know and maintain the systems anyway, no matter whether they're in AWS or on premise. Sure, with AWS they might have a bit less work, but that would not necessarily reduce costs: a smallish company still needs the same number of admins to cover shifts and replacements, so a switch to cloud would not let you cut any employees.


> You're forgetting the human part in this cost. Installing, securing and maintaining the machines costs as well.

I'm not sure where to even start with this. It's not even wrong...

For hosted and colo servers, you can have "remote hands" unbox, rack and cable your servers. My colo facility, he.net in Calif., does that for free, including IPMI configuration.

AWS does not secure your servers. They have a "shared security responsibility" cutout in all their contractual language. So they have private networks, but you're responsible for iptables, application security, etc.

Their AMIs aren't even encrypted, ffs. That's right: 99% of AWS customers don't do disk encryption, even if they're required to for compliance reasons, because you have to do an extra step yourself to encrypt with your own KMS key. (GCP does encrypt.)
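
That extra step is essentially a one-call AMI copy with encryption turned on -- roughly this, using boto3 (the IDs and names are placeholders):

    # Re-copy an unencrypted AMI so its backing snapshots are encrypted
    # under your own KMS key. IDs below are placeholders.
    import boto3

    ec2 = boto3.client("ec2", region_name="us-east-1")
    resp = ec2.copy_image(
        Name="my-app-encrypted",
        SourceImageId="ami-0123456789abcdef0",  # the unencrypted source AMI
        SourceRegion="us-east-1",
        Encrypted=True,                         # encrypt the backing snapshots
        KmsKeyId="alias/my-key",                # your CMK (placeholder)
    )
    print(resp["ImageId"])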


A 2U ASUS with 2 Xeon Silvers, 64 GB RAM and 4x 2080 Ti is maybe US$15k?

We'll use it for as long as it's producing useful output. Let's say 5 years? Probably a little longer?

A 60-bay 4U Western Digital with 14 TB drives is under US$50k?

We definitely got a Dell MD1280 + 1U server with 70x 10 TB two years ago for under US$50k. Fully populated it the following year.

A 2U Dell with maybe 20 cores and 128 GB RAM each should cost less than US$10k.

And we just got 4 or 5 Dell switches with 48x 10G ports for $50-70k? I'm not sure.

What's the equivalent in the cloud?


Are you keeping tabs on the cost of the building and associated security? Video cameras, locks... Also the manpower needed to install and maintain that equipment.

Similarly, your 70x 10 TB tells us nothing. What's the redundancy on that, how much of that space did you lose to it, and where do you store your backups, and at what cost?

As for networking, having those switches is nice, but you still need your internet connection if you're serving anything online from this.


The manpower to run that stuff is already doing desktop support.

Storage is usually a stripe of 10 disks in raidz2 + 2 spares. Then we do a nightly zfs send.

The internet connection for the hosting is 100 Mbps. The users share 1 Gbps of internet.

We pay only for hardware, rent and utilities, and the internet connections. Everything else we can do ourselves.

I mean, seriously: what's the cost of running 100 x 4 x 2080 Ti in the cloud for a year? Or storing 500 TB (1 instance only, no redundancy)?
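
To partially answer the redundancy question upthread, here's roughly what that layout yields (Python; how the 70 drives are grouped is my assumption):

    # Usable space for 70x 10 TB drives arranged as 10-disk raidz2 vdevs
    # (8 data + 2 parity) with 2 hot spares per group -- grouping assumed.
    disks, disk_tb = 70, 10
    group = 10 + 2                     # raidz2 vdev + spares
    vdevs = disks // group             # 5 full groups, 10 drives left over
    usable_tb = vdevs * 8 * disk_tb    # parity and spares excluded
    print(usable_tb)                   # 400 TB, before zfs metadata overhead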


So service contracts on any of those?


What service contracts? Only hardware warranties and extended hardware warranties.



Dropbox has outgrown AWS. I’m not sure it’s a good example.


Do you have a source on this?

By any reasonable metric, onprem is much much cheaper than the cloud.


We are staying on-premise for security and privacy reasons.

We can't store our data outside of the company (or even worse: outside of the EU)


If your customers still want the system to run air-gapped from the internet, cloud is basically off the table.


I work at a data science company with 30+ engineers. We spent $80k on GKE last month alone.


Yes -- we just moved from 4 servers on AWS costing $700/month to a single dedicated one that costs $40.


Just out of curiosity, what downtime is in the acceptable range for that single server?


At night, no one cares.


Cloud is even more muddled as a term in this space than usual (Hybrid/private added to the mix).


At least a few years ago, tons of companies were afraid of the 'cloud'.

That is changing now.


Big public clouds have a genuine purpose but there is a shit-ton of marketing and FUD being thrown around about them -- I'd bet my hat that's what this post is, given the phrasing up top.

I'm not a fan. In short:

Cost. CapEx and depreciation vs. OpEx. The numbers look amazing for ~3 years, until the credits and discounts wear off. Then it's just high OpEx costs forever. Meanwhile I can depreciate my $10k server over time and get some cash back in taxes; plus it's paid for after a couple of years -- $0 OpEx outside of licenses, and CentOS has no license cost.
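
The depreciation point in rough numbers (Python; the straight-line schedule and tax rate are assumptions, not tax advice):

    # Cash recovered via depreciation on a $10k server, straight-line.
    capex, years, tax_rate = 10_000, 5, 0.21   # schedule and rate assumed
    annual_deduction = capex / years           # $2,000/yr expense
    print(annual_deduction * tax_rate)         # ~$420/yr back, for 5 years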

Once you have a significant presence in someone's cloud, they're not going to just lower costs either -- they've got you now. What in American Capitalism circa 2020 makes you think they won't find a way to nickel-and-dime you to death?

It's not going to reduce headcount, either. Instead of 14 devops/sysadmins, now I have 14 cloud admins, sitting pretty with their Azure or GCP certs. Automation is what's going to reduce those headcounts and costs, and Ansible+Jenkins+Kubernetes works just fine with VMware, Docker, and Cisco on-prem.

Trust. Google Cloud just had a 12-hour outage -- I first read about it here on HN. AWS and Azure have had plenty of outages too... usually they're just not as open about them as Google is. You also have to trust that they won't get back-doored like NordVPN's providers did, and that they're not secretly MITM'ing everything or duplicating your data. We (and some of our clients) compete with some of the cloud providers and their subsidiaries, and we know for a fact that they will investigate and siphon any data that could give them an advantage.

Purpose. We just don't need hyper-scalable architecture. We've got a (mostly) fixed number of users in a fixed number of locations, with needs that are fairly easy to estimate / build for. Outside of a handful of sales & financial processing purposes, we will never scale up or down in any dramatic fashion. And for the one-off cases, we can either make it work with VMware, or outsource it to the software provider's SaaS cloud offering.

If we were doing e-commerce -- absolutely. Some sort of Android app? Sure, AWS or Azure would be great. But it's a lot of risk and cost with no benefit for the Enterprise orgs that can afford their own stuff.


Bandwidth is the problem. You can't run a YouTube on any cloud provider.


Security. And independence.


I work for AWS, so (in the most pedantic way possible) technically yes


Yes, but we work with financial information, so we have very weird requirements.


We use servers on the cloud (IaaS) but still self-host all our apps.


Local caching recursive DNS servers work best close to the clients.


In physics, a gallon of water weighs 8.34 lbs. (For this analogy, a gallon of water is a unit of work.) And the gallon of water weighs 8.34 lbs regardless of whether it is sitting on my desk in a physical building, or on your desk, in the cloud. Same weight, same unit of work, same effort. For a brand-new, greenfield application, the cloud is a no-brainer. I agree 100%. But for legacy applications, and there are so, soooo many, the cloud is just someone else's computer. Yes, the cloud is more scalable, yes, the cloud is more manageable, and yes, you can control the CPU/storage/memory/network in much finer amounts. But legacy applications are very complicated. They have long tails, interconnections to other applications that cannot immediately be migrated to the cloud. I have migrated clients off of the cloud, back to on-premise or to local (co-lo) hosting, because without rewriting the legacy application, the cloud costs are simply too great.

The essence of IT is to apply technology to solve a business problem. Otherwise, why would the business spend the money? The IT solution might be crazy/stupid/complex, but if it works, many businesses simply adopt it and move on. Now, move that crazy/stupid/complex process to the cloud and, surprise, it is very, very expensive. So, yes, the cloud is better, but only for some things. And until legacy applications are rewritten, on-premise will exist.

One final insight: the cloud costs more. It has been engineered to be so, both from a profitability standpoint (Amazon is a for-profit company) and because the cloud has decomposed the infrastructure of IT into functional subcomponents, each of which costs money. When I was younger, the challenge for IT was explaining to management the ROI of new servers, expanded networking, additional technology. We never quite got it right and often had it completely wrong. That was because we lacked the ability to account for, track, and manage the actual costs of an on-premise operation. Accounting had one view, operations had another view, and management had no idea, really, why they were spending millions a year and could not get their business goals accomplished. The cloud changed all of that. You can do almost anything in the cloud, for a price. And I will humbly submit that the cost of the cloud -- minus the aforementioned profit -- is what on-premise organizations should have been spending all along. Anyone reading this who has spent time in a legacy environment knows that it is basically a futile exercise of keeping the plates spinning. On-premise failed because it could not get management to understand the value of in-house IT.

As I said, the costs are the same. A gallon of water weighs what it weighs regardless of location. It will be interesting to see; I predict the pendulum will swing back.


Because we don't trust either amazon or google.


Yes -- online.net and similar companies save costs.


Cost-wise is still a pretty compelling argument.


Annual costs, backups, security and latency.


on-premises not on-premise


We are a 1000-2000 person company and we have probably on the order of $100M of servers and data centers and whatnot, and I think we spend about 2/3rds of that every year on power/maintenance/rent/upgrades/etc.

We don't generally trust cloud providers to meet our requirements for:

* uptime (network and machine - both because we are good at reliability [and we're willing to spend extra on it] and because we have lots of fancy redundant infrastructure that we can't rely on from cloud companies)

* latency (this is a big one)

* security, to some degree

* if something crazy is happening, that's when we need hardware, and that's when hardware is hard to get. Consider how Azure was running out of space during the last few months. It would have cost us an insane amount of money if we couldn't grow our data centers during Corona! We probably have at least 20-30% free hot capacity in our datacenters, so we can grow quickly.

We also have a number of machines with specs that would be hard to get e.g. on AWS.

We have some machines on external cloud services, but probably less than 1% of our deployed boxes.

We move a lot of bandwidth internally (tens of terabytes a day at least, hundreds some days), and I'm not sure we could do that cheaply on AWS (maybe you could).

We do use <insert big cloud provider> for backup, but that's the only thing we've thought it was economical to really use them for.


What kind of business is that? That level of spending seems insane based on headcount. At least without knowing anything about the business. Are you running public facing services or something? What does that look like as a percentage of revenue?


Infrastructure. Datacenter costs are a relatively high percentage of revenue. No public facing services - only large clients.


It sounds like you are operating a cloud.....


Sure... but at least they're not using the cloud! Just a whole bunch of servers!


There is no cloud, it's just someone else's computer :)


Working at a medium-sized bank the data center costs were very significant.


How many employees do medium sized banks have?


This one had 600 on-site employees serving 12 million customers. Not sure how many were off-site.


Hundreds of terabytes a day is really not that much; it depends on what latency you can accept. I often run computations over datasets that are petabytes in size, just for my own needs. A big data move would be at least tens of petabytes, or more like hundreds or thousands.

Also surprised about latency -- latency from what to what? Big cloud providers have excellent globally spanning networks. Long-distance networking is crazy expensive, though, compared to the peanuts it costs to transfer data within a data center.

Reliability -- again, not sure I buy it. Reliability is "solved" at low levels (such as data storage); most failures occur directly at service levels, regardless of whether you have the service in-house or in the cloud.

The rest of your points make sense.


> Hundreds of terabytes a day is really not that much

How much would it cost to move this across boxes in EC2? I actually don't know -- that's not a rhetorical question. A lot of our servers have 10-40 Gbit links that we saturate for minutes/hours at a time, which I suspect would be expensive without the kind of topology optimization we do in our datacenters.

> Also surprised about latency

We've spent a surprising amount of money reducing latency :) We're not a high frequency trading firm or anything, but an extra 1ms (say) between datacenters is generally bad for us and measurably reduces performance of some systems.

> Reliability is "solved" at low levels

To whatever extent this may be true, it's certainly not true for cloud providers. One obvious example is that EC2 has "scheduled maintenance events" where they force you to reboot your box. This would cost us a lot of money (mostly in dev time, to work around it).

Also, multi-second network dropouts in big cloud datacenters are not uncommon (in my limited experience), but that would be really bad for us. We have millisecond-scale failover with 2x or 3x redundancy on important systems.


> How much would it cost to move this across boxes in EC2? I actually don't know, that's not a rhetorical question.

Data transfer between instances in the same AZ is free. If the data crosses AZs, you're charged $0.01 per GB in both directions. This is for instances on a VPC; I think the pricing model is different for classic EC2.

There are some exceptions like all traffic between EC2 and ALBs being free.

Edit: Pricing is described at https://aws.amazon.com/ec2/pricing/on-demand/#Data_Transfer
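
At the volumes mentioned upthread, that $0.01/GB is not small. A rough sketch (Python; the daily volume and the all-cross-AZ worst case are assumptions):

    # Worst case: all inter-instance traffic crosses AZs, charged on both
    # sides ($0.01/GB out + $0.01/GB in). Volume is an assumption.
    tb_per_day = 100
    per_gb = 0.01 + 0.01
    monthly = tb_per_day * 1_000 * per_gb * 30
    print(f"${monthly:,.0f}/month")    # ~$60,000/month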


And if you go for resilience you will do a lot of that cross-AZ talking, unless you also design your apps to know the infrastructure and take it into account when communicating (most companies don't).


> A lot of our servers have 10-40gbit links that we saturate for minutes/hours at a time, which I suspect would be expensive without the kind of topology optimization we do in our datacenters.

I think everyone does something of this sort nowadays, that's why networking is ~free within data centers :)

> but an extra 1ms (say) between datacenters is generally bad for us and measurably reduces performance of some systems

That speaks to me. You will always be just the n-th client unless you own the cross-datacenter data links (i.e., have full autonomy in deciding the priority of the traffic). It's similar to the COVID provisioning problems you mentioned.

> One obvious example is that EC2 has "scheduled maintenance events" where they force you to reboot your box.

Yeah, like others pointed out -- that's just what "cloud" is, and it's generally a good idea. You're supposed to handle a certain % of your machines going dark without warning, without violating any SLO (or, even worse, a certain % of your machines "pretending" they're up but actually being ridiculously slow for one reason or another; and don't even get me started on CPU/RAM bit flips).

It sounds to me that you run an extremely highly sensitive service, something for which paying for true ownership of the hardware just makes sense to remove those kinds of risks that most services don't care about. At the end of the day "cloud" is a shared resource, and no resource separation efforts will be 100% effective.


> How much would it cost to move this across boxes in EC2?

Nothing. You generally only pay for data going out of cloud providers. Not data going in or data being transferred within the same region.

> One obvious example is that EC2 has "scheduled maintenance events" where they force you to reboot your box. This would cost us a lot of money (mostly in dev time, to work around it).

You're not going to have a successful cloud experience unless you build your applications in a cloud suitable way. This means not all legacy applications are a good fit for the public cloud. Most companies really embracing the cloud are mitigating those risks by distributing workloads across multiple instances so you don't care if any one needs to be restarted, especially within a planned window.

> Also, multi-second network dropouts in big cloud datacenters are not uncommon (in my limited experience), but that would be really bad for us. We have millisecond-scale failover with 2x or 3x redundancy on important systems.

Are these inter-region network dropouts or between the internet and the cloud data center? You're not going to be relying on a public internet connection to the cloud for critical workloads.

All that being said, there are plenty of workloads which I don't think fit well in the cloud operating model. You may very well have one of them.


> unless you build your applications in a cloud suitable way

Right, we have not done this. We basically decided it was cheaper to keep doing the “old school” thing and not spend a bunch of dev time trying to do it in a way that supports arbitrary failure of N boxes. We just spent the money to make it unlikely our boxes or networks will fail, and if they do fail we may be sad (but it rarely happens, and has not yet happened in a catastrophic way).

> Are these inter-region network dropouts or between the internet and the cloud data center?

I’ve seen multi-second dropouts within a DC (cloud, not my current company), and multi-hour single-path failures between DCs (usually something like a fire or construction cutting a line). But all of our DCs have at least 2 physically independent routes to the internet, so it’s never taken us fully offline.


You pay for cross-AZ traffic in AWS, and that adds up really fast.


Yep. Got bitten HARD by this recently: $1.5k of inter-AZ transfer charges that we never saw coming.

Our fault, I suppose -- but multi-AZ is prohibitively expensive if you need to run anything data-heavy distributed.


I'm working on reducing a $50K per month bill for Inter-AZ traffic at the moment.

> but multi-az is prohibitively expensive if you need to run anything data heavy distributed.

If you communicate between your AZs via ALBs, multi-AZ is effectively free. Our bill is so high because, within our Kubernetes cluster, our mesh isn't locality-aware; it randomly routes to any available pod, so 2/3rds of our traffic crosses AZs.
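
The fix amounts to preferring same-zone endpoints before falling back. Conceptually something like this (Python sketch; not any real mesh's API):

    # Zone-aware endpoint selection: route to a pod in the caller's zone
    # when one exists; otherwise fall back to the full set. Illustrative only.
    import random

    def pick_endpoint(endpoints, my_zone):
        # endpoints: (address, zone) pairs, e.g. derived from EndpointSlices
        local = [ep for ep in endpoints if ep[1] == my_zone]
        return random.choice(local or endpoints)

    eps = [("10.0.1.5", "us-east-1a"), ("10.0.2.7", "us-east-1b")]
    print(pick_endpoint(eps, "us-east-1a"))   # stays in-zone when possible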


Yes. You've got to be aware of where those boundaries are when adopting the cloud, and the cost information around these cases is inadequate at best. Too many people get surprise bills.


The first step is to stop blaming the victim.

It's nobody's fault that the billing structure at Amazon is so complicated and confusing.

Except Amazon's.


> Nothing. You generally only pay for data going out of cloud providers. Not data going in or data being transferred within the same region.

This is not true: AWS charges for all cross-AZ traffic, so if you want resilience within a region (which you mentioned in the next paragraph) you will do a lot of that talking, and it ends up a non-trivial cost. You can do some optimization by making your apps aware of the AZs they run in, but most places don't do that; more common are places that run everything from a single AZ.


> One obvious example is that EC2 has "scheduled maintenance events" where they force you to reboot your box. This would cost us a lot of money (mostly in dev time, to work around it).

It's interesting that you say this, because I think this just boils down to how you treat failures. Boxes will fail. It's inevitable, even if you run your own hardware.

We treat failure as an every-day event and have set up our systems so it doesn't matter. One box fails, an automated system notices and brings up another to replace it, while the remaining boxes take slightly more load for a few minutes.

Sure, that kind of failure tolerance doesn't come for free. But, again, you're going to have failures, no matter how much work you put into reliability (which also doesn't come for free!).
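
In other words, a reconcile loop. A toy version of the idea (Python; every function here is a stand-in for real infrastructure calls):

    # Keep a desired number of healthy boxes; replace any that fail.
    import time

    DESIRED = 10

    def healthy_boxes():
        return ["i-1", "i-2"]           # stand-in for a real health check

    def launch_box():
        print("launching replacement")  # stand-in for real provisioning

    while True:
        for _ in range(DESIRED - len(healthy_boxes())):
            launch_box()                # surviving boxes absorb load meanwhile
        time.sleep(30)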


Seems you're trying to bring your on-premise structure and concepts to the cloud; that won't work. EC2 instances are cattle, not pets.

I believe I saw a slide saying that the average lifespan of an EC2 instance at Netflix is 14 minutes.

I'm not necessarily saying cloud will work for your setup, but you can't compare like that.


It seems to me like they're not trying to bring on-prem to the cloud, and that's very much the point.


Most businesses are not Netflix, where most of the work is done by their OpenConnect CDN appliances serving video content outside of cloud providers.


Computation over petabytes of data sounds fairly expensive; jobs that I was running over close to a PB could cost hundreds of dollars. Am I misremembering, doing it wrong, or underestimating your team's cloud budget?


> We also have a number of machines with specs that would be hard to get e.g. on AWS.

What specs are those? I was under the impression AWS has everything from extremely tiny to giant terabytes-of-ram-for-SAP instance types.


We would need something like an X1 instance in terms of RAM, but it's hard to find something on AWS that has a well-tuned balance of RAM/CPU/disk for our needs. A lot of the big specialized instances are tuned for one particular limiting factor (RAM/GPU/storage/CPU/bandwidth/whatever) and I don't recall them having a good selection for "really big everything".

Amazon is constantly expanding the selection, so it's possible they've added something since we last seriously looked into this.


Just curious, can you share general details on the types of services that need really big everything?

For most every workload I can think of, as their load increases it's always one resource in particular that's the limiting factor. (RAM/CPU/Storage/etc). So it makes sense to me that AWS instances focus on optimizing one particular resource type.

Would be interesting to hear about types of workloads that break this pattern. (Or maybe it's a few different types of workloads/services that need to be tightly integrated on one machine?)


Redlining RAM + CPU is not difficult for the right kind of database application that needs in-memory latency. RAM is a limit on your (hot, at least) data size, and CPU is a limit on your query workload.

In my experience, it's harder to also balance I/O so that you're close to hitting all three limits, but update-heavy transactional workloads can manage it for disk-based DBMSs.


> For most every workload I can think of, as their load increases it's always one resource in particular that's the limiting factor.

So this is a rediscovery, or at least an example, of what could be called the bottleneck principle of system performance analysis and optimization! So: just look for the bottleneck(s), work on those, and forget about everything else until, say, everything is equally a bottleneck, at which point you have a well-balanced -- call it optimized, which can be appropriate -- configuration!

At times, e.g., at IBM's Watson lab, there has been a lot of work on applying queueing theory, analysis, and simulation to analyzing and then optimizing such systems, but, more nearly fully true than one might guess, all that really mattered were the bottleneck(s)!


We just launched (Oracle Cloud) an AMD EPYC 2 compute service (called E3 on our compute page) that is the beginning of 'shapeless' or flexible computing: you can spec the cores for your shape from 1-64 and drive the balance of storage and memory. That might be a fit for you, as it's really inexpensive for the power.


It could be that they need hardware that can support less common architectures, like Solaris or AIX.


Sort of sounds like why some large users will opt to go with a wholesaler like Digital Realty, or maybe even Equinix.


Without revealing too much:

What industry are you in?

Are you concentrated in 1 geo location or... across the USA / across 1 country / global?


> What industry are you in

Something plumbing-y; the company is not well known

> 1 geo location or...

Global, although probably 50% in the US


Guessing here...ad-tech


100% ad tech or some Mulesoft style middleware app.


MS and Boomi run on AWS. Not those ones.


Mulesoft on AWS is a mess, though I'm not sure if that's Mule's fault, AWS's fault, or the fact that it's an integration platform and it's always a damn mess.


> because we have lots of fancy redundant infrastructure that we can't rely on from cloud companies)

Haha. This can't possibly be true.


I'm actually surprised you would say something like that. Public cloud is infamous for its VMs being unreliable. The idea is that you should assume a VM can disappear at any time, and you need to design your applications to handle that successfully. It's like how the Internet Protocol (IP) is unreliable -- it doesn't guarantee packet delivery, or even that packets arrive in order -- but TCP can provide a reliable service on top of it.

I've seen on-premises setups, where reliability wasn't even an objective, with machines that easily run for 5+ years without any interruption. Now, the parent said they actually need reliability, and you can achieve it with many technologies: starting with RAID, dual PSUs, or even hot-swappable RAM or CPUs (I remember SPARC machines allowed this). With full control of networking you can also make a standby node take over nearly instantaneously, whereas in AWS it might take a couple of minutes. You can achieve nearly any availability as long as you have enough money. In AWS you don't have that control; your only option is designing your application in specific ways, and even that has limitations. Just look at RDS: everyone would want failover to be instantaneous, but it usually takes a few minutes.


Just to give a few examples, we are typically much more aggressive with RAID, ECC, redundant network links, redundant time sources, redundant cooling, redundant power supplies, etc. than cloud providers.


Cost notwithstanding (heh), and as a relative novice in the cloud world, it looks to me like there are no bounds to the level of redundancy in the big three clouds. The trick is to use cloud-native tooling vs. raw EC2.


Everything you mentioned here is table stakes for the major cloud providers.


Could a saturated network interface prioritize some other company's traffic over your own in AWS? Could the same happen in your own private network?

Consider that building fault-tolerance into their application may be very difficult, or impossible. Cloud would be incompatible with that.


we are not zoomers


Our business has had greatly increased load due to COVID-19. It would have been very nice to buy a 128-core EPYC bare-metal server to run our SQL Server on at this time, to buy us time to rearchitect to handle the load. Instead we are stuck with 96 vCPUs, because that is the most Amazon can do.

It's also very, very expensive to have a 96-vCPU VM on Amazon!


If you're interested, here is an article about how Bank of America chose to build out its own cloud to save about $2 billion a year.

https://www.businessinsider.com/bank-of-americas-350-million...


Cost and data security, with cost actually weighing more. It is simply not true that the cloud is cheaper.


>I honestly fail to see any good reason not to use the cloud anymore, at least for business. Cost-wise, security-wise, whatever-wise.

The problem is single points of failure. Many businesses need to be independent, and having data stored in the cloud is a bad idea overall, because it produces point-of-failure issues. Consider if we ever got a really nasty solar storm and the electric grid went down: the more we rely on the internet and centralize infrastructure into electric devices, the more it becomes a costly point of failure.

While many see redundancy as "waste" in terms of dollars, notice that our bodies have many billions of redundant cells, and that's what makes us resilient as a species; we can take a licking and keep on ticking.

Trusting your data to outside sources is generally a bad idea any day of the week. You always want to have backups and data available in case of disaster, mishap, etc.

It's like no one has learned from this epidemic yet. Notice that our economic philosophy didn't plan for viral infections and has forced our capitalist society to make serious adjustments. Helping people is anathema to liberals and conservatives / Republicans / Democrats, so for COVID to come along and actually force cooperation was a bit tragically humorous.

As a general rule you need redundancy if you want to survive; behaving as if the cloud is almighty is a bad idea. I'm not sold on "software as a service" or any of that nonsense. It's just there to lull you into a false sense of security.

You always need to plan for the worst-case scenario, for survivability reasons.



