De-cloud and de-k8s – bringing our apps back home (37signals.com)
568 points by mike1o1 on March 22, 2023 | 409 comments



Regardless of the merits or drawbacks of "de-clouding" for this particular company, it seems to me that their ops team is just really bored or permanently unsatisfied with any solution.

They say that they've tried deploying their apps in all of:

* Their own Datacenter

* ECS

* GKE

* EKS

* and now back to their own Datacenter

Even with their new "de-clouded" deployment, it seems like they have created an absolutely immense amount of complexity to deploy what seems to be a variety of generic Ruby CRUD apps (I might be wrong but I didn't see anything different in the post).

They have a huge list of tools and integrations that they've tried out with crazy names; Capistrano, Chef, mrsk, Filebeat, traefik... It seems well on par complexity-wise with a full K8s deploy with all the bells and whistles (logging, monitoring, networking, etc.)

Google says that this company, 37signals, has 34 employees. This seems like such a monumental amount of orchestration and infra stuff unless they're deploying some crazy complex stuff they're not talking about.

Idk what the lesson is here, if there is one, but this seems like a poor example to follow.


We're talking about a product that has existed since 2004. They did:

* Their own data center, before Docker existed

* The non-K8s Docker way of deploying in AWS

* The GCP, and then AWS, ways of doing K8s Docker

* Docker in their own data center.

For 20 years of deployment, that doesn't look crazy to me.

The actual components of each environment are pretty standard. They wrote Capistrano, which predates Chef. Filebeat is a tiny part of ELK, which is the de facto standard logging stack. They use a smart reverse proxy, like... everybody else. It's easy to make anything sound complicated if you expand the stack to as many acronyms as you can.


> * Their own data center, before Docker existed

Also, it might be worth calling out: their product launched in 2004, Linode and Xen were launched in 2003, and S3 and EC2 launched in 2006. The cloud as we know it today didn't exist when they started.


Pretty sure they knew the Linode folks and were on there early, if I recall my history right. This is from randomly hanging out with one of the Linode owners at a bar in STL back then.


Whether DHH is "right" in some philosophical sense, this is a small company with a lot of technical experience in a variety of technologies and with presumably a lot of technical chops, so generalizing their experience to "cloud is good" or "cloud is bad" isn't really possible.


I mean, I work for a cloud hosting vendor. I'm not saying one side or the other is right, only that people who are dunking on 37signals for this are telling on themselves.


Well, they were def at Rackspace in there somewhere.


"their own datacenter" both previously and now almost certainly means renting bare metal or colocation space from a provider. I highly doubt they have physically built their own datacenter from scratch


"renting bare metal or colocation space from a provider"

Those are two totally, completely different things. Their own datacenter means their own equipment in a datacenter and could even mean building out their own datacenter. It never, ever means renting bare metal.


>It never, ever means renting bare metal.

Weird; in my company, where we are doing the opposite migration (from a traditional datacenter where we manage the physical servers, to Azure), this is exactly what we mean and say and how we describe it.

We talk about "our datacenter" when we really mean racks of servers we rented from Insight, and we say "the cloud" when we refer to Azure. We've never actually had our own datacenter, meaning a building we own and whose entire physical plant we manage.

Almost no one means it that way. Even Twitter is probably leasing colocation space in the "their own datacenter" category vs. GCP and AWS. The evidence is in the fact that Elon was able to just arbitrarily shut down an "entire datacenter". Or that 37signals was able to just arbitrarily move into "their own datacenter" on a whim.


Referring to rented servers as colocated servers is flatly wrong, no matter how often people are incorrect about it. Sure, some providers put colocation under the same category as VMs and leased hardware, but that doesn't make them overlap.

OTOH, referring to a datacenter of servers that you lease as a datacenter is one thing, but if you have zero hardware that you own in it, would it really be your datacenter, or would it be "the datacenter"?

A datacenter could be anything from a set of IKEA shelves in a room with Internet and power to a fully built out fancy space with redundant power, fire suppression, a full Internet exchange, et cetera, so it's a bit gatekeepery to try to suggest that only huge companies would ever have their own datacenter or their own space with their own hardware in a datacenter.


I'm sure that's the truth of the matter.


The fun part is that they do not understand what it means to have your "own datacenter" vs renting servers in a colo. It does not matter if you are running on AWS or Hetzner; it is somebody else's computer.


We were a similar sized company at about the same time - we owned our data centers in the same way we owned our offices - we leased and occupied them. Sure, if the plumbing sprouted a leak the landlord would come in and fix it, but no one would be confused enough to say we didn't have our own office space.


"The fun part is that they do not understand" YES, 37Signals, I company with a legendary pedigree of pushing technical boundaries and open minded with deployment models totally doesn't know the simple thing that you do.

Get a grip.


You can rent entire rooms from Hetzner and then only you (and I believe the government firefighters) have access cards.

In any case, there are options where you 100% own and control all hardware.


What the heck are you talking about? Do you even know how colocation works?

For starters, even small companies can have their own physical datacenters, although that's not necessarily what we're talking about.

Second, renting hardware has absolutely nothing to do with colocation.


What does 37signals do that makes money?


Let’s write our own container orchestrator though, because control planes are dumb?


I don't understand how the first clause in this sentence connects to the second.

With a simple, predictable workload --- what they have --- it can make sense to lean towards static scheduling, rather than dynamic schedulers. K8s and Nomad are both dynamic schedulers.

This is pretty basic stuff; it's super weird how urgently people seem to want to dunk on them for not using K8s. It comes across as people not understanding that there are other kinds of schedulers, as if "scheduling" means only what Borg did.


Because they already had it running in k8s.

And k8s scales very well to both very low and very high numbers.

Because k8s provides battle-tested features like rollouts, load balancing, etc.

And the ecosystem is great.

cert-manager, Argo CD, the kube stack.

I'm baffled tbh at how they had such a different experience with k8s than I do.


We did! And it did work. And there are def some great things that I (we) love about k8s. Personally, the declarative aspect of it was chef's kiss. "I want 2 of these and 3 of these, please", and it just happens.

Which is the primary reason why we did investigate k8s on-prem. We had already done the work to k8s-ify the apps, let's not throw that away. But running k8s on-prem is different than running your own k8s in the cloud is different than running on managed k8s in the cloud.

Providing all of the bits k8s needs to really work was going to really stretch our team, but we figured with the right support from a vendor, we could make it work. We worked up a spike of harvester + rancher + longhorn and had something that we could use as if it were a cloud. It was pretty slick.

Then we got the pricing on support for all of that, and decided to spend that half million elsewhere.

We own our hardware, we rent cabs and pay for power & network. We've got a pretty simple pxeboot setup to provision hardware with a bare OS that we can use with chef to provide the common bits needed.

It's not 'ultimately flexible in every way', but it's 'flexible enough to meet the needs of our workloads'.


What is your position at 37Signals and how do you like it? I'm really impressed by the innovation that comes out of you guys and the workplace culture you folks have.


I'm a Lead SRE on the Ops team. We've got a fantastic bunch of folks, they're amazing to work with!


The main issue is the ecosystem imho.

Bare vanilla k8s or k3s is nice but it doesn't do much outside of your homelab. Once you want k8s in production in the cloud you have to start thinking about:

- loadbalancing and ingress controller

- storage

- network

- iam and roles

- security groups

- centralized logging

- registry management

- vulnerability scanning

- ci/cd

- gitops
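
Even the "simple" items on that list carry real YAML weight. As a rough illustration (hostnames and the Service name are made up, and it assumes the NGINX ingress controller and cert-manager are already installed in the cluster), exposing a single app over TLS looks something like this:

  apiVersion: networking.k8s.io/v1
  kind: Ingress
  metadata:
    name: app
    annotations:
      # assumes a ClusterIssuer named "letsencrypt-prod" already exists
      cert-manager.io/cluster-issuer: letsencrypt-prod
  spec:
    ingressClassName: nginx
    tls:
      - hosts: [app.example.com]
        secretName: app-tls
    rules:
      - host: app.example.com
        http:
          paths:
            - path: /
              pathType: Prefix
              backend:
                service:
                  name: app        # hypothetical Service name
                  port:
                    number: 80

And that covers exactly one bullet; storage, IAM, logging and the rest each bring their own manifests and controllers.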

And all this is no less complex with k8s than with nomad, bare docker or whatever they chose. And definitely no less complex because it is on a major cloud provider.


In all managed services all of that comes out of the box.

Ingress, lb, storage, network...

And I have my small setup running with all of it too. Took me a weekend to set it up.

RKE2, NGINX ingress, a classic LB in front, cert-manager, and everything else in Argo CD.
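
For a sense of what "everything else in Argo CD" looks like in practice, a single Application manifest is roughly all it takes to have Argo CD keep a folder of manifests in sync (the name, repo URL and paths below are made up):

  apiVersion: argoproj.io/v1alpha1
  kind: Application
  metadata:
    name: my-app                 # hypothetical app name
    namespace: argocd
  spec:
    project: default
    source:
      repoURL: https://github.com/example/my-app-manifests   # made-up repo
      targetRevision: main
      path: deploy
    destination:
      server: https://kubernetes.default.svc
      namespace: my-app
    syncPolicy:
      automated:
        prune: true
        selfHeal: true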


Hey Melingo, I noticed that you responded to a lot of different threads in this post. It seems like you are a bit dismissive of people's experiences using K8s. I have also run K8s at scale, and it is not easy; it is not out of the box in cloud providers. There are a ton of addons, knobs, and work that has to be done to build a sustainable and "production ready" version of K8s (for my requirements) in AWS.

K8s is NOT easy, and I do not believe that in its current form it is the pinnacle of deployment/orchestration technologies. I am waiting for what is next, because the pain that I have personally experienced around K8s, which I know others are feeling as well, means it is not a perfect solution for everything, and definitely not usable for everyone.

At the end of the day it's a tool, and it is sometimes difficult to work with.


Also, when you make a mistake on a key part, it can fail in a very spectacular way, and it can be tricky to debug the issue immediately.

It is usually a game of finding the correct spaghetti (log) in a full plate.


I'm really only sharing my experience or view through my experience.

And I think it's the best thing for infra since pre-sliced bread.

What issues did you run into?


I know you are sharing your experience; others are as well. Let's not dismiss others' experience just because it doesn't match our own; the truth is most likely somewhere in the middle. Especially when so many people are clamoring saying that they had pain using K8s.

The initial deployment for EKS requires multiple plugins to get to something that is "functional" for most production workloads. K8s fails in spectacular ways (even using Argo, worse using Argo TBH) that require manual intervention. Local disk support for certain types of workloads is severely depressing. Helm is terrible (templating Yaml... 'nuff said). Security groups, IAM roles, and other cloud provider functions require deep knowledge of K8s and the cloud provider. Autoscaling using Karpenter is difficult to debug. Karpenter doesn't gracefully handle spot instance cost.

I could go on, but these are the things you will experience in the first couple days of attempting to use k8s. Overall, if you have deep knowledge of K8s, go for it, but it is not the end-all solution to Infra/container orchestration in my mind.

I fought with a workload for over a day with our K8s experts, it took me an hour to deploy it to an EC2 ASG for a temporary release while moving it back to K8s later. K8s IS difficult, and saying it's not has a lot of people questioning the space.

The way I see it is it starts off easy, and quickly ramps up to extremely complex. This should not be the case.

I worked at a company that had their own deployment infra stack and it was 1000x better than K8s. This is going to be the next step in the K8s space I believe and it may use K8s underneath the covers, but the level of abstraction for K8s is all wrong IMO and it is trying to do too much.


Deploying a fixed number of servers to a fixed number of hosts has been battle tested for the past 40+ years. It does work.


It definitely does not.

The main issues we faced with over 700 VMs were: outdated OSes, full disks, full inodes, broken hardware, missing backups or missing backup strategy, OOM.

K8s itself fixes out-of-memory by restarting a pod, solves storage by shipping logs out and killing a pod in case it still runs full, and has a rollout strategy, health checks and readiness probes.

It provides an easy deployment mechanism out of the box, adding a domain is easy, and certificates get renewed centrally and automatically.

Scaling is just a replica number and you have node auto-upgrade features built in.

K8s provides what people build manually out of the box, certified, open sourced and battle tested.


The difference is it's likely possible to have 7 physical servers replace those 700VMs when you have your own hardware without all the overhead.

It is much easier to maintain when you look at those numbers.


Not in my case.

Every VM had 4 cores and 20 GB of memory.

They ran on quite big blades.


Your case is fine. AMD's 4th-Gen EPYC Genoa processors can do up to 192 cores and 384 threads in a single machine (2 socket) with TBs of RAM.

In most cloud environments a "core" can be just a thread.

Older machines have had 4-CPU socket based chassis with many cores as well. Definitely doable.


Ansible is your friend. Btw, we're talking about the team that built Capistrano, so they certainly know how to automate deployments.


Nope, Ansible is horrible in comparison to k8s.

The paradigm shift alone, from doing things step by step to describing what you need and letting it happen, is a game changer.

K8s is probably 100x easier than Ansible.

And Ansible also has its bigger ecosystem, like Ansible Tower.

Basically your k8s control plane, but worse.


> The paradigm shift alone, from doing things step by step to describing what you need and letting it happen, is a game changer.

I've actually used both in conjunction and it was decent: Ansible for managing accounts, directories, installed packages (the stuff you might actually need to run containers and/or an orchestrator), essentially taking care of the "infrastructure" part for on-prem nodes, so that the actual workloads can then be launched as containers.

In that mode of work, there was very little imperative about Ansible, for example:

  - name: Ensure we have a group
    ansible.builtin.group:
      name: somegroup
      gid: 2000
      state: present
  
  - name: Ensure that we have a user that belongs to the group
    ansible.builtin.user:
      name: someuser
      uid: 3000
      shell: /bin/bash
      groups: somegroup
      append: yes
      state: present
This can help you set up some monitoring for the nodes themselves, install updates, mess around with any PKI stuff you need to do and so on, everything that you could achieve either manually or by some Bash scripts running through SSH. Better yet, the people who just want to run the containers won't have to think about any of this, so it ensures separation of concerns as well.

Deploying apps through Ansible directly can work, but most of the container orchestrators might admittedly be better suited for this, if you are okay with containerized workloads. There, they all shine: Docker Swarm, Hashicorp Nomad, Kubernetes (K3s is really great) and so on...


I'm on GKE. The hosts and control plane are managed for me. All I need to do is build/test/security scan images and then promote/deploy the image (via Helm) when it goes out to prod.

Using config management and introducing config drift and management of the underlying operating system is a lot more to think about, and a lot more that can go wrong.


Deploying a fixed number of instances to a fixed number of servers does not imply doing it manually.


And I didn't say that.

We had all of these problems with self-developed automation.

It still was garbage.

K8s just solves those issues out of the box.


So you did automation in a broken way. Here's one way to avoid the issues you described on bare metal:

- Only get servers with IPMI so you can remote reboot / power cycle them.

- Have said servers netboot so they always run the newest OS image.

- Make sure said OS image has a config that isn't broken so you don't get full inodes and so it cycles logs.

- Have the OS image include journalbeat to ship logs.

- Have your health checks trigger a recovery script that restarts or moves containers using one of a myriad of tools; monitoring isn't exactly a new discipline.
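
For the OS-image items on that list, a minimal Ansible sketch might look like the following (assuming a Debian/Ubuntu image with the Elastic apt repo already configured; the logrotate file path is hypothetical):

  - name: Install journalbeat to ship logs off the host
    ansible.builtin.apt:
      name: journalbeat
      state: present
  
  - name: Drop a logrotate policy so disks and inodes don't fill up
    ansible.builtin.copy:
      src: files/app-logrotate.conf   # hypothetical local file
      dest: /etc/logrotate.d/app
      mode: "0644"
  
  - name: Make sure journalbeat is running and starts on boot
    ansible.builtin.systemd:
      name: journalbeat
      state: started
      enabled: true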

Yes, it means you have to have a build process for OS images. Yes, it means you need to pick a monitoring system. And yes, it means you need to decide a scheduling policy.

I wrote an orchestrator pre-K8S that was fewer LOC than the yaml config for my home test K8S cluster. Writing a custom orchestrator is often not hard, depending on your workload; writing a generic one is.

K8S provides one opinionated version of what people build manually, and when it's a good fit, it's great. When it isn't, I all too often see people spend more time trying to figure out how to make it work for them than it would've taken them to do it from scratch.


Your own failures do not define a model.


And?

My experience still counts for something, and the example with those 700 VMs is something I didn't see just once.


Having huge sprawling swarms of VMs is, for some teams, a problem to be solved, not a fact of life to be designed around.


Sry I'm not getting your point.

If I understand it right: VMs were not there because people needed VMs; they were there because people needed compute.

We moved everything to k8s, and we were able to do this because k8s can do exactly that.


The point is to deliver a small set of applications, not to come up with the most horizontally scalable possible deployment fabric.


I ran 1000+ VMs on a self-developed orchestration mechanism for many years and it was trivial. This isn't a hard problem to solve, though many of the solutions will end up looking similar to some of the decisions made for K8S. E.g. pre-K8S we ran with an overlay network like K8S, and service discovery, like K8S, and an ingress based on Nginx like many K8S installs. There's certainly a reason why K8S looks the way it does, but K8S also has to be generic, where you can often reasonably make other choices when you know your specific workload.


And you don't think k8s made your life much easier?

For me it's now much more about proper platform engineering and giving teams more flexibility again knowing that the k8s platform is significantly more stable than what I have ever seen before.


No, I don't for that setup. Trying to deploy K8S across what was a hybrid deployment across four countries and on prem, colo, managed services, and VMs would've been far more effort than our custom system was (and the hw complexity was dictated by cost - running on AWS would've bankrupted that company).


[flagged]


I'm not bragging.

I'm not a 'bro', and 'cringe'? This is not TikTok.

It gives context.

Don't you have anything to add to the discussion?


> They have a huge list of tools and integrations that they've tried out with crazy names; Capistrano, Chef, mrsk, Filebeat, traefik

These tools are pretty stock standard in the Systems Engineering world. I think anyone that's been a Systems Engineer that's over 30 has probably deployed every one of these.

One thing I've learned over my mixed SWE and SE career is that infrastructure is expensive and grows regardless of revenue. I didn't truly appreciate this until I launched Kubernetes on Digital Ocean and began running my personal cloud on it. It was costing me over $100/m for very little. That money was gone whether I pushed a ton of VPN traffic over my mesh or not. It didn't care about how much I stored in the disk I reserved, and frankly, that cost was going to grow as time went on. I pulled the plug, set up servers in my house, wired up Traefik and Docker Compose V2 with a little Tailscale sprinkled on. The servers stay up to date with some scripts, and I deploy new apps on select servers with Docker Compose and Docker profiles.

It's possible for companies to do similar things, but not to the extremes I took it to. A really good infrastructure SWE generally goes for $300k. You can pay people with an expertise in these things who can streamline them and create maintainable products out of your infrastructure, or you can pay for Legos and glue from a managed service provider like AWS, GCP, or Azure. At some point the latter's costs will not scale; you'll pivot and cost-reduce many times, maybe even begin rearchitecting. I think there are a lot of companies that are now realizing the cheap money is gone, and the cloud has somewhat relied on cheap money.


This is the company that gave birth to Ruby on Rails. They appear to have a culture of being very opinionated about their tools and unafraid of doing things their own way.

Probably not an example most companies that size ought to follow but I'm glad they were crazy enough to do it!


I think you're right that they're doing it for fun or because they can, primarily. But I am excited to see them pioneer in this area, both because it's more open and hacker friendly, and because they're moving the needle towards healthy competition amongst the providers.

Our big-three cloud hegemony has already shown its ugly sides, both in terms of price (egress, anyone?) and quality (hello, zero interop and opaqueness). I'd argue we've seen significant complexity increases, especially in server-side tech, in the last 5-10 years, with relatively little to show for it, despite massive economic investments. I expect that trend to continue downwards unless we take back the hacker friendliness of infra & ops.

PS. Actually, scratch "I'm excited"; that's an understatement. I'm thrilled!


Kubernetes is a poster child of open source and transparent development.

And it pushed the needle tremendously.

I'm lost on how you can compare this even.


Pioneer? Other than rewriting Docker Swarm, it sounds like a stack from the early 2000s…


I wonder how much of these movements are them iterating and hunting for ROI in their infrastructure costs. Did GCP and AWS salespeople sell them on the benefits of the cloud, offer discounts and white-glove migration help, show some calculations on how much $$ they would save in the cloud, etc., that on paper sounded great but ultimately wasn't a good fit?

Their market is probably saturated, and perhaps declining enough, that they are reaching for optimizations elsewhere.


There is no such thing as "saving money in cloud". It is all about convenience and it always costs more than a smart team could achieve elsewhere.

I tend to hear an argument that it is cheaper since you do not have to pay people to maintain those services, but in reality you still need that person to set up and maintain your particular cloud setup. And the services themselves are much much more expensive than maintaining your own servers in a data center.

In my opinion cloud hosting and services are more meant for large corporations where no one wants to take responsibility and is scared of doing anything. Cloud is a nice way to shift the blame if/when things go bad - "but cloud is industry standard, everyone does it".


Indeed.

The Hacker News crowd is drinking its own Kool-Aid on this topic and not recognizing how much cost can be avoided by just dropping EKS from the stack.

Remember that in SRE all the abstractions are leaky and thus having more abstractions means having more complexity not less.


Yes, when that grows you can build a new corporate team babysitting control tower.


If you have a fairly stable traffic pattern, hosting your own stuff tends to be significantly cheaper than any cloud provider.


When I read stuff like this it strikes me that probably, by far, their largest operational expense is their staffing cost to orchestrate all of this. I come from a background of running small startups on a shoestring budget. I need to make tough choices when it comes to this stuff. I can either develop features or start spending double-digit percentages of my development budget on devops. So, I aim to minimize cost and time (same thing) for all of that. At the same time, I appreciate things like observable software, rapid CI/CD cycles, and generally not having a lot of snowflakes as part of my deployment architecture. I actually have worked with a lot of really competent people over the past two decades and I like to think I'm not a complete noob on this front. In other words, I'm not a naive idiot but actually capable of making some informed choices here.

That has led me down a path of making very consistent choices over the years:

1) no kubernetes and no microservices. Microservices are Conway's Law mapped to your deployment architecture. You don't need that if you do monoliths. And if you have a monolith, kubernetes is a waste of CPU, memory, and development time. Complete overkill with zero added value.

2) the optimal size of a monolith deployment is 2 cheap VMs and a load balancer. You can run that for tens of dollars in most popular cloud environments. Good enough for zero down time deployments and having failover across availability zones. And you can scale it easily if needed (add more vms, bigger vms, etc.).

3) those two vms must not be snowflakes and must be replaceable without fanfare, ceremony, or any manual intervention. So use docker and docker-compose on a generic Linux host, preferably of the managed variety. Most developers can do a simple Dockerfile and wing it with docker-compose. It's not that hard. And it makes CI/CD really straightforward. Put the thing in the container registry, run the thing. Use something like Github actions to automate. Cheap and easy. (A minimal compose sketch follows this list.)

4) Use hosted/managed middleware (databases, search clusters, queues, etc). Running that stuff in some DIY setup is rarely worth the development time and operational overhead (devops, monitoring, backups, upgrades, etc). All this overhead rapidly adds up to costing more than years of paying for a managed solution. If you think in hours and market conform rates for people even capable of doing this stuff, that is. Provision the thing, use the thing, and pay tens of dollars per month for it. Absolute no brainer. When you hit thousands per month, you might dedicate some human resources to figuring out something cheaper.

5) Automate things that you do often. Don't automate things that you only do once (like creating a production environment). Congratulations, you just removed the need for having people do anything with Terraform, CloudFormation, Chef, Puppet, Ansible, etc. Hiring people that can do those things is really expensive. And even though I can do all of those, it's literally not worth my time. Document it, but don't automate it unless you really need to, and spend your money on feature development.
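
To make point 3 concrete, here is a minimal sketch of the kind of compose file that covers it (the image name, port and health endpoint are made up; the managed database from point 4 comes in via an environment variable):

  services:
    app:
      image: registry.example.com/myapp:1.2.3   # hypothetical image
      restart: unless-stopped
      ports:
        - "8080:8080"
      environment:
        DATABASE_URL: ${DATABASE_URL}           # points at the managed DB from point 4
      healthcheck:
        # assumes the image ships curl and exposes /health
        test: ["CMD", "curl", "-f", "http://localhost:8080/health"]
        interval: 30s
        timeout: 5s
        retries: 3

Run the same file on both VMs, point the load balancer's health check at the same endpoint, and a deploy is "pull the new tag, docker compose up -d", one box at a time.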

But when I need to choose between hiring 1 extra developer or paying similarly expensive hosting bills, I'll prefer to have the extra developer on my team. Every time. Hosting bills can be an order of magnitude cheaper than a single developer on a monthly basis if you do it properly. For reference, we pay around 400/month for our production environment. That's in Google cloud and with an Elastic Cloud search cluster included.

Other companies make other choices of course for all sorts of valid reasons. But these work fine for me and I feel good about the trade offs.


Agree entirely. I think system design interviews are partly to blame because they select for people who think that the only way to design a system is the cargo cult method that interview prep books and courses preach, which is:

- break everything into microservices

- have a separate horizontally scalable layer for load balancing, caching, stateless application server, database servers, monitoring/metrics, for each microservice.

- use at least two different types of databases because it's haram to store key-value data in an RDBMS

- sprinkle in message-passing queues and dead-letter queues between every layer because every time you break one system into two, there can be a scenario where one part is down but the other is up

- replicate that in 10 different datacenters because I'll be damned if a user in New York needs to talk to a server in Dallas

And all this for a service that will see at most 10k transactions per second. In other words, something that a single high-end laptop can handle.

99.9% of the time your architecture does NOT need to look like Facebook's or Google's. 99% of tech startups (including some unicorns) can run their entire product out of a couple of good bare-metal servers. Stop selecting for people who have no grounding in what is normal complexity at a given scale.


I can't agree more on this. Most products out there with medium to low traffic can be handled just fine like this. The cost of automation is often not worth the financial effort.

There's a dangerous trend in putting microservices everywhere. Then having the same level of quality as a monolith requires an infinite amount of extra work and specialized people. Your product must be very successful to justify such expenses!

My rule of thumb: monolith and PaaS as long as your business can afford to.


I mean it all makes sense if you know nothing of k8s or ansible.

Most companies these days have moved to k8s, so there is a portion of high-tech workers who have prior knowledge of the k8s model and deployment.

Whether you want to go monolith or not doesn't matter, because you need to replicate the process to at least 2 environments: dev and prod. Not to mention it's good to be prepared in case your prod env gets compromised or nuked.


Where, oh god where, are there more sensibly thinking people like you! This is pragmatic and straightforward. There is very little room for technical make-work nonsense in your described strategies. Most places, and many devs I meet, cannot imagine how to do their jobs without a cornucopia of oddly named utilities for which they only know a single path of use.


This is actually a really interesting post to me. I'm currently working at the opposite of a startup with a shoestring budget. We're a medium-sized company with 100 - 150 techies in there. As a unique problem, we're dealing with a bunch of rather sensitive data - financial data, HR data, forecast and planning data. Our customers are big companies, and they are careful with this data. As such, we're forced to self-host a large amount of our infrastructure, because this turns from a stupid decision into a unique selling point in that space.

From there, we have about 7 - 12 of those techies working either in my team, saas operations, our hardware ops team, or a general support team for CI/images/deployment/buildserver things. 5 - 10% of the manpower goes there, pretty much.

The interesting thing is: Your perspective is our dream vision for teams running on our infrastructure.

Like - 1 & 2 & 3: Ideally, you as the development team shouldn't have to care about the infrastructure that much. Grab the container image build templates and guidelines for your language, put them into the template nomad job for your stuff, add the template pipeline into your repository, end up with CD to the test environment. Add 2-3 more pipelines, production deployments works.

These default setups do have a half-life. They will fail eventually with enough load and complexity coming in from a product. But that's a "succeed too hard" kinda issue. "Oh no, my deployment isn't smooth enough for all the customer queries making me money. What a bother." And honestly, for 90% of the products not blazing trails, we have seen most problems, so we can help them fix their stuff with little effort on their part.

4 - We very much want to standardize and normalize things onto simple shared services, in order to both simplify the stuff teams have to worry about and also to strengthen teams against pushy customers. A maintained, tuned, highly available postgres is just a ticket, documented integrations and a few days of wait away and if customers are being pushy about the nonfunctional requirements, give them our guarantees and then send them to us.


The only point I disagree with is Terraform. It is brilliant for this exact scenario because it's self documenting. When you do need to update those SPF records in two years time, having it committed as a Terraform file is much better than going through (potentially stale) markdown files. It's zero maintenance and really simple. Plus its ability to weave together different services (like configuring Fastly and Route53 from the same place) is handy, too.


What if I do this with Terraform using AWS Serverless and staying in the free tier for this workload that you are referencing instead of VMs and a load-balancer?

I just don't see why people prefer the VM based approach over serverless.


If you can stay in the free tier you likely don't need a load-balancer either.


From my experiences there are two lessons:

There is usually a sweet spot in terms of size where being on the public cloud makes sense, both from a cost and management perspective. Once you go above that size, having to manage IAM starts becoming a pain. Usually around the same point public cloud costs start becoming noticeable to your finance team, and so you have to start dealing with questions around that. Usually that's a good point to do a sanity check before things get even bigger and more expensive.

Similarly, k8s works well for certain classes of problem, but doesn't work well for all classes of problem. Any form of k8s has an operational overhead and you really need to make sure that you are going to get the ROI from the effort of maintaining the stack for it to be worthwhile.


> having to manage IAM starting becoming a pain

Just use multiple AWS accounts. You won't need any complicated IAM policies.


> Idk what the lesson is here, if there is one, but this seems like a poor example to follow.

The lesson is not to focus on tech tooling as much and to focus more on product instead. Imagine this energy going into the product....


> Idk what the lesson is here

I'd say the lesson is that we, as an industry, haven't figured out this "cloud" stuff yet.

And it looks to me that what we need, roughly, is some sort of "deployment polymorphism" that separates interface from implementation.


In addition to what other commentators have stated: 37signals has ~ 80 employees, not 34.


> They have a huge list of tools and integrations that they've tried out with crazy names; Capistrano, Chef, mrsk, Filebeat, traefik

I use a lot of this or similar (terraform instead of Chef, logstash instead of filebeat) and I'm a one person team. If anything these tools make my job a lot easier and less complex.


This is very common in almost all web companies since around 2015.

I've never seen a company with a simple infrastructure, no matter how simple their actual application is.

If you choose a slow dynamic language (Ruby/Python) your deployment has to be massively complicated; you have no choice about it.

For one simple reason: you will need a multitude of separate components to be made to work together.

You need many application instances because there's no way one machine can handle all your traffic: Ruby is just too slow.

A sharded database cluster as a source of truth:

You went through the effort of making several applications instances with a load balancer: you don't want a single database server to be a single point of failure.

A distributed redis/memcache index to accelerate queries and lower the pressure on the real database.

You might have several index-type engines for different types of queries. Most people use ElasticSearch in addition to Redis.

You need some system to manage all this complexity: monitor it, deploy new versions, rollback to a previous version, run migrations and monitor their state, etc etc.

This is the bare minimum. Most people have a setup that is way more complicated than this. I don't even know how they come up with these complexities, but they not only come up with them frequently: they love it! To them it's a feature, not a bug.


You are making a lot of assumptions, and many of those are not universal problems, or even problems at all.

Compiled languages eventually need a complicated setup for the very same reasons. There is no such thing as "scales" and "doesn't scale". Even Go or C++ webapps have to be scaled up.

If you can get away without complexity on Go or whatever, good for you. Most companies don't.


So, you’re explaining a stack with:

- application instances

- load balancer

- database

- cache

- search cluster if application search is necessary

Sounds like any cookie cutter application to me, even modern ones. How is that complicated?


It's way too complicated. But if this is all you have ever seen and if you've been designing such systems for a decade, this seems like normal to you.

Here's an alternative stack that can handle over 99% of websites:

- Self contained executable

- One-file database

- Cache is memory

- Text search is a library function

- Indexing is a library function

- Serving http is a library function

Such a stack can handle > 20k concurrent connections (per second). The code doesn't need to be "optimized"; just non-pessimized.

You can scale "vertically" for a very long time, especially in 2020 and beyond, where you have machines with over 64 CPU cores. That's almost a cluster of 64 machines, except in one single machine.

If you _must_ scale horizontally for some reason - maybe you are Twitter/Facebook/Google - then you can still retain the basic architecture of a single executable but allow it to have "peers" and allow data to be sharded/replicated across peers.

Again all the coordination you need is just a library that you embed in the program; not some external monster tool like k8s.


There are several reliability issues:

  1) a single panic/exception/segfault in the executable brings down the whole website and so it will be unavailable until the executable restarts

  2) entropy *always* increases (RAM usage, memory corruption, hardware issues, OS misconfiguration etc.) so eventually the application will break and stop serving traffic until it's repaired/restarted (which can take time if it's a hardware issue)

  3) deployments are tricky if there's nothing before the executable (stop, update, restart => downtime)

  4) if cache is in-process, on a restart it will have to be repopulated from scratch, leading to temporary slowdowns (+ and maybe a thundering herd problem) which will happen *every time* you deploy an update

I think much of it is ignorable if the site is just a personal blog or a static site. But if the site is a real time "web application" which people rely on for work, then you still need:

  1) some kind of containerization, to deal with inevitable entropy (when a container is restarted, everything is back to the initial clean state)

  2) at least two instances of the application: one instance crashes => the second one picks up traffic; or during rolling updates: while one instance is being killed and replaced with a new version, traffic is routed to another instance

  3) persistent data (and sometimes caches) need to be replicated (and backed up) -- we've had many hardware issues corrupting DBs

  4) automatic failover to a different machine in case the machine is dead beyond repair

> not some external monster tool like k8s

What can you use instead of k8s for this kind of scenario? (an ultra reliable setup which doesn't need a whole cluster)


It seems to me that people tend to vastly overestimate their uptime requirements. A "real time 'web application'" used by hundreds of millions of people can be down for hours and yet succeed wildly; just look at Twitter, both its old failwhale and its new post-Musk fragile state. Complexity, on the other hand, and thus lower iteration speed and higher fixed costs, can kill a business much more easily than a few seconds of downtime here and there.

You don't need an "ultra reliable setup" or even a "cluster". You can have one nginx as a load balancer pointing at your unicorn/gunicorn/go thing; it's very unlikely to ever go down. You can run a cronjob with pgdump and rsync; on the off chance your server dies, irrecoverably corrupting the DB (which is really unlikely for Postgres), chances are your business will survive a fifteen-minutes-old database.

Most "realtime web applications" are not aerospace, even though we like to pretend that's what we work on. It's an interesting confluence of engineering hubris and managerial FOMO that got us here.


> It seems to me that people tend to vastly overestimate their uptime requirements. "Real time 'web application'" used by hundreds of millions of people can be down for hours and yet succeed wildly

That may be true for social media apps where the Terms of Service don't include any SLAs/SLOs, but if you're a SaaS company of any kind, the agreements with clients often include uptime requirements. Their engineers will often consider some form of "x number of nines" the industry standard.


In the projects I work on, things go down all the time, for various reasons (hardware issues, networking problems, cascading programming errors). It's the various additional measures we have put in place which prevent us from having frequent outages... Before the current system was adopted, poor stability of our platform was one of the main complaints.

I agree that for many projects it may be overkill.


Networking issues and even hardware issues are very unlikely if you can fit everything into one box, and you can get a lot in one box nowadays (TB+ RAM, 128+ core servers are now commodity). MTBF on servers is on the order of years, so hardware failure is genuinely rare until you get too many servers into one distributed system. And even then, two identical boxes (instead of binpacking into a cluster, increasing failure probability) go a very long way.

It's a vicious circle. We build distributed multi-node systems, overlay software-configured networks, self-healing clusters, separate distributed control planes, split everything into microservices, but it all makes systems more fragile unless enough effort is spent on supporting all that infrastructure. Google might not have a choice to scale vertically, but the overwhelming majority of companies do. Hell, even StackOverflow still scales vertically after all these years! I know startups with no customers who use more servers than StackOverflow does.


Re: Crashes.

If there's a bug that brings the server down, it will happen in all instances and repeatedly, no matter how many times you restart. Especially when the users keep repeating the action that triggered the crash.

Re: Entropy. Entropy increases with complex setup. The whole point of not having a complex setup is to reduce entropy and make the system as a whole more predictable.

Re: caches. There are two types of caches: indices that are persisted with the database, and LRU caches in memory. LRU caches are always built on demand so this is not even a problem.

Plus modern CPUs are incredibly fast and can process several GBs of data per second. Even in the worst cases, you should be able to rebuild all your caches in a second.


>If there's a bug that brings the server down, it will happen in all instances and repeatedly no matter how many times you restart.

Not necessarily so. Many bugs are pretty rare bugs which are triggered only under specific conditions (a user, or the system, must do X, Y, Z at the right moment). So it doesn't happen all the time. But when it happens, the whole server crashes or starts behaving in a funky way and other users are affected. Sure, you may say that if it's a rare bug, then users will rarely be affected. But there's never just a single bug like that; there are always N such bugs lurking around (we never know how many of them in a large application). Multiply it by N bugs and you have server crashes for different reasons quite often, making your paying customers dissatisfied. It also assumes you can fix such a bug immediately, which is not always true; there are often Heisenbugs that take weeks to root out and fix, while your customers are affected (sure, the application will restart, but ALL users (not just the one who triggered the bug) can lose work and get random errors when the app is not available -- not a good experience). So having several app instances for backup allows you to soften such blows, because there will always be at least one app instance which is available.

> Entropy increases with complex setup. The whole point of not having a complex setup is to reduce entropy and make the system as a whole more predictable

I agree that entropy increases with complex setup, but there's also base entropy which accumulates simply because of time (which I think is more dangerous). Make a sufficient number of changes to the setup of your application (which you often need if you release often) and eventually someone or something somewhere will make a mistake or expose a bug, and you will need to repair it. You won't be able to do that easily if your setup is not containerized, which would otherwise allow you to return to a clean state quite easily with no effort. We've had issues like that with our non-containerized deployments, and it's a very complex and error-prone undertaking to do it flawlessly (no downtime or regressions) compared to containerized deployments.

>Plus modern CPUs are incredibly fast and can process several GBs of data per second. Even in the worst cases, you should be able to rebuild all your caches in a second

Hm, usually caches are placed in front of disk-based DBs to speed up I/O, i.e. it's not a matter of slow CPUs, it's a matter of slow I/O. Rebuilding everything which is in the caches from DB sources is not super fast.


> and you will need to repair it and you won't be able do it easily because your setup is not containerized which would allow to return to the clean state quite easily with no effort.

Automated deployment, including server bringup, is orthogonal to using containers or hot failover. For example, at $WORK we're deploying Unreal applications to bare metal Windows machines without using containers, because Windows containers aren't as frictionless as Linux ones and the required GPU access complicates things further.


Note that you can totally have more than one instance of the same app/binary running on the same machine. You don't even need containers for that.


But then you need some kind of load balancer, which hsn915 said was "too complicated".


Upfront customer requirements often say they want >99.5% uptime (which allows for 3.5h downtime a month anyway) or some such. In practice B2B customers often don't care much if hour-long downtimes happen every week during off-hours. Sometimes they're even ok when it gets taken down over a whole weekend. Things serving the general public have different requirements but even they have their activity dips during the late night where business impact of maintenance is much lower.


> 2) entropy *always* increases (RAM usage, memory corruption, hardware issues, OS misconfiguration etc.) so eventually the application will break and stop serving traffic until it's repaired/restarted (which can take time if it's a hardware issue)

This is not what entropy means. Even if you constrain it to hardware, there is no reason to think that this will happen eventually, unless your timeline is significantly long.


Also, there are typically multiple processes. A panic stops only one process.


> - Text search is a library function

What text search will provide me with the same features as Elasticsearch? Index time analysis, stemming, synonyms; search time expansions, prefix matching, filtering and (as a separate feature) type ahead autocomplete?

I would love to never touch another Elasticsearch cluster so this is a genuine question.


What about any of this prohibits it from being a library?

https://lucene.apache.org/core/

This is the Java library that ES is based on. Without even having to look at it I can make the following judgement:

It should be easy to port to any language.

It's open source, and it's Java. Java has no special features that make it impossible or particularly difficult to replicate this functionality in any other compiled language, like C, Rust, Go, or any other language that is not 100x wasteful of system resources.


> This is the Java library that ES is based on.

Based on, but Elasticsearch is not just a server wrapped around the library. Features ES has are not in Lucene, otherwise anyone could release a competitor by wrapping the library.

> It should be easy to port to any language.

You win the "Most Hacker News comment of March 2023" award. This thread is talking about less effort, and you bring up porting Lucene to another programming language.


I thought it was already ported to other languages eg. https://clucene.sourceforge.net/

Not sure about feature parity though.


> Based on, but Elasticsearch is not just a server wrapped around the library. Features ES has are not in Lucene, otherwise anyone could release a competitor by wrapping the library.

Those competitors exist.


Go is not less wasteful than Java; both are garbage collected and their memory pressure depends highly on the given workload and the runtime of the program. But Java allows more GC tuning and even different GCs for different use cases (i.e. Shenandoah and ZGC favor very low latency workloads, while the default G1GC favors throughput (not that simple, but you get the point)).

Regardless, Java/Go tier of performance is good enough for this kind of thing.


I was referring to Ruby/Python when I said 100x wasteful languages.


Problem is it doesn't support HA. You're stuck on that single-server model. Upgrades always = downtime = painful. You're also missing things like self-healing, and your Lucene index can become corrupted.

Real-world experience says it's better to move away from it, e.g. lots of self-hosted Atlassian instances over the years. Lucene was a major pain point.


Manticoresearch provides most of the listed features.


Thanks for the reminder. Manticoresearch is an alternative I haven't tried yet. I tried the hip alternatives (Meilisearch, Typesense) in autumn 2022 and both were severely lacking for CRM workloads compared with ES.


>- One-file database

SQLite?

If yes, then I don't really believe that you can have 20k concurrent users where a significant part goes to the DB, not the cache.

But I've been messing with just 1 vCore, so.


You can always put an LRU cache between you and SQLite.

I personally moved from SQLite to a B-Tree based key-value store, and most requests can be serviced in ~500us (that is microseconds). I don't mean a request to the database: I mean a request from the client side that queries the database and fetches multiple objects from it. In my SQLite setup, the same kind of query would take 10ms (that is 20x the time) even _with_ accelerator structures that minimize hits to the DB.

But you can always scale up vertically. You can pay $240/mo for 8 vCPUs with 32GB of RAM. Much cheaper than you would pay for an elastic cloud cluster.


>> ~500us (that is microseconds).

500us is slow. This kind of performance does not remotely obsolete an LRU cache (main memory access is ~5000X faster).

500us is essentially intra-datacenter latency. Obviously your data is in memory on the B-Tree server as there is no room in this budget for disk IO. Postgres will perform just as well if data is in memory hitting a hash index (even B-Tree probably). I don't think the B-Tree key-value store you mention is adding much. Use Redis or even just Postgres.


When you say text indexing and serving http are library functions, what do you mean? Also, is the language here go or what? Since you said python is too slow and then necessitates all the infra to manage it.


Go or any language that actually gets compiled down to machine code to get executed directly on the hardware, and where libraries are compiled into the final product.

When I say something is a library function, I mean you just compile it into your code. In your code, you just call the function.

This is in contrast to the current de facto practice of making an http request to ask another program (potentially on a different machine) to do the work.


Beautiful. Got it; thank you.


Sometimes I think, maybe our complex cluster which runs PHP software (load balancer, app instances, cache etc.) can be replaced with a single performant machine running something like Rust


It can. You don't even need to go all the way to Rust. I'm doing it with Go, which has a GC and a runtime. A single executable on a single machine can handle millions of users per month.


37signals and RoR have a habit of flip flopping on their decisions. See CoffeeScript.


Each of these "flip flops" probably lasted a good deal longer than the median 20+ person startup, so that seems pretty facile. But the parallel with CoffeeScript seems valid --- people on message boards are really not OK with nonstandard languages, and are never less happy than when a company they've heard something about does actual computer science of any kind. See, for instance, Fog Creek and Wasabi.


CoffeeScript impacted ECMAScript more than anything else.


There's an operator for that.


Skimming the thread here, it seems like there's some confusion about the goals:

* They've decided to move from EKS to on-prem largely because of cost. That's logical: almost by definition, it costs more to run workloads on cloud machines than on your own hardware. You can't address that problem by moving from EKS back to ECS, like one commenter suggested.

* They've decided to move from K8s to mrsk, a system they developed. They're fuzzier about why they did that, but the two fairly clear claims they made: (1) their deployments under K8s are a lot more complicated, and (2) they slashed their deploy times (because a great deal of their infra is now statically defined).

I feel like there's more productive debating to be done about K8s vs mrsk than there is about EKS vs. mrsk. By all means, make the case that applications like Tadalist are best run on K8s rather than a set of conventions around bare-metal container/VMs (which is all mrsk is).
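
For readers who haven't looked at mrsk: the "set of conventions" is driven by a single YAML file. From memory of the early releases (field names may have shifted since, and all the values below are made up), a deploy config is roughly:

  service: myapp                 # hypothetical app name
  image: myorg/myapp             # image pushed to a container registry
  servers:
    - 192.168.0.10
    - 192.168.0.11
  registry:
    username: myorg
    password:
      - MRSK_REGISTRY_PASSWORD   # read from the environment
  env:
    secret:
      - RAILS_MASTER_KEY

mrsk then talks to each host over SSH, runs the app as plain Docker containers, and fronts them with Traefik so deploys can cut over without downtime.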


Yeah, I would love to hear more about why they decided not to go with on prem k8s... the other arguments made logical sense to me, but they don't explain the reasoning for mrsk very well.


Every company that I have been at that uses k8s at scale ends up having an internal team to manage the complexities and build internal tooling to make it work. It sounds like they left behind a lot of the cruft and just built a tool that does the one thing most people want: put a container on a VM and call it good.


K8s is very easy to keep maintained, especially because even for self hosted there are plenty of management tools around.

And it allows your teams to deploy everything they need without an admin team.

You can even define ingress yourself.


K8s on bare metal is not easy to maintain.


Have you tried using Gardener or Rancher or Ubuntu's solution on bare metal?

They are very easy to use


That's the thing. On-prem K8s doesn't mean deploying a vanilla Kubernetes using instructions from kubernetes.io. There is an entire industry of proprietary solutions for running Kubernetes on-prem. RedHat Openshift, Rancher, Pivotal PKS, VMWare Tanzu come to mind.


I don't know when they decided to do that transition, but back when I tried Rancher a few years ago (when they were transitioning from Rancher 1.x to 2.x) it was a real bug festival. I think the only robust solution at the time was OpenShift, which was, well, k8s without being vanilla k8s.

Also, most tools that were built to manage k8s clusters were nice for deploying a new cluster, not so much for upgrading a cluster, so you would have to create new clusters every time you wanted an upgrade. That can scale when clusters and blast radius are small, but it can be complicated when it involves contributions from n teams. For this reason, when we were managing our own k8s cluster on prem, we were using kubespray, which worked, but upgrades were a multiple-hours affair.


That's a really good point: the k8s ecosystem is super young.

And so, so much has changed in the last 4 years.

But at least for me, the 'easy to use' threshold was crossed somewhere around 2-3 years ago.

And Gardener, for example, upgrades quite well.

RKE2 is quite stable for me, but the Rancher integration is still not perfect.

But even doing k8s by hand with Ansible was already doable 3 years ago. That's how I started, and I had it up and running. I switched to RKE2 because I realized that doing it myself at that level would not be sustainable / worth it.


I haven't used k8s in quite a few years, what would you recommend I look at these days to get a good overview and understand all the different pieces in the ecosystem?


Unfortunately I don't have a good blog article about this.

I actually thought it would be good to write a k8s blog article after reading this one.

If you can, click together a small cluster on one of the big cloud providers.

Alternatively, I think Google has some k8s codelabs where you can try it out.

Setting up a small application yourself, or looking into what Helm charts exist, might help.

For example, the Helm charts for well-known open source projects, like Bitnami's PostgreSQL chart.


> By all means, make the case that applications like Tadalist are best run on K8s rather than a set of conventions around bare-metal container/VMs (which is all mrsk is).

Okay sure I'll bite. An application like Tadalist is best run on k8s.

With any application regardless of how it runs, you generally want at least:

- zero-downtime deploys

- health checks

- load balancing

Google's GKE is like $75/mo, and the free tier covers one cluster, which is enough. For nodes, pick something reasonable. We're naive, so we pick us-west1 and a cheap SKU; 2 vCPUs / 8 GB is ~$30/mo after discounts. We're scrappy, so we eschew multiple zones (it's not like the nearby colo is any better) and grab two of these at most. Now we're at $60/mo. We could go cheaper if we wanted.

We've click-opsed our way here in about 25 minutes. The cluster is up and ready for action.

I write a Dockerfile, push my container, install k3d locally, write about 200 lines of painstaking YAML that I mostly cribbed off of stack overflow, and slam that through kubectl into my local cluster. Everything is coming up roses, so I kubectl apply to my GKE cluster. My app is now live and I crack open a beer. Today was a good day.

Later, whilst inebriated from celebration, I make some changes to the app and deploy live because I'm a cowboy. Oops, the app fails to start but that's okay, the deployment rolls back. No one notices.

The next day my app hits the front page of HN and falls over. I edit my YAML and change a 2 to a 10 and everything is good again.

Things I did not need to care about:

- permissions (I just used the defaults and granted everything via clickops)

- ssh keys (what's ssh?)

- Linux distributions or VM images (the Goog takes care of everything for me, I sleep better knowing I'll wake up to patched OS)

- passwords

- networking, VIPs, top of rack switches, hosting contracts, Dell, racking and stacking, parking, using my phone

And all without breaking the bank.

---

Okay so I cheated, you weren't looking for a GKE vs on-prem/Colo case. You asked

> make the case that applications like Tadalist are best run on K8s rather than a set of conventions around bare-metal container/VMs

to which I say: that's all kubernetes is.

Did you even read their blog post? virtio? F5? MySQL replication?? How is this a good use of engineering time? How is this cost efficient? On what planet is running your own metal a good use of time or money or any other godforsaken resource? They're not even 40 people, for crying out loud. It's not like they're, say, Fly.io, trying to host arbitrary workloads for customers. They're literally serving Rails apps.

Want to start small with k8s? Throw k3s or k3d on a budget VPS of your choosing. Be surprised when you can serve production traffic on a $20 Kubernetes cluster.

If you care about Linux distributions, and care about networking, and care about database replication, and care about KVM, and care about aggregating syslogs, and love to react to CVEs and patch things, and if it's a good use of your time, then sure do what 37signals did here. But I'm not sure what that planet is. It's certainly not the one I live on today. 10-15 years ago? Sure. But no longer.

I can't believe just how ridiculous this entire article is. I want to find quotes to cherry-pick, but the entire thing is lunacy. You can do so, so much on a cloud provider before approaching the cost of even a single 48U rack in a managed space.

At some scale it makes sense, but not their scale. If I never have to deal with iDRAC again it'll be too soon.

You have a horse in this race: apps like Tadalist are best run on something like Fly, or knative/Cloud Run, or Heroku (rest in peace). But a set of conventions around bare-metal containers/VMs? Give me a break.

I don't think you intended it, but I find it disingenuous to separate cloud hosting and kubernetes. The two are connected. The entire premise is that it should be a set of portable conventions. I can run things on my laptop or desktop or Raspberry Pi or a $10/mo budget VPS or GCP or AWS or Azure or Linode or, god willing, a bunch of bare metal in a colo. It's fundamentally a powerful abstraction. In isolation it makes little sense, which TFA handily demonstrates. If you eschew the conventions, it's not like the problems go away. You just have to solve them yourself. This is all just NIH syndrome, clear as day.

Forgive the long winded rant, it's been a long day.


Agree, I would never want to go back to the old bad days of managing a real rack at a datacenter, with exactly the same guarantees as a single-region deployment inside any cloud. BUT it is true that all the multi-region/AZ guarantees + logs + dashboards + network services @ AWS tend to make costs skyrocket in a couple of years. And here is where k8s really shines, in my opinion: allowing you to abstract your deployment away from a cloud, even on cheap hosting. All the rest outlined in the article is just reinventing the wheel.


Usually the engineers who manage the racked stuff in a datacenter aren't the same ones who deploy the apps.

Last time I was working on-prem, we would just buy a new 2U hypervisor server once in a while. Apps were all running on VMs anyway, so the complexity was not seen by the same people. Storage was a multi-year deal. The biggest issue was storage estimation, and paying from day 1 for storage that would only be fully used in year 5. But I don't think it was that expensive, just accounting gymnastics compared to a pay-as-you-go system. And hyperconvergence was kind of meant to solve that, although I never really had the chance to experiment with it in virtualized environments on-prem.


> That's logical: almost by definition, it costs more to run workloads on cloud machines than on your own hardware

Only if you don't adjust the workload. Lift-and-shift costs more. Re-architecting and making the workload cloud-ready costs less.


Who's gonna do the rearchitecting work? Are you hiring a whole new team or do you not need to keep the lights on while you're transitioning? Depending on the complexity of your application that rearchitecting is gonna eat up a ton of your cost savings.


> almost by definition, it costs more to run workloads on cloud machines than on your own hardware.

Why should that be so? I'd expect the all-in cost of a cloud machine to be less than my own hardware, for the same reason that buying electricity from the grid is usually cheaper than generating it on-prem.

> You can't address that problem by moving from EKS back to ECS, like one commenter suggested.

If EKS is more expensive (because it's something they see as a value-add) whereas ECS is a commodity service at commodity prices, then moving there could well solve the cost issue.


A better analogy for cloud vs on-premises is going to a restaurant vs cooking. The markups are about the same too.


Wouldn't the cost of cooking be higher depending on who you are? If you could spend those few hours doing something with a higher ROI, then you are actually losing money by cooking your own food.


Yep, which makes it a pretty good analogy.


This virtually never happens with the cloud.


You're paying for someone else's profit monthly with the cloud.

It has to cost more.


Beyond a certain scale, sure. But at small scale, you can completely avoid hiring an ops team, or hire a much smaller one, which can more than offset the cloud provider price premium.

My current company works in a niche market with a smallish number of large customers, so our scaling needs are modest. Our total AWS bill is about a third the annual salary of a single ops person.

There's gotta be a very long tail of companies like mine for whom outsourcing to cloud vendors is cheaper than self-hosting.


Depends on the industry and the barrier to entry. If you're in one with a lot of compliance overhead, you are outsourcing a lot more than compute and storage to your cloud provider. Hiring in-house in that same case is extremely expensive unless you are over a certain size.

This article seems written by someone who gets excited by shiny objects / hype trains.


What if someone else's profit is due to economies of scale?


With like 40-50% margins too.


> Why should that be so? I'd expect the all-in cost of a cloud machine to be less than my own hardware

Because cloud hardware doesn't have all the burdens of physically managing a real server. Replacing SSDs. Upgrading RAM. Logging in to an iDRAC to restart a crashed server. None of those things exist in the cloud, and they cost you so much operational time. That's why clouds will ALWAYS cost more than bare metal. The con is that with cloud you keep paying for the same servers: there are no assets anymore, only costs.


Not to mention keeping spare parts around for when something breaks, or having to drive out to the DC to fix/replace the thing that broke or won't restart. Hell, even something "simple" like managing the warranties for the gear you have is no fun at all. People tend to forget all those little things when railing against the evils of the cloud, but I'm here to tell you that they all add up and they are all a major pain in the butt. Cloud gets rid of all that.


There are also discussions around CapEx versus OpEx that apply here, and depreciating costs over time. There is a trade-off of agility, cost, and maintenance, but the markup on cloud is quite high.


The major determinant in hosting cost isn't power, it's the cost of the hardware. But I mean, even if you don't buy my axiomatic derivation, you can just work this out from AWS and GCP pricing.


> The major determinant in hosting cost isn't power.

Let's do the math then.

https://www.hetzner.com/news/neue-dedicated-server-2023/

Let's assume the Hetzner EX101 consumes 200 W (equals 0.2 kW) on average.

Let's assume a private household electricity price of 0.40 EUR/kWh (Germany).

The monthly electricity cost will be: 0.40 EUR/kWh x 0.2 kW x 24 h/day x 30 days/month = 57.60 EUR/month for electricity.

The Hetzner EX101 costs 84 EUR/month.

So even with self-generated energy and cheaper electricity purchase prices, power / electricity is a very significant share of the cost.


At least in the US, businesses/commercial/industry typically get a significant discount on electrical pricing vs consumers.


I always saw it being close to 7:3, non-recurring hardware cost to monthly recurring (MRC) facilities & power, on a 3-year depreciation, for major markets.

That said, all of the big cloud providers SHOULD have a structural advantage on all of those dimensions. None of the small players or self-hosting shops are doing the volume, much less the original R&D, of the big cloud providers. The size of that discount, and how costly it really is to achieve, is another topic.

Disclosure: principal at AWS. Above information is my personal opinion based on general experience of 20 years in the industry doing networking, compute farms, and operations.


Even if [0] cloud does have a structural advantage, it's clear that cloud vendors aren't willing to pass it on to customers, and tend to nickel-and-dime on other necessities like the infamous bandwidth cost.

[0] I'm really curious how big a structural advantage, if any, a large cloud vendor has over a small-time colo user, because surely cloud comes with all kinds of overhead? All the fancy features AWS provides cannot be free. If a customer does not care for those, would a colo, or a small "VPS" vendor, actually have a structural advantage over AWS?


AWS's 24% operating margin does not appear out of nowhere.


Someone making a profit does not mean you are making a loss. Both consumer and producer could end up making profits, depending on the situation.


Cloud is reasonably cheap until you need to move data out (e.g. to serve HTML to customers). Egress charges are where the big players hit you.


The comments in this thread are quite eye-opening.

It really shows what a sacred cow k8s and the cloud have become.

I’m not much of an ops person so I’m not qualified to comment on what 37 signals has created. But I will say I’m glad to see honest discussion around the costs of choosing k8s for everything even if it has significant mindshare.

Perhaps this is the endgame of resume-driven development: cargo culted complexity and everyone using the same tech for similar-ish problems and then wondering why it’s so hard to stand out from both a product and an employee perspective.


I am very much an ops person, and I will say k8s has its place, and it's in 10-20% of companies, max.

It is absolutely ridiculous how many places use k8s and rarely for what it can really do.

Cargo culted complexity, indeed.


Some people are really good at writing software, other people are really good at running systems. k8s/cloud allowed the former to pretend to be good at the latter.


k8s is misunderstood. Everyone focuses on the complexity/over-engineering/etc arguments when those really don't matter in the grand scheme of things.

It's not about any of that, it's about having a consistent API and deployment target that properly delineates responsibilities.

The value of that then depends on how many things you are running and how many stakeholders you have taking part in that dance. If the answers to both of those are small, then k8s's value is small; if the answer to either of them is high, then the value is high.

i.e. k8s is about organisational value; its technical merits are mostly secondary.


The "it's too complex" argument usually reflects more on the commenter than on kubernetes itself. It's actually one of the most very straight forward and thoughtfully designed platforms I've ever worked with.

What I've found in my experience is that applications in general are complex -- more complex than people assume -- but the imperative style of provisioning seems to hide it away, and not in a good way. The inherent complexity hides behind layers of iterative, mutating actions where any one step seems "simple", but the whole increasingly gets lost in the entropic background, and in the end the system gets more and more difficult to _actually_ understand and reproduce.

Tools like ansible and terraform and kubernetes have been attempts to get towards more definition, better consistency, _away_ from the imperative. Even though an individual step under the hood may be imperative, the goal is always toward eventual consistency, which, really only kubernetes truly achieves. By contrast, MRSK feels to be subtly turning that arrow around in the wrong direction.

I'm sure it was fun to build, but one could have spent 1% of that time getting to understand the "complexity" of kubernetes -- which, by the way, quickly disappears once it's understood. Understandably, though, that would feel like a defeat to someone who truly enjoys building new systems from scratch (and we need those people).


You've hit the nail on the head. Ten thousand simple, bespoke, hand-crafted tools have the same complexity as one tool with ten thousand facets. The real velocity gained is that this one tool with ten thousand facets is mass produced, and in use widely, with a large set of diverse users.

I don't know a single person who's been responsible for infra-as-code in chef/terraform/ansible who isn't more or less in love with Kubernetes (once they get over the learning curve). Everyone who says "it's too complex" bears a striking resemblance to those developers who happily throw code over the wall into production, where it's someone else's issue.

> Understandably, though, that would feel like a defeat to someone who truly enjoys building new systems from scratch (and we need those people).

Exactly. Building new systems from scratch is tons of fun! It's just not necessarily the right business move, unless the goal was to get the front-page of HN, that is.


I'll take this bait:

Nomad is better for smaller teams and smaller companies with smaller problems than what k8s is for.

Helm is an abomination on top of it but that seems to be slowing down, thankfully.


I've been using Nomad for about 5 months now, and couldn't disagree more. K8s is better documented, with far less glue, and far more new-hire developers are familiar with K8s compared to Nomad. Nomad-autoscaler alone is becoming a decent reason not to use Nomad. The number of abandoned issues on the various githubs is another. That Vault is a first-class citizen of K8s and a red-headed-stepchild of Nomad is another.

I do agree about Helm tho, I avoid it as much as possible.


Fair enough, I don't know anything about nomad-autoscaler.


I hate kubernetes as much as anyone, but building your own container orchestration platform so that you can deploy a handful of CRUD webapps sounds a lot more like resume-driven development than using a well-known and standard (if somewhat overengineered) solution.


I don't think the authors care about their resumes at this point. There are rational reasons to use a static scheduling regime and a set of conventions around deployment and support services rather than a dynamic scheduler. If it were me, I'd build this with Nomad, but I can imagine not wanting to manage a dynamic scheduler when your workloads are as predictable as theirs are --- you make their case for them when you point out that they just have a "handful of CRUD apps".


There are other options between k8s and “build your own”.


What is there really? There is docker swarm, which doesn't seem to be really further developed, and... what else?

This whole space seems to be neglected, since cloud providers are trying to sell k8s to big-company "devops" guys, but old-school sysadmins don't even know what Docker is. Any development in this area is very welcome.


> Perhaps this is the endgame of resume-driven development: cargo culted complexity and everyone using the same tech for similar-ish problems and then wondering why it’s so hard to stand out from both a product and an employee perspective.

Spot on. Tech is a fashion industry and most people just follow trends. I still sometimes wonder if people are playing the elaborate long-term resume-optimisation game, or if they don't value simplicity highly enough to optimise for it, because the downsides are externalised.


If there's discussion, it isn't a sacred cow.


k8s folks get paid big money to keep it running. Not surprised by the comments here at all. As the saying goes, "in complexity, there is opportunity." and the k8s devops team is milking it hard.


Agreed. I was having similar feelings reading comments.


Only one sentence about why they chose to abandon K8s:

> It all sounded like a win-win situation, but [on-prem kubernetes] turned out to be a very expensive and operationally complicated idea, so we had to get back to the drawing board pretty soon.

It was very expensive and operationally complicated to self-host k8s, so they decided to build their own orchestration tooling? The fact that this bit isn't even remotely fleshed out sort of undercuts their main argument.


We are talking about 37Signals here. This is the company that, when faced with the problem of making a shared to-do list application, created Ruby on Rails. And when they decided to write up their remote working policy, published a New York Times bestselling business book.

This is not a company that merely shaves its Yaks. It offers a full menu of Yak barber services, and then launches a line of successful Yak grooming products.


...and don't forget the time when they wrote a blog post, kept posting through it, and a significant portion of the company quit.


I do forget that time -- what's the context?



> significant amount of the company quit.

no they didn't


Yes, they did. This is not a debatable fact. IIRC, it was 30%+ of the company.


i am sure you will supply proof for your claims.


I was at the company when it happened. I'm currently at the company. I'm in ops and work on all of the mrsk/de-clouding efforts.


Has the political change led to a better or worse work environment?


ha right on! Must've been real awkward for the people who didn't quit in the hottest tech job market of all time :D



> at least 20 people — more than one-third of Basecamp’s 57 employees — had announced their intention to accept buyouts from the company.

Thanks for subjecting me to this crap article (which I presume you didn't bother to read).


The article seems to provide evidence for the claim that a dispute within the company over the messaging from leadership led to 1/3 of the staff leaving. I provided it without comment.

Do you believe that a significant proportion of the staff did not quit? Do you have an alternative source that provides evidence for that version of events?


intention to leave = staff leaving?

then Scarlett Johansson is my wife, because I intend to marry her.

> Do you have an alternative source that provides evidence for that version of events?

Yes, because people go around documenting evidence for things that did not happen.


announced their intention to leave... to the company... in response to the company making an open offer to people of terms for them to leave.

That seems like a slightly different prior, in terms of our Bayesian assessment of the probability that those people remained employed at the company afterwards, than your hypothetical engagement to Ms Johansson.


> to the company

Where did you get this though?

> had announced their intention to accept buyouts from the company.

Is it just people clicking a 'yes' reaction to an internal Slack message? That didn't sound like they were making any commitment 'to the company'.

Also, do you have any comment on the title of the article that you linked? Does that seem honest to you?


So strange to white-knight a company and attempt to deny something that happened pretty publicly...

> As a result of the recent changes at Basecamp, today is my last day at the company. I joined over 15 years ago as a junior programmer and I’ve been involved with nearly every product launch there since 2006.

https://web.archive.org/web/20210430155528/https://twitter.c...

https://web.archive.org/web/20210430140035/https://twitter.c...

https://twitter.com/zachwaugh/status/1388190748189802501

> I’m leaving my position at Basecamp, where I’ve worked for 4 years, due to the recent changes and new policies.

https://twitter.com/lexicola/status/1388189598367559688

https://twitter.com/dylanginsburg/status/1388199059983413257

https://twitter.com/jonasdowney/status/1388205182916440070

> Given the recent changes at Basecamp, I’ve decided to leave my job as Head of Design.

https://twitter.com/mackesque/status/1388206605506842627

https://twitter.com/kaspth/status/1380616358266871810

https://twitter.com/wcmoline/status/1388208323908968449

> I have left Basecamp due to the recent changes & policies.

https://twitter.com/conormuirhead/status/1388207801646780416

https://twitter.com/Rahsfan/status/1388209146487623681

https://twitter.com/AdamStddrd/status/1388223100823642112


> So strange to white-knight a company and attempt to deny something that happened pretty publicly...

it was just skepticism from seeing these sorts of claims over the years. Half of Hollywood would be in Canada if people really followed up on those. At some point it became acceptable to make these sorts of claims with no intention of following up.

I guess quitting your job in the hottest tech market of all time is a little different than moving to a different country.


> Last week was terrible. We started with policy changes that felt simple, reasonable, and principled, and it blew things up internally in ways we never anticipated. David and I completely own the consequences, and we're sorry. We have a lot to learn and reflect on, and we will. The new policies stand, but we have some refining and clarifying to do.

https://world.hey.com/jason/an-update-303f2f99


Wouldn't surprise me if they're doing all that with Yakety Sax[1] blaring in the background.

[1]: https://www.youtube.com/watch?v=ZnHmskwqCCQ


They seem to have lost their touch though. I think they peaked with Remote.

After typing that I found that they renamed from Basecamp Inc. back to 37signals, and their website is trying to hearken back to their past. https://en.wikipedia.org/wiki/37signals

Edit: lol https://37signals.com/22/


You could just look this up. They renamed to Basecamp because they decided to be a single-product company (at the same time, they divested Highrise and Campfire). Six years later, they launched HEY, their email product, so "Basecamp" stopped making sense as a name. They wrote a post about this last year.

later

I added "six years later", but I don't think it changes the meaning of what I wrote originally.


People were used to calling them 37signals, so even if that's the sole reason they renamed, it's more complex than that


Most HN people probably never stopped calling them 37signals, so this seems like an especially weird thing to get hung up on.


Are they dead-naming a company? Isn't that illegal in California already?


That decision was always baffling to me. Basecamp is such a UX nightmare even Jira looks good next to it...


I don’t think they were ever Basecamp Inc, I think they were always an LLC.


Sometimes there's value in building bespoke solutions. If you don't need many of the features of the off-the-shelf solution, and find the complexity overwhelming and the knowledge and operational costs too high, then building a purpose-built solution to fit your use case exactly can be very beneficial.

You do need lots of expertise and relatively simple applications to replace something like k8s, but 37signals seems up to the task, and judging by the article, they picked their least critical apps to start with. It sounds like a success story so far. Kudos to them for releasing MRSK, it definitely looks interesting.

As a side note, I've become disgruntled at k8s becoming the de facto standard for deploying services at scale. We need different approaches to container orchestration that do things differently (perhaps even rethinking containers!), and that focus on simplicity and usability instead of just hyper-scalability, which many projects don't need.

I was a fan of Docker Swarm for a long time, and still use it at home, but I wouldn't dare recommend it professionally anymore. Especially with the current way Docker Inc. is managed.


I think people overindex on thinking that Kubernetes is about scalability.

Honestly, its inbuilt horizontal scaling systems are pretty lacking. Scaling is not actually K8s's strong suit - sure, you can make it scale, but that takes effort and customization.

But what K8s, at base, is actually useful for is availability.

You tell K8s how many instances of a thing to run; it runs them; if any of them stop running, it detects that and tries to fix it.

When you want to deploy a new version, it replaces the old instances with new ones, while ensuring traffic still gets served.

And it does all of this over a substrate of shared underlying server nodes, in such a way that if any of those servers goes down, it will redistribute workloads to compensate.

All of that is useful even if you don't care about scale.
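
For anyone who hasn't internalized the control-loop idea, here's a toy sketch of it in Go -- goroutines standing in for pods, a counter standing in for the API server's view of the world; none of this is actual Kubernetes code:

    package main

    import (
        "fmt"
        "math/rand"
        "sync/atomic"
        "time"
    )

    var running int64 // "actual state": how many instances are currently alive

    func instance(id int) {
        atomic.AddInt64(&running, 1)
        defer atomic.AddInt64(&running, -1)
        // Simulate a workload that eventually crashes or gets killed.
        time.Sleep(time.Duration(rand.Intn(3)+1) * time.Second)
        fmt.Printf("instance %d died\n", id)
    }

    func main() {
        const desired = 3 // "desired state": what you declared
        next := 0
        for tick := 0; tick < 10; tick++ {
            actual := atomic.LoadInt64(&running)
            fmt.Printf("desired=%d actual=%d\n", desired, actual)
            // Reconcile: start whatever is missing. A real controller tracks
            // identity and does far more, but the shape of the loop is the same.
            for i := actual; i < desired; i++ {
                next++
                go instance(next)
            }
            time.Sleep(time.Second)
        }
    }

That observe/diff/fix loop is basically what you're buying; deployments, services and autoscalers are all just more instances of it.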


> simplicity and usability instead of just hyper scalability

This is such a key phrase here.

If I'm starting a small SaaS company tomorrow, my ideal for setting up infrastructure would be a stack which can for now look similar to what this article sets up (especially with the tremendously lower bills), but with an easy migration path to k8s, should I hit the jackpot and have that 'very good problem to have' of too many customer requests to handle.

My big issue with k8s, and honestly with other big fancy toolsets, is that getting started with it requires you to choose between:

- Hire several seasoned cloud orchestration experts, preferably with the platform you've chosen (AWS, GCP, Azure) who will know how to troubleshoot this beast when you have a mysterious issue, or:

- YOLO it! Just follow the basic tutorials to set k8s up, and hope you don't end up sitting up all night with a site that's refusing connections while your customers flee.

The first one is the only responsible choice but it's going to add another half million to your cash burn, and that's on top of the high-margin "managed" service cloud bills like RDS.

So I can see why people are drawn to a system where instead of paying for k8s and "Postgres in a box" they can pay for a simple server and have simple tooling to deploy, back up, etc.


> and focus on simplicity and usability instead of just hyper scalability

I don't get why people in IT need a boogeyman. Looks like "k8s is hard" is the new "pointers are hard" we use to scare other people now.


That's not a great comparison, but it works in a sense. Not all languages and applications benefit from pointers.

The issue is not about k8s being hard. Yes, it has a steep learning curve, but many technologies do. The issue is that learning all of its intricacies, and maintaining it and the services that run on top of it, requires valuable resources many companies don't have, especially early on. And the benefits it provides are, for the most part, not needed early on in a project's lifecycle, and often never. In financial terms, it's very high risk, with low ROI.

If there's a solution that lowers the investment and maintenance costs, while being valuable in the short and long term, then that's generally a more favorable solution for most projects that don't operate at Google's scale.


There is the learning curve, which can be challenging for organizations that aren’t experienced or exposed to scale and performance expectations. When a company moves away from being insular & proprietary to using open source there is a period of churn that ripples through the deployment, implementation and day to day operations aspect of products that live either on customer premises or a cloud platform new to everyone.

There, what YOU know from experience and have evolved and worked through is unknown—because it is all new. And “training” (such as it is) is left as an exercise for each individual.

I’d expect that is the norm for traditional, non-startup firms globally.


There is a very big difference between being a user of K8s and being someone maintaining a K8s cluster.

If you are a user of K8s, then yeah, deploying apps is pretty simple most of the time.

Maintaining a K8s cluster, on the other hand, becomes very complex and hard the moment you have use cases that are a few steps off the happy path. The K8s documentation is not sufficient for operating a K8s cluster on your own hardware; you end up having to go spelunking in the code to see how things work (this is from experience).


The concept of pointers you can explain in one sentence. A K8s introduction course takes 3 hours.

K8s is not hard, but it is complex.
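
For what it's worth, here's that one sentence made concrete (a toy Go snippet, nothing to do with k8s): a pointer is just a value that holds the address of another value.

    package main

    import "fmt"

    func main() {
        x := 5
        p := &x        // p holds the address of x
        *p = 10        // writing through the pointer changes x
        fmt.Println(x) // prints 10
    }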


Yet Stack Overflow is full of questions on pointers.


Pointers are hard though, for the average programmer, as is memory management.

When you transition an IT team or a customer-facing product support team to DevOps, almost everything appears complex if the implementation has been done by engineers new to DevOps and to the cloud itself -- engineers with zero background in scale-out or performance for larger customers. It is a cultural/experience change that faces issues at actual deployment time.

I’m watching that play out at work now.


I'm happy with my usage of k8s, but I think it's unfortunate that current container abstractions are so oriented around imperative assembly in "layers". I want a way to run NixOS in a container and have it feel first class— existing approaches either require installing everything every time with no caching, or pre-building and manually lifecycling your container (streamLayeredImage), or knowing upfront what you're going to need (Nixery).


> Especially with the current way Docker Inc. is managed.

I was reviewing GCP's Free Tier today; they have the same approach: if they need to change or drop services, they agree to give 30 days' notice, same as Docker did. It's probably common for other cloud companies offering free stuff as well. All the negative attention Docker received was fully and wholly undeserved.


> I was reviewing GCP's Free Tier today, they have the same approach

Google is notoriously bad about this and gets negative attention from it, so the comparison isn't favorable, and the publicity is still wholly deserved.


>> I was reviewing GCP…

> Google is notoriously bad at this…

Do you mean Google or GCP? We don't see complaints about AWS because Amazon closed Dash buttons or Spark, and Azure is not seen in any worse light because Microsoft discontinued Skype and whatnot.

Can we name one remotely popular service of GCP that has been shut down at all?


I can't think of a single incident where GCP actually dropped a free tier; I actually see new free tier stuff added since the last time I looked. If you can provide some excellent links to reflect your view, which I've somehow missed along the way, it would be interesting to compare.

Until then, I maintain the Docker publicity is undeserved and if I had to guess, was brought on by podman astroturfers who have been polluting the web the past 2 years claiming how great podman is.


> wouldn't dare recommend it professionally anymore

please provide more insight, thanks


It seems a bit confusing. Here's some context: https://github.com/docker/roadmap/issues/175


Doesn't seem to change anything though? Swarm is still alive and well?


Yeah I'm a bit surprised to hear that. I had only heard a lot of teams giving up swarm when it was deprecated. Didn't know they just restructured the project.


Yup, it's crappy communication from Docker; the ticket is asking for clarity to put an end to the "Swarm is dead" meme.

Even Mirantis is back supporting SwarmKit and adding new features.


> It was very expensive and operationally complicated to self-host k8s, so they decided to build their own orchestration tooling?

You are deeply misunderstanding Kubernetes if you think it's some sort of a turnkey solution that solves all your infrastructure problems. Virtually everything of value in Kubernetes isn't Kubernetes -- you have to add it on later, and manage it yourself. Container runtime? -- that's not Kubernetes. Database to store deployment info? -- that's not Kubernetes. Network setup and management? -- that's not Kubernetes. Storage setup and management? -- still not Kubernetes.

When you start using Kubernetes for real, you will end up replacing almost every component it brings by default with something else. CoreDNS? -- sucks for big systems. Volumes? You aren't going to be using volumes from local filesystem... that's insane! You'll probably set up Ceph or something like that and add some operators to help you use it. Permission management? -- Well, you are out of luck in a major way here... you have, basically, Kyverno, but it really, really sucks (and it's still not Kubernetes!).

Real-life Kubernetes deployments end up being very poorly stitched together piles of different components. So much so that you start wishing you'd never touched that thing because a huge fraction of the stuff you now need to manage is integration with Kubernetes on top of the functionality provided by these components.


> You are deeply misunderstanding Kubernetes if you think it's some sort of a turnkey solution that solves all your infrastructure problems. Virtually everything of value in Kubernetes isn't Kubernetes -- you have to add it on later, and manage it yourself. Container runtime? -- that's not Kubernetes. Database to store deployment info? -- that's not Kubernetes. Network setup and management? -- that's not Kubernetes. Storage setup and management? -- still not Kubernetes.

When you install Kubernetes, you get a container runtime. That's a distribution, I guess. Part of this seems like the GNU/Linux situation.

The other stuff you're listing isn't solved by MRSK either...


But it seems like with mrsk you do not need that other stuff. With Kubernetes you very likely have to deal with most of it.


You can fairly easily set up a not-very-distributed Kubernetes cluster and you won't need as much stuff.


I don't know; for small scale, K8S rocks: I just fired up Kubespray and had a 20-node cluster up and running in maybe an hour, and CoreDNS hasn't given me any problems so far.

Using local volumes is actually not an insane idea if your stateful services can handle data replication themselves: many modern databases can.


Local volumes don't have a concept of quota. You cannot limit them to X bytes. So, if you give a single service a volume, it might just take the whole disk. Well, technically, it might just take the whole filesystem, which, if you have multiple disks used by a single filesystem, will mean it'll take all of them.

Obviously, you cannot move local volumes around.

And if you are setting up a database in Kubernetes... oh, you are in such a pit of troubles that dealing with local volumes isn't really even worth mentioning. Surprisingly, your problems don't even start with storage; they start with memory. Databases really like memory, but use it very opportunistically, and scale well with load. So, when you configure your database, you tend to give it all the memory you have, but how much it actually uses will really depend on the load, the kind of queries, and how well it optimizes them. Since the Kubernetes scheduler doesn't really do well with reservations, you may run into situations where your database OOMs, or just slows everything down, or doesn't perform well at all...

Next comes fsync. Unlike many unsophisticated applications, databases don't like losing data. That's why they want to use fsync in some capacity. But this creates problems sharing resources, again, well beyond anything Kubernetes can help with.

Next comes provisioning of high-quality storage for databases... and storage likes to come in the form of a device, not a filesystem, but Kubernetes doesn't know how to deal with devices, so it needs help from CSIs of all sorts to do that, and depending on the technology you choose, you'll have a very immersive journey into the world of hacks and multi-thousand-page protocol descriptions telling you how to connect your storage and Kubernetes.

It might appear, though, at first glance, that things work well without much intervention, and there's a Helm chart for this or that provider, and it's all at the tips of your fingers... but, as is often the case in the world of storage, things get extremely complicated extremely quickly in case of errors. In such situations, Kubernetes will only obscure the problem. Oh, and errors in storage don't usually happen in the next hour or day or even year after you've set things up. It hits you a few years later, once you've accumulated a ton of useful data and you've entirely forgotten how things were set up, and the folks in Kubernetes have moved on and broken stuff.

---

So, not only do you need small scale, you also need a very short temporal scale: don't expect your Kubernetes cluster to work well after about a year of being deployed. Probably not at all after five years.

But then... if it only works at small scale and for a short time, is it really worth the trouble? I mean, Kubernetes isn't a small thing; it takes away a big constant share of your resources, which it promises to amortize with scale. You are essentially preaching the same idea as Electron-based desktop applications or Docker containers that create a lot of duplication of entire Linux user-space + a bunch of common libraries if you aren't extremely careful with that. Doesn't it become an argument for producing hot garbage as fast as possible so that someone else who can do a better job won't get a chance of selling their goods because they didn't have time to deliver?


Man, you really like to complicate stuff just to take a dig at K8S.

>> Local volumes don't have a concept of quota. You cannot limit them to X bytes. So, if you give a single service a volume, it might just take the whole disk.

That's why we monitor our server disk for usage.

>> Obviously, you cannot move local volumes around.

Most of the time, this is not a requirement for databases.

>> Since Kubernetes scheduler doesn't really do well with reservations, you may run into situations where your database OOMs or just slows everything down, or doesn't perform well at all

Unless it's a test cluster with constrained resources, no other services will run on database nodes, through the use of taints and tolerations. We can let the database use all the CPU and memory it wants.

>> Next comes fsync.

Doesn't matter with a local volume, since it's just a directory on the host system.

>> Next comes provisioning of high-quality storage for databases... and storage likes to come in the form of a device, not filesystem

We didn't deploy our databases with raw block devices, even before K8S. Using regular filesystems makes everything much simpler, and we did not see any performance difference.

>> You are essentially preaching the same idea as Electron-based desktop applications or Docker containers that create a lot of duplication of entire Linux user-space + a bunch of common libraries

Yeah, no. If that's how you read it, be my guest, but don't put words into my mouth.


It seems they are migrating for migration's sake.

AWS ECS -> Google GKE -> AWS EKS -> Self hosted Kubernetes -> bespoke solution.

They are pretty much changing their tech stack every few years, never maturing their knowledge of their current tech stack.

I am sure it makes sense to self-host, but given their track record I am wondering if they will be migrating to the cloud again in a few years.


To be fair that served them well in the past: the reason why anyone knows about 37signals is because they reinvented the wheel back in 2004 with Rails, but what a great reinvention it was. Who knows what can come next.


Which wheel did they reinvent? Rails literally set a bunch of standards used by just about every framework today… app generators, conventions over configuration, asset pipelines, you name it.


Well, as with all home-brewed solutions, you don't know if you are reinventing the wheel until you're done. At first, it always starts with "the current solutions that are available do not fit me, but I still could use them to achieve what I want". There was nothing forcing 37signals back in 2004 to roll their own framework in order to support developing their apps, but they did anyway.

And for every Rails out there, there are thousands of internal frameworks with big ambitions that just turned out to be inferior to what's already available. You just can't know it when you start developing. It takes a bit of ego and ambition to go that path, but sometimes it pays off. And my guess is that if it paid off in the past, you're more likely to try it again.


I think what they wanted to convey was not the redundancy of 'reinventing the wheel', but the ambitious scope and from-scratch approach associated with the phrase.

Maybe 'rolled their own' or 'first invented the universe' would have been slightly better.


But does doing something like that once 20 years ago mean they can do it again?


I dunno. I was a kid when Rails took over the world, so I couldn't even begin to tell you why it succeeded in the way that it did.

But I do feel like they probably know what they're doing enough to have a more modest version of success with this other project, i.e., meeting their own needs well without burning up too much money or time. They're still a really small, focused company, and they have a lot of relevant experience.


Well to be fair Kubernetes doesn't always pluralize the names of collections, since you can run "kubectl get deployment/myapp". You don't want to do the equivalent of "select * from user" do you? That doesn't make any sense!!! And don't translate that to "get all the records from the user table"! That's "get all the records from the users table". (Rails defaults to plural, Django to singular for table names. Not sure about the equivalent for Kubernetes but in the CLI surprisingly you can use either)


To be fair, the article says that they built the bulk of the tool and did the first migration in a 6-week cycle. mrsk looks fairly straightforward, and feels like Capistrano but for containers. The first commit of mrsk was only on January 7th of this year.

> In less than a six-week cycle, we built those operational foundations, shaped mrsk to its functional form and had Tadalist running in production on our own hardware.


They spent a month and a half building tooling _capable of handling their smallest application_, representing an extremely tiny fraction of their cloud usage.


The whole second half of the article is about why they decided to stop using K8s.


k8s is an industry standard currently, but it is not great. The lack of available free/open tooling to set up and manage it (the cluster) properly seems to indicate that it is also a way of selling the cloud. Meaning that if you want to use k8s you have to go with the large cloud providers, otherwise your life will be painful.

I for one am patiently waiting for more innovation in this area and seeing that there are companies that try to disrupt/improve it makes me hopeful and I appreciate it.


Whatever their reasons, I'm very happy to be able to use a much simpler alternative to K8s to orchestrate my apps on bare metal servers.


k3s is lightweight, and even I have clusters running; I can easily sync them too if I wish. I agree, it seems odd they didn't go with some Kubernetes design on-prem.


I'm so lost on so many of the choices this company made.

You de-cloud and now use some mini tool like mrsk?

I'm running k8s on Azure (small), GKE (very big), RKE2 on a setup with 5 nodes, and k3s.

I'm totally lost as to why they would de-k8s after already investing so much time. They should be able to work with k8s really well at this point.

Sorry to say, but to me it feels like the company has a much bigger issue than cloud vs. non-cloud: no one with proper experience and strategy.


If one of the co-founders/owners is writing the devops tooling from scratch... well, that's a decision.

Not saying it's necessarily a bad decision. But it's potentially driven more by personal interests than a dispassionate and strategic plan.


Means either the company is doing super well or about to fail, no in-between here


I'm not sure how well 37Signals is doing these days - Hey didn't make as big an impact as they had hoped and Basecamp probably has a core of loyal users but I don't think it's getting a ton of new customers. They're small and could probably keep going until their founders decide to retire though.


Completely agreed with this. K8s is not that operationally hard. The concepts are all there and it's just giving you a framework for it essentially.

Many k8s operators out there to help you self host it. Abandoning the entire concept is just wild.


It does seem like they just moved all of their infra components, and got rid of autoscaling.

Load balancing, logging, and other associated components are all still there. Almost nothing changed in the actual architecture, just how it was hosted.

I have a hard time seeing why this was beneficial.


Cost, mostly.


Those k8s license fees will get ya.


You reckon they cannot afford to run some VMs for the k8s control plane?


They say they will save $7 million over 5 years with this migration.

Sounds worth it to me.

https://world.hey.com/dhh/we-stand-to-save-7m-over-five-year...


That answers my question, they can afford it if they wanted to. Obviously they don't want to. I'm in their camp when it comes to the cloud vs own hardware.


How much will the many extra employees they will need cost them in 5 years?


Zero, which is why we're not using k8s on-prem. Our team is already handling the on-prem hardware/software environment, and this will consolidate our apps on a single platform methodology, allowing us to keep the same team size. Using mrsk allows us to reduce the complexity of our servers, moving that into the Dockerfile.

If we had gone down the k8s on-prem rabbit-hole, I suspect we would have required more folks to manage those components and complexity.


I don't understand how having k8s means you need significantly more people.

It's just concepts put into a strict system. Now you're just shimming the same concepts with less-supported hacks. Now you have to train your team on less-used technology that isn't transferable to other roles. Sounds like technical debt to me.


We're arguing about generic approaches and the 37Signals folks are making specific decisions about their very specific situation (their app, their staff having time or not, their budget, etc).

To be fair, they don't seem to be saying their strategy is for everybody but the audience thinks so? I think we're talking past each other, tbh.


For me it's still about not getting what issue they had with k8s.

And I would love to spend a few days with their team to understand it.


Maybe they'd rather pay coworkers than a giant tech company.


If I were a small/mid-size company owner, that would be a perfectly good reason.

Which also seems to be the case for Basecamp/37signals.


Even if they had to hire two or three people to work on this full time (they won't) it'd still be a massive savings.


Depends on the quality and HR overhead.

On-call, new features, security, etc.

No small IT department I worked with was anywhere close to the features and quality of Google and co.


wow, TIL the founder of RoR races LMP2 in Le Mans

he may have just become my new hero


I feel like this is the reason so many horrible kubernetes stacks exist.


This is very amusing to read!

This company invented Ruby on Rails and was in business before ‘cloud’ was a thing. Many things can be said about 37signals and DHH in particular, but lacking proper experience is definitely not one of those.


I am very curious why they had such a bad k8s experience while mine is the total opposite.

And just having invented well-known stuff doesn't mean they are infra or platform experts.

I grew into cloud in the same time frame as they did and I'm also a platform expert and k8s expert.

So are they automatically better at deployment and platforms because they invented Ruby on Rails?


The answer to your question is because people's experiences differ, wildly in some cases.

Your account was created 18 hours ago, so I can't really see what support there is for this specific throwaway account to be declared an expert in anything. Are you a self-proclaimed expert or a world-renowned expert? Since they are a world-renowned bunch… :)

And yes, I'm an expert on these things, trust me :)


I only create my accounts ad hoc because I spend too much time on discussions otherwise.

But my argument was more in the sense of contradicting the original argument. No one is an expert just because.

I myself am a cloud architect at a very big company and have introduced a k8s-based platform in two projects: one internal on GKE and one in an open source project.

Both are used by 15-20 teams.

I also run k8s at home for fun and in a small startup.

I've probably been doing primarily k8s for 5 years, and was a software engineer before that.


I can understand declouding, but getting rid of Kubernetes seems like a lot of work for little gain.


They’re not thinking strategically, they’re thinking tactically.


I have been saying this to my customers for a long time: most projects do not really benefit from K8s; on the contrary, it is a huge operational/cost compromise to use K8s for a monolith app that does simple CRUD operations, where occasional downtime is actually acceptable.

In my last project, I removed the unnecessary complexity that K8s was bringing and went back to Ansible scripts, which has worked nicely.

With another customer, we inherited a frontend application that was being deployed with K8s, while Vercel is a considerably simpler/faster alternative.

K8s certainly has its advantages but I'd bet that many projects using it do not gain much.


My impression is that it makes deploying your 100th server much easier, at the cost of making your first several much harder. If you're going to have 100+ servers, that's probably worth it. If you're not (and most companies aren't), then it's like getting your CDL so that you can go to the grocery store in a semi-tractor trailer, when you should have driven there in a compact car.


This seems like an application/stack that didn't have a valid need for k8s in the first place. Don't just use K8s because it's what people say you should do. Evaluate the pros and the VERY real cons and make an informed decision.


That's why we've had good results with ECS. Feels like 80% of the result for 20% of the effort, and I haven't found our use cases needing that missing 20%.


With EC2 launch types, probably. Setting up ECS for Fargate with proper IaC templates/modules isn't much easier than EKS, IMO.


Mostly because CF and CDK have spawned from the deepest pits of hell. It’s ok when using terraform, and downright pleasant when using Pulumi.


I recommend everyone take a look at ECS patterns. This is incredibly easy in $CURRENT_YEAR. Give it a dockerfile / image and CDK deploy:

https://docs.aws.amazon.com/cdk/api/v2/docs/aws-cdk-lib.aws_...


On the Google Cloud side, using Google Cloud Build with Cloud Run with automatic CI/CD is very straightforward. I set up automated builds and deploys for staging in 2 hours. For production I set it up to track tagged branches matching a regex.


We use Fargate, and what we launch is tightly coupled to our application (background jobs spin down and spin up tasks via the SDK) so for now, we aren't doing anything with IaC, other than CI deployment.


When I had to set up ECS with Fargate using CloudFormation the documentation was certainly lacking (in late 2019 I think it was). Now that it's working it's been pretty low maintenance.


It has definitely gotten better over time, but we tend to do a lot of stuff ad-hoc that finds its way into production lol, so we aren't yet relying on any infra as code.


“Need”? Eh, I do it because it’s awesome for a single box or thousands. Single sign-on, mTLS everywhere, cert-manager, BGP or L2 VIPs to any pod, etc., and I can expand horizontally as needed. It’s the best for an at-home lab. I pity the people who only use Proxmox.


Even k3s used 500MB and 2-5% CPU for a single server node with 0 user space pods. That seems incredibly bloated to me.


Throughout my company’s pursuit of moving everything under the sun into AWS, I have done my best to keep everything able to be migrated. We have some systems which are just, simply, going to have to be completely rebuilt if we ever need to move them off of AWS, because there is not a single component of the system that doesn’t rely on some kind of vendor lock-in service AWS provides.

I aim to keep everything I’m working on using the simplest services possible, essentially treating AWS like it’s Digital Ocean or Linode with a stupidly complex control panel. This way if we need to migrate, as long as someone can hand me a Linux VM and maybe an S3 interface we can do it.

I really just have trouble believing that everyone using Kubernetes and a bunch of infrastructure as code is truly benefiting from it. Linux sysadmin isn’t hard. Get a big server with an AMD Epyc or two and a bunch of RAM, put it in a datacenter colo, and maybe do that twice for redundancy and I almost guarantee you it can take you at least close to 9 figures revenue.

If at that point it's not enough, congratulations, you have the money to figure it out. If it's not enough to get you to that point, perhaps you need to re-think your engineering philosophy (for example, stop putting 100 data constraints per endpoint in your Python API when you have zero Postgres utilization beyond basic tables and indexes).

If you still really, genuinely can't make that setup work, then congratulations, you are in the 10% (maybe) of companies that actually need everything k8s or "cloud native" solutions offer.

I would like to note that given these opinions, I do realize there are problems that need the flexibility of a platform like AWS, one that comes to mind is video game servers needing to serve very close to a high number of geographic areas for latency concerns.


To play the devil's advocate here:

> I aim to keep everything I’m working on using the simplest services possible, essentially treating AWS like it’s Digital Ocean or Linode with a stupidly complex control panel.

What's the benefit of AWS then, if you're not using any of the managed services AWS offers, and are instead treating AWS as an (overly expensive) Digital Ocean or Linode?


I’m arguing there’s not a benefit, it’s just the service I have to use for reasons outside of my control.


Wow. "K8s is simple", it has the same vibes as Linux user vs Dropbox:

'...you can already build such a system yourself quite trivially by getting an FTP account, mounting it locally with curlftpfs, and then using SVN or CVS on the mounted filesystem' https://news.ycombinator.com/item?id=8863

https://www.theregister.com/2021/02/25/google_kubernetes_aut...

I love HN's disconnection from reality.


It's not that Kubernetes is simple (it's not), but Kubernetes is relatively simple compared to the task it accomplishes.

If you have containers that need to be scheduled and networked and supplied with storage and exposed to the internet across a large set of machines, past the scale where you can easily do so with tools like docker-compose, Kubernetes might be for you. There's a good chance it will be simpler to understand and reason about than the homegrown kludge you could make to do the same thing, especially once you understand the core design around reconciliation loops.
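
To make that last point concrete, here is a toy sketch (plain Python, not Kubernetes code) of the shape of a reconciliation loop: observe desired vs. actual state and keep nudging the latter toward the former, so crashes and drift are just corrected on the next pass rather than handled as special cases. The service names and counts are made up:

    import time

    # Toy in-memory "cluster": desired replica counts vs. what's actually running.
    desired = {"web": 3, "worker": 2}
    actual = {"web": 1, "worker": 4}

    def reconcile(desired, actual):
        """One pass of the control loop: nudge actual state toward desired state."""
        for name, want in desired.items():
            have = actual.get(name, 0)
            if have < want:
                actual[name] = have + 1   # "start" a replica
            elif have > want:
                actual[name] = have - 1   # "stop" a replica

    # A controller is just this loop run forever against observed state.
    while actual != desired:
        reconcile(desired, actual)
        print(actual)
        time.sleep(0.1)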

That said, you might not need all that, and then you probably shouldn't use Kubernetes.


It's almost as if...

...kubernetes isn't the solution to every compute need...


Tell that to the myriad of folks making their money off of peddling it. You'd swear it were the only tool available based on the hype circles (and how many hiring managers strictly look for experience with it).


Cloud Native Landscape.... https://landscape.cncf.io/images/landscape.pdf

It's more than just peddlers at this point. There are peddlers peddling to other peddlers, several layers deep.


I gotta say, from a dev perspective it is a very convenient solution. But I wouldn't recommend it to anyone that runs anything less complex than "a few services and a database". The tens of minutes you save writing deploy scripts will be replaced by hours of figuring out how to do it the k8s way.

From the ops perspective: let's say I ran it from scratch (as in "writing systemd units to run the k8s daemons and setting up a CA to feed them", because back then there was not much reliable automation around deploying it), and the complexity tax is insane. Yeah, you can install some automation to do that, but if it ever breaks (and I've seen some break) good fucking luck; a non-veteran will have a better chance reinstalling it from scratch.


Except it was created to model virtually every solution to every compute need. It's not about the compute itself, it's about the taxonomy, composability, and verifiability of specifications, which makes Kubernetes an excellent substrate for nearly any computing model, from the most static to the most dynamic. You find Kubernetes everywhere because of how flexible it is to meet different domains. It's the next major revolution in systems computing since Unix.


I (roughly) believe this as well[0], but more flexibility generally means more complexity. Right now, if you don't need the flexibility that k8s offers, it's probably better to use a solution with less flexibility and therefore less complexity. Maybe in a decade if k8s has eaten the world there'll be simple k8s-based solutions to most problems, but right now that's not always the case

[0] I think that in the same way that operating systems abstract physical hardware, memory management, process management, etc, k8s abstracts storage, network, compute resources, etc


Always two extremes to any debate. I've personally enjoyed my journey with it. I've even been in an anti-k8s company running bare metal on the Hashi stack (won't be running back to that anytime soon). I think the two categories I've seen work best are either something like ECS or serverless, or Kubernetes.


I think many have lost the main point of this post: 37signals seems to be a successful tech company, and their employees are having fun.

Very few tech companies can say that. So, it doesn't really matter if they ditch k8s for their own solution. Really, it doesn't.


De-clouding is going to be a huge trend as companies are pressured to save costs, and they realize on-prem is still a fraction of the cost of comparable cloud services.

This whole cloud shift has been one of the most mind-blowing shared delusions in the industry, and I'm glad I've mostly avoided working with it outright.


The thing that gets me about it is the very real physical cost of all this cloud waste.

The big cloud providers have clear cut thousands of acres in Ohio, Northern VA, and elsewhere to build their huge windowless concrete bunkers in support of this delusion of unlimited scale.

Hopefully as the monetary costs become clear their growth will be reversed and these bunkers can be torn down


The big cloud providers are likely packing machines more densely and powering them more efficiently than alternatives like colos.


Much more than just efficient. You think AWS is getting the same CPUs normal civilians get? No way, dude. Those guys are big enough that they can get custom hardware just for their specific needs. Their cooling systems, power systems, everything is way more efficient. And they are big enough that they can afford to measure every single metric that matters and optimize every one.


They ain't going to be unused, lmao. If migrations happen, they will just stop building new ones or have to compete harder on pricing.


For what it's worth, large providers will always need datacenters. But perhaps datacenters run by public cloud providers today will someday be sold off, at a discount, to larger businesses running their own infrastructure. Most of the infrastructure itself will age out in five or ten years and would've been replaced either way.

Heck, datacenters in Virginia are likely to end up being sold directly to the federal government.


Our firm started the big cloud initiative last year. We have our own datacenters already, but all the cool startups used cloud. Our managers figure it'll make us cool too.


This sort of thing is absolutely insane. Like, sure, small office, no existing datacenter infrastructure, it might make sense to bootstrap your business on someone else's cloud. But if you literally have a cooled room and an existing network infrastructure, it's absolutely silly to spend money on using someone else's.


On-prem has its own issues. Many small applications need little more than a VPS and a sane backup/recovery strategy.


Something these conversations seem to miss is that it is not binary; you don't have to host hardware on-prem if you don't want to be in AWS. There are other clouds. There are Sungards of the world where you can pay for racks of managed hardware. There are a lot of options between buying and managing your own hardware and AWS.


Good for them. Now they have a one-off to manage themselves. It's pretty easy to de-cloud using something like k3s, with so much value added in Kubernetes to leverage. But they have Chef and they're a Ruby shop, so I guess they'll be good.

TBH, Kubernetes has some really rough edges. Helm charts aren’t that great and Kustomize gets real messy real fast.


Chef kind of scares me. It's such a tarpit, and it felt like abandonware five years ago


The scope of their self-developed tool doesn't seem very large; it looks like it could be a wrapper around SSH. I've done similar things using an SSH library with Python to deploy and run docker-compose YAMLs on multiple servers.
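
For reference, a minimal sketch of that approach with paramiko (host names, user and compose path are placeholders; assumes key-based SSH auth and docker compose already present on the boxes):

    import paramiko

    HOSTS = ["app1.example.com", "app2.example.com"]  # placeholder hosts
    COMPOSE_DIR = "/srv/myapp"                        # placeholder path

    def deploy(host):
        client = paramiko.SSHClient()
        client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
        client.connect(host, username="deploy")  # assumes key-based auth is set up
        try:
            # Pull the new image and restart the stack on the remote box.
            for cmd in (f"cd {COMPOSE_DIR} && docker compose pull",
                        f"cd {COMPOSE_DIR} && docker compose up -d"):
                _, stdout, stderr = client.exec_command(cmd)
                if stdout.channel.recv_exit_status() != 0:
                    raise RuntimeError(f"{host}: {cmd} failed: {stderr.read().decode()}")
        finally:
            client.close()

    for host in HOSTS:
        deploy(host)

Tools in this space mostly differ in what they layer on top of that core: health checks, rolling order, rollbacks, secrets.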


There are many of these tools out there. When I was working for Technicolor Virdata some years ago, we were heavily invested in https://github.com/infochimps-labs/ironfan. It was extensible, we had support for SoftLayer and IBM SCE, and we had some patches to make the bootstrap and the kick command perform faster. But it was still slow and people didn't like Ruby (I don't mind it).

Even back then I wasn’t a fan of doing a proactive ssh connection to the node. I always leaned towards the machine pulling artefacts and deploying them. Like Flux CD does.


> cluster-autoscaler

> ingress controllers

> storage drivers

> external-dns

> node termination handlers

> complex networking concepts around VPN, peering, route tables, NAT, …

> where DNS is handled

> It also misses the entire sphere around identity and access management for those resources that also needs to be maintained

Well, how is this all solved with their new tooling? Like they describe a whole huge complicated problem space and then write a tool for the simplest part of it: deploying an app. :shrug:


We use k8s to run the app on AWS, on our own hardware in a few datacenters (in countries with strict personal data laws), and on clients' own servers as well (something like the banking sector or a jewelry company, i.e. companies which don't trust the cloud).

From what I heard, AWS is the most stable and easiest to work with of all; the servers which run on our own hardware have more outages and our SRE team often needs to make trips in person to the datacenters to replace hardware etc. Clients' hardware is the faultiest (unsurprisingly). Ideally we'd rather host everything on AWS :)


The thing I noticed is that they are not using any other AWS services. No S3, Elasticache, DynamoDB, etc. They are just running applications and databases.

This will not be the case for many people using the cloud, and a migration to bare metal will be much harder. Each of those services needs an equivalent to be deployed and managed, and its features might not be up to what the AWS equivalent has.

Even the stuff that they are moving (databases, load balancers, etc) is significant operational overhead. In AWS database fail-over is an option you tick. Self hosting has whole books written about how to do database high availability.


For their more complex products they do use RDS, Elasticache, etc. so that will be a whole other adventure.


And don't get me wrong: whatever works for the company, but k8s experience alone is already super helpful.

A lightweight k8s stack out of the box + argocd + cert-manager is like infra on steroids.


The whole kubernetes section of this writeup is two sentences. They went with a vendor provided kube & it was expensive & didn't go great.

It just sounds like it was poorly executed, mostly? There are enough blogs & YouTube videos of folks setting up HA k8s on a couple of RPis, & even the 2GB model works fine, even with not-quite-half the RAM going to overhead on the apiserver/etcd nodes.

It's not like 37signals has hundreds of teams & thousands of services to juggle, so it's not like they need a beefy control plane. I don't know what went wrong & there's no real info to guess by, but 37s seems like a semi-ideal, easy lock for k8s on prem.


Or you could look at it like this:

It's not like 37signals has hundreds of teams & thousands of services to juggle, so it's not like they need k8s and the complexity it brings.


So they wrote their own bespoke replacement.

It seems like a lot of effort to do less. Hopefully it helps others too, I guess. But it feels like a problem space with a lot of inherent complexity that's liable to expand over time, & I'd have very high skepticism toward folks who opt to greenfield it all.


Sure, there is some inherent complexity, but by writing their own tool, they get to choose exactly how to handle the complexity for their particular use case, instead of having it dictated by a general-purpose tool developed by a consortium of US corporations. I consider that a win.

If they have the manpower and expertise to do that, more power to them!


Wow, uh, this is just such a sad short statement. It's just so woefully out of touch, so baselessly derogatory.

Kube is mostly a pretty generic idea that greatly empowers folks to write their own stuff. There are dozens of gitops systems. There are hundreds of event-based systems. They almost all have some Custom Resources registered in API Server, but that's because it's good & doesn't encumber anyone. Beyond that it feels like the sky is the limit.

There are some deeper kube things. There's a Scheduler Framework with a huge emphasis on extensibility and modular plugins, creating the flexibility to keep this general.

This zeal, this desire to feel oppressed, this righteousness of rebellion: I wish it could also reflect & understand options & cooperation & possibility, see how a lot of the terrifying forces out there don't want us all consigned to narrow fixed paths. More people than you acknowledge want to potentiate & enrich. The goal of these efforts is anything but to dictate to us how we do things, and it's so easy, so simple to see that: to explore how flexible & varied & different these world-class cluster operating systems we're working on together are, how they help us accomplish many different ends, and how they help us explore new potential ends.


On one hand, yes, in theory k8s is pretty extensible. In practice, though, you always end up being forced to do things you do not want or need to do, or being prevented from doing things you want to do, because of vendor specifics. Sometimes that is an acceptable tradeoff, sometimes not.

Plus, it is always good to take a step back and appreciate that monoculture is a bad thing in computing. We always need more different approaches, viewpoints, solutions to the same problems. Should everyone roll their own? Of course not - that's why I mentioned having sufficient manpower and expertise to do that.

We should be applauding having more choices and cheering, not scolding those who strive to provide them.

As for your last paragraph, I completely agree, we need to share the knowledge and cooperate. But expecting corporations to "potentiate & enrich" us is rather naive. They will play nice only as long as they need to, and the minute their financial incentives do not align with sharing, they will do their best to pull the rug from underneath everybody else. Even their sharing phase is only to build levers to use in the future. We've seen it over and over and over for the past several decades, with Oracle, SCO, Microsoft, Apple, Google, ... heck, I could pretty much list all big companies.


Seems like the name of their container shipping tool Mrsk is inspired by Maersk, a leading container shipping company from Denmark.


Plot twist: all of this is just a very clever Maersk marketing campaign.


indeed - the presenter (dhh) is Danish, and there was a photo of a container ship near the beginning of the demo video.


So as an industry we've been having some version of this debate (at FB we were having it at least as far back as 2014, my org was IIRC the first big one to test-drive our Borg-alike container solution).

These days I think maybe it's just that classic dilemma: over-design and over-build to be ready for contingencies, or build just what we know we need and maybe get caught with our slacks down. This goes by a zillion names: WET vs DRY, YAGNI, microservices vs monolith; there are countless variations on the same core idea.

If you start with PHP and MySQL and a chain-smoking sysadmin, and you get hit with hyper-growth then you adapt or die, and you have a mountain of data to figure it out. This is paradoxically an easier decision tree (IMHO) even if maybe some of the engineering is harder or at least higher-stress.

But by far the more common case is that we're building something that isn't huge yet, and while we hope it goes huge we don't actually know if it will: should we build more features and kinda wing it on the operability/economics/automation can of worms, or should we build for the big time from day one?

I think it's a legitimately hard set of questions and reasonable people can disagree. These days I think the only way to fully screw it up is to get ideological rather than pragmatic about it.


A lot of people are kind of missing the forest for the trees here. Ignore the fact that what they're doing is probably a terrible idea for most other people. If it works for them, that's fine. It might only work for them, and that's fine.

Don't paint your bike shed orange just because somebody famous painted theirs orange. They have their reasons. Paint yours whatever color works best for you, for your own reasons.


If only DHH were not posting marketing pieces where it seems that if you also paint your shed orange it will save you $8 mil over 5 years.

It is also that DHH is somewhat renowned in the software world, so it is not like Pedro Pascal is telling you to do it.

Yes, it is marketing and boasting, but it is not so easy to see the forest for the trees here.


It's anecdotal, but my sentiment is that the Kubernetes ecosystem drains an even bigger part of the collective effort required to provide business value. I believe many engineers have a disconnect on what it means to provide real business value. Solutions like Kubernetes are designed to accommodate an endless number of scenarios, of which you probably only need one to provide value for your business. The consequence is a disproportionate ratio of Kubernetes possibilities, hence complexities, vs. the simplicity of your requirements. Once your workload runs on Kubernetes, you cannot afford to ignore the complexities of Kubernetes, so you are automatically sucked into the rabbit hole.


They could just use nomad and call it a day


They could, but instead they're doing something closer to static scheduling. They have a small set of applications and a lot of visibility into what their needs are going to be, so the complexity of a dynamic scheduler might not pay its own freight in their environment.

I like Nomad a lot and it's what I would use if I were migrating a "halfhearted" K8s application to on-prem metal, but I couldn't blame someone who felt burned by K8s complexity for not investing in another dynamic scheduler.


K8s, Docker and AWS/GCP/Azure are to ops what React is to web development, i.e. rarely the appropriate tool for the job. Trouble is, you now have a generation of devs who have no experience with anything else.


At one of my former workplaces we ran Kubernetes on premises and it worked like a charm. I still think that Kubernetes can be a good fit for microservices even if you use your own hardware.


I see this as an instance of a company discovering value well-described in this article from 2021:

"The Cost of Cloud, a Trillion Dollar Paradox"

https://a16z.com/2021/05/27/cost-of-cloud-paradox-market-cap...


I think it is cool they are developing new tooling. I don’t understand all the negativity. Isn’t it good that people keep innovating in this space?

Also, how is this different from deploying before k8s and TF etc. were a thing? We would write our own scripts to manage and deploy our servers. This is the same, no? Just a bit more structured, and it has a name.


37signals folks love to put a spin on anything they do as if it's groundbreaking or super innovative... but it rarely is. In particular, they love to take a contrarian position. Like their books, there really isn't anything interesting written here.


I'm not going to put this down, because it sounds like they're quite happy with the results. But they haven't written about a few things that I find to be important details:

First, one of the promises of a standardized platform (be it k8s or something else) is that you don't reinvent the wheel for each application. You have one way of doing logging, one way of doing builds/deployments, etc. Now, they have two ways of doing everything (one for their k8s stuff that remains in the cloud, one for what they have migrated). And the stuff in the cloud is the mature, been-using-it-for-years stuff, and the new stuff seemingly hasn't been battle-tested beyond a couple small services.

Now that's fine, and migrating a small service and hanging the Mission Accomplished banner is a win. But it's not a win that says "we're ready to move our big, money-making services off of k8s". My suspicion is that handling the most intensive services means replacing all of the moving parts of k8s with lots of k8s-shaped things, and things which are probably less-easily glued together than k8s things are.

Another thing that strikes me is that if you look at their cloud spend [0], three of their four top services are _managed_ services. You simply will not take RDS and swap it out 1:1 for Percona MySQL, it is not the same for clusters of substance. You will not simply throw Elasticsearch at some linux boxes and get the same result as managed OpenSearch. You will not simply install redis/memcached on some servers and get elasticache. The managed services have substantial margin, but unless you have Elasticsearch experts, memcached/redis experts, and DBAs on-hand to make the thing do the stuff, you're also going to likely end up spending more than you expect to run those things on hardware you control. I don't think about SSDs or NVMe or how I'll provision new servers for a sudden traffic spike when I set up an Aurora cluster, but you can't not think about it when you're running it yourself.

Said another way, I'm curious as to how they will reduce costs AND still have equally performant/maintainable/reliable services while replacing some unit of infrastructure N with N+M (where M is the currently-managed bits). And also while not being able to just magically make more computers (or computers of a different shape) appear in their datacenter at the click of a button.

I'm also curious how they'll handle scaling. Is scaling your k8s clusters up and down in the cloud really more expensive than keeping enough machines to handle unexpected load on standby? I guess their load must be pretty consistent.

[0] https://dev.37signals.com/our-cloud-spend-in-2022/


> First, one of the promises of a standardized platform (be it k8s or something else) is that you don't reinvent the wheel for each application. You have one way of doing logging, one way of doing builds/deployments, etc.

You can also hire people with direct relevant experience with these tools. You have to ramp up new developers to use the bespoke in house tooling instead.


I think the entire scaling thing is a bit like "manual memory management vs GC", both have advantages and disadvantages.


Yes and no. Different types of memory management essentially accomplish the same thing. The way you build for them and their performance characteristics vary. In that way, scaling is the same.

But scaling is different in that your physical ability to scale up with on-prem is bounded by physically procuring/installing/running servers, whereas in the cloud that's already been done by someone else weeks or months ago. When you shut off on-prem hardware, you don't get a refund on the capex cost (you're only saving on power/cooling, maybe some wear and tear).

It's not just that you need to plan differently, it's that you need to design your systems to be less elastic. You have fixed finite resources that you cannot exceed, which means even if you have money to throw at a problem, it doesn't matter: you cannot buy your way out of a scaling problem in the short-medium term. If you run out of disk space, you're out of disk space. If you run out of servers with enough RAM for caching, you're evicting data from your cache. The systems you build need to work predictably weeks or months out, and that is a fundamentally different way of building large systems.


This is it, and what so many anti-cloud people are missing. For startups, how can you possibly take a gamble on trying to predict what your traffic is going to be and paying upfront for dedicated servers? It puts you in a lose-lose situation: your product is not the right fit, and you've got a dedicated server you are not using. Your product is a success - well, now you need to go and order another server, and better hope you can get it spun up in time before everything falls over. I worked at a startup where we saw a 1000x increase in load in a day due to a customer's app going viral. On-prem would have killed us; cloud saved us.

And you are bang on about managed services. RDS is expensive, no doubt, but having your 4-person dev team burn through your seed round messing around with database backups and failover is a far higher cost.

Of course some companies grow out of the cloud, they have full time ops engineers and can predict traffic ahead of time - for sure, go back to on prem. But for people to hold up articles like this and say "I always said cloud was pointless!" is just absurd.


OK, if you don't want to get good at planning as a company, that's fine. It's OK, just please don't pretend that it's impossible.

I worked at a startup that did the crazy scaling with physical servers just fine. No problem. The marketing department knew ahead of time when something was likely to go viral, IT/Dev knew how much capacity was needed per user and procurement knew lead time on hardware + could keep the vendors in the loop so that hardware would be ready on short notice.

With good internal communication it really is possible to be good at capacity management and get hardware on short notice if required.

Normally we would have servers racked and ready about 2 weeks after ordering, but it could be done in under half a day if required.

Edit: (we had our own datacentre and the suppliers were in a different state)


> The marketing department knew ahead of time when something was likely to go viral

That's fine when it's your product. The situation I'm talking about was a SaaS product providing backend services for customers' apps. Our customers didn't know if their app was going to go viral, so there is no way we could have known. I maintain on-prem would have been totally inappropriate in this situation.

Also, "the marketing department knew ahead of time when something was likely to go viral"... that is quite a statement. They must have been some marketing department.


Eh... I used to think they were just a normal marketing department. I have since learned that they were good and most places have bad ones.


Depending on your business use-case, sharing a pool of IPs can have a detrimental impact on access. For example, you may find the prior users were doing unauthorized security scans, spamming email, or hosting proxies.

i.e. if you get an IP block with a bad reputation, then you may find running a mail or VoIP server problematic.

If you are running purely user-centric web services, then it doesn't matter as long as you are serving under around 30TiB/month.

There is also the issue of future offline decryption of sensitive records without using quantum resistant storage cryptography.

Rule #4: The first mistake in losing a marathon is moving the finish line. =)


Sounds to me like 37signals uses the risk aversion paradigm typical for stagnating businesses — instead of building and refining their strengths they're fixated on mitigating their weaknesses.


I've been following their move to on premise with interest and this was a great read. I'm curious how they are wiring up GitHub actions with their on premise deployment. How are they doing this?

The best I can think of for my own project is to run one of the self hosted GitHub actions runners on the same machine which could then run an action to trigger running the latest docker image.

Without something like that you miss the nice instant push model cloud gives you and you have to use the pull model of polling some service regularly for newer versions.
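
The crude version of that pull model is just a loop around docker compose on the host, since "up -d" only recreates containers whose image or config actually changed. A sketch (the path and interval are placeholders):

    import subprocess
    import time

    COMPOSE_DIR = "/srv/myapp"   # placeholder path to the compose project
    POLL_SECONDS = 300           # placeholder polling interval

    def run(*args):
        subprocess.run(args, cwd=COMPOSE_DIR, check=True)

    while True:
        # Fetch newer images for the tags in the compose file; no-op if unchanged.
        run("docker", "compose", "pull")
        # Recreates only the containers whose image (or config) changed.
        run("docker", "compose", "up", "-d")
        time.sleep(POLL_SECONDS)

It trades the instant feedback of a push for not having to expose anything inbound, which is exactly the trade-off described above.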


They mentioned their mrsk tool sshes to the boxes to deploy so the action probably runs the tool and does just that?


Negative. No external tool/company has ssh access. GHA is strictly for CI, which is decoupled from the actual deploy.

If we do decide to tie it in, it will be using the GH Deployment API to inform the local tool on CI status or something.


What do you do then, if you don't mind me asking? I see this problem time and time again for self-hosting and using CI/CD - and every time it seems to either come down to exposing SSH, polling for new versions, or running the GitHub Actions runner on the same machine as the app or service.


K8s has some cognitive overhead. For simple deploys, a docker client-server setup with docker-compose is a winner; see misterio[1], which basically leverages docker-compose + ssh.

But when you need to guarantee the system will auto-restart, plus healthchecks and so on, K8s is the de facto standard.

Helm's template language (based on Go templates) is not ideal, but it is difficult to replace K8s nowadays with simpler systems.

[1] https://github.com/daitangio/misterio


I wonder how they are going to handle fault tolerance when machines go offline?

So as to avoid being paged in the middle of the night, I grew to really like automation that keeps things online.


Their app is running on at least 2 machines, so the load balancer takes care of it.


Timing is perfect: several of my clients fell through on Series B and are now looking to cut cloud costs (all way overprovisioned for their traffic and customer numbers).


They're saying a VM takes seconds to boot up; yeah, only because they run static dedicated servers. Of course in the cloud, if you wait for the VM to come online, it's going to take longer. Now, how long does it take them to add a new dedicated server to that pool of servers? Days?

The other main issue I see is that they use Chef and mrsk to set up applications, so how is Filebeat set up? Is it Chef that sets it up, or is it mrsk?


I started my career at a company that was excellent at capacity management and prediction. Using physical hardware they never hit a capacity problem, ever, despite growing like crazy. This did require the Marketing department being in close communication with the IT department about upcoming campaigns.

Everywhere else, though, has been terrible at predicting future capacity needs. As far as I can tell that's because they just use tools that give a prediction based only on historical growth.

I guess my point is that it's entirely possible to be good at capacity management, and if you are then the lead time disadvantage of physical hardware can be completely negated.


From my experience, with static physical capacity you're always way overprovisioned, because of "just in case" and how long it takes to get servers.


It's easier to massively over provision or use the cloud than it is to get good at capacity planning. Same as how it's easier to use a GC than it is to do manual memory management.

They are all valid strategies, the key is picking the one that suits your situation.

If you need a small to medium amount of resources then the cloud is likely the cheapest option.

If you need a medium to high amount of resources then massively over provisioning can still be cheaper than using the cloud.

The cheapest option for anything medium size and above is physical servers with good capacity management.

Good capacity management requires good internal communication between business units. And making predictions based on expected/planned events not just historical data.


> now how long does it takes them to add a new dedicated server and to add it to that pool of servers? days?

Depending on provider and your config it could take minutes to hours these days. It also could take months if you’re ordering your own hw


Is this cheaper though?

For a medium-to-large app, K8s should offset a lot of the operational difficulties. Also you don't have to use K8s.

Cloud is turn-on/turn-off, whereas on-premises you pay an up-front investment.

Here are all of the hidden costs of on-prem that folks forget about when thinking about cloud being "expensive":

- hardware

- maintenance

- electricity

- air conditioning

- security

- on-call and incident response

Here are all of the hidden time-consumers of on-prem that folks forget about when thinking about cloud being "difficult":

- os patching and maintenance

- network maintenance

- driver patching

- library updating and maintenance

- BACKUPS

- redundancy

- disaster recovery

- availability


We have 7 racks, 3 people, and the actual hardware work is a minuscule part of that. A few hundred VMs, anything from "just software running on a server" to k8s stacks (the biggest one is 30 nodes), 2 Ceph clusters (ours and clients'), and a bunch of other shit.

The stuff you mentioned is, amortized, around 20% (automation ftw). The rest of it is stuff that we would do in the cloud anyway, and the cloud is in general harder to debug too (we have a few smaller projects managed in the cloud for customers).

We did the calculation to move to the cloud a few times now; it was never even close to profitable, and we wouldn't save on manpower anyway as 24/7 on-call is still required.

So I call bullshit on that.

If you are a startup, by all means go cloud.

If you are small, go ahead with cloud too; it's not worth doing on-prem.

If you have spiky load, cloud or hybrid will most likely be cheaper.

But if you have constant (by that I mean difference between peak and lowest traffic is "only" like 50-60%) load and need a bunch of servers to run it (say 3+ racks), it might actually be cheaper on-site.

Or a bunch of dedicated servers. Then you don't need to bother managing hardware, and in case of a boom you can even scale relatively quickly.


This is the fiction that CTOs believe - "it's simply not practical to run your own computers, you need cloud".


Every one of your examples in the second list is relevant to both on-prem and cloud. Also, cloud has on-call too, just not for the hardware issues (you'll still likely get a page for reduced availability of your software).


The problem here is “cloud” can mean different things.

If you're talking about virtual machines running in a classical networking configuration then you're not really leveraging "the cloud" — all you've done is shifted the location of your CPUs.

However if you’re using things like serverless, managed databases, SaaS, then most of the problems in the second list are either solved or much easier to solve in the cloud.

The problem with “the cloud” is you either need highly variable on-demand compute requirements or a complete re-architecture of your applications for cloud computing to make sense. And this is something that so many organisations miss.

I’ve lost count of the number of people who have tried to replicate their on-prem experience to cloud deployments and then came to the same conclusions as yourself. But that’s a little like trying to row a boat on land and then saying roads are a rubbish way to filter traffic. You just have to approach roads and rivers (or cloud and on-prem) deployments with a different mindset because they solve different problems.


Yeah, but you still need alerts to see if your lambda breaks. But yes, the managed solutions save a lot of time and effort.


Absolutely. Observability is paramount regardless of where and how your application runs.


This is simply not true unless you build in the cloud the same way you build on prem and just have a bunch of VMs. PaaS services get you away from server / network / driver maintenance and handle disaster recovery and replication out of the box. If you're primarily using IaaS, you likely shouldn't be in the cloud unless you're really leveraging the bursting capabilities.

https://robertgreiner.com/content/images/2019/09/AzureServic...


“Just not for the hardware issues” is a huge deal though. That’s an entire skillset you can eliminate from your requirements if you’re only in the cloud. Depending on the scale of your team this might be a massive amount of savings.


At my last job, I would have happily gone into the office at 3am to swap a hard drive if it meant I didn't have to pay my AWS bill anymore. Computers are cheap. Backups are annoying, but you have to do them in the cloud too. (Deleting your Cloud SQL instance accidentally deletes all the automatic backups; so you have to roll your own if you care at all. Things like that; cloud providers remove some annoyance, and then add their own. If you operate software in production, you have to tolerate annoyance!)

Self-managed Kubernetes is no picnic, but nothing operational is ever a picnic. If it's not debugging a weird networking issue with tcpdump while sitting on the datacenter floor, it's begging your account rep for an update on your ticket twice a day for 3 weeks. Pick your poison.


> At my last job, I would have happily gone into the office at 3am to swap a hard drive if it meant I didn't have to pay my AWS bill anymore

This seems foreign to many people, but I’d happily take on this responsibility if I get the attendant benefits.

Also incentivizes me to make things robust enough that I never have to.


The flip side is there is an entirely new skillset required to successfully leverage the cloud.

I suspect those cloud skills are also in higher demand and therefore more expensive than hiring people to handle hardware issues.

Personally, I appreciate the contrarian view because I think many businesses have been naive in their decision to move some of their workloads into the cloud. I'd like to see a broader industry study that shows what benefits are actually realized in the cloud.


Right. The skillset to pull the right drive from the server and put the replacement in.

That says you know nothing at all about actually running hardware, as the bigger problem is by far "the DC might be a 1-5 hour drive away" or "we have no spare parts at hand", not "fiddling with the server is super hard".


Kubernetes is an amazing tool. Cloud computing is a powerful way to leverage a small team and prototype stuff quickly.

Ocean going ships are impressive pieces of kit. CNC machine tools are a powerful way to leverage small teams and manufacture high quality stuff quickly.

Now, telling every repair business in town they need robotic lathes and a fleet of major cargo ships is nonsense.

Why this kind of discourse thrives in software is beyond me.


Omg, they could have literally just solved all of this by choosing Cloud Run (a thin wrapper on top of Knative) rather than running GKE directly.


I so wanted to read the page but after 10 seconds of the high contrast white text on pure solid black background I started developing a headache.


Yes, very weird, I got kind of dizzy.


Because I have some light sensitivity issues, I use browser extensions including Dark Reader and Midnight Lizard to enforce my own 'dark mode' across the web.

You can also use extensions like that to set the contrast to a more comfortable level on websites that are already dark.

I highly recommend this if you have light sensitivity issues like me.

Also note that when the contrast on a page is higher, you can generally get away with lower brightness. This is pretty convenient on phones, and probably more necessary as well since on a phone you're more likely to have an OLED screen that really surfaces extreme contrast like white on black.


There are some great web extensions for a lot of things. I don't use any of them because most of them require permissions to read data across all sites; that makes sense for them to work, but it's why I'm not using any of them.


Fair enough. I only use long-lived, open-source browser extensions for that kind of global restyling. But of course there's still a risk that they could be compromised somehow.


37 Signals has many technical people and can afford to de-k8s. But K8s is designed for a totally different use case: large corps where most staff are non-IT but where IT resources and standards need to be managed in a more central way. Most standard banks or large companies do not want to roll this stuff by hand; they care about STANDARDS!


That's the thing with technology though, it goes mainstream as adoption grows. RoR started small at 37signals and eventually became a standard. MRSK might yet be one; there's no telling right now.


Cloud is important for those who need it.

We did run a fast growing startup (sometimes 100% MoM jumps), 5M active users with 50k concurrent users (not visitors) with DB writes on 6 machines + 2 DB servers and $100M ARR 10 years back. If you're this size, MRSK makes total sense.

If you're much larger or growing >50% MoM continuously, K8s in the cloud makes more sense.


If only Ruby had real concurrency and the memory didn't bloat like crazy.... you wouldn't need 90% of the hardware.



Neither cloud nor on-prem fits everyone's requirements, but in the end you need to know your environment well. One thing I like about ECS and Fargate is that you can use projects like my_init and get a container to behave closer to a VM (running ssh and other daemons at the same time).


To me the really surprising thing is that they still use Capistrano for deploying basecamp!


Maybe because Capistrano is written in Ruby and the language matches their internal products? That was my only guess.


I was guessing that they kept using Capistrano because it still worked. No need to change something that’s working…

(Somewhat of an ironic comment when talking about an article about ditching K8s…)


That's a name I haven't heard since I was deploying my Jekyll blog during my apprenticeship at Thoughtbot. Wow.


After reading tfa - I'm actually a bit confused about MRSK. I'm not sure I'd want to run MRSK and Chef? They seem more orthogonal than complementary?

Edit: at least between Capistrano and Chef it seems that MRSK is redundant?


It’s pretty sad that both Google and AWS price their on premise versions of their hardware at the same price as if you were running in their cloud.

Makes it totally a non-starter to be a universal platform.


Larry Ellison was right.


I think there is a space between managed K8s on the cloud and e.g Ansible managed docker deploys on-prem. I'm curious to see how it pans out.


How does the overall cost of infrastructure compare, at the end of something like this? How fast does the move back pay off in terms of ROI?


Online deployments are discussed with some frequency. Tooling is talked about. Always as a "cluster". Why do we need clusters anymore? Scaling containers, scaling functions, scaling, scaling, cluster, cluster. We suffer so much tunnel vision about horizontal scaling when it's simply unnecessary for most applications. The cloud products are all about horizontal.

Do you really need more than the 400 threads, >12TB of RAM, and PBs of storage found in a reasonably high-end server?


Getting a real dokku vibe from this. Although it's currently limited to one container per machine, which is a huge limitation.


Well... in my book, k8s has always been in the "dinosaur" category: somewhat useful, somewhat versatile, perhaps even good. A quick glance at the documentation eradicates any desire to learn the tech.


How much easier is mrsk vs k3s?


Looks like an apples vs oranges comparison. They seem to have a low number of distinct services, so there isn't a real need for k3s/k8s (i.e. orchestration); on the other hand, they need config management.


I'm not sure if anyone other than 37Signals is using it at scale yet, so you may get a better idea by looking at the docs yourself.

https://github.com/mrsked/mrsk


Oh, a YAML-based DSL to deploy stuff, how original!

Now we only need a template-based generator for those YAMLs and we will have all the worst practices for orchestration right here, just like k8s + helm.


Have they thought of just running OpenStack on their own servers? Everything I saw leads me to use SaltStack + OpenStack if they don't wanna be on the cloud.


Have they tried Swarm?


I have to imagine part of the reason they need to run so many servers is because they are running Ruby. The same application on say, Elixir, probably would require less hardware, reducing the cost of ECS or similar.


How many users / how much traffic does Tadalist have?


Why are there VMs?


If I was Netflix I would de-cloud, but if I was a small team like 37signals then de-clouding is just insanity. I think DHH is either very stupid or extremely naive in his cost calculations or probably a mix of both. Hey and Basecamp customers will see many issues in the next few years and hackers will feast off their on-premise infrastructure.


They’ve had non-cloud infrastructure for a very long time. Their new orchestration methods notwithstanding, reliability and security are unlikely to suffer.


We used to call this a “sudden outbreak of common sense”


On-prem is where companies go to die.


I find it very interesting that every conversation around k8s turns into a flame war between "just use k8s" and "no you don't need k8s at all". In reality, it is probably more of a spectrum than a boolean value. Also, it seems like people have a different definition of "using kubernetes":

* manage your own k8s cluster on your own hardware: probably pretty hardcore. I've never done this; I'd imagine it'd require me to know about the underlying hardware, diagnose issues and make sure the computer itself is running before managing k8s itself. Only when the hardware is running properly can I focus on running k8s, which is operationally expensive as well. Tbh I don't see a reason for a small/mid-scale product to go this route unless they have a very specific reason.

* manage your own k8s cluster on cloud hardware: this seems a bit simpler, meaning that I don't actually need to know much about running/managing hardware; that's what the provider does for me. I have done this before with k3s for some small applications; I have 2 small-scale applications running like this for ~2 years now on Oracle's free ARM instances, and I don't really do any active work/maintenance on them and they are running just fine. I'd probably have a lot of trouble if I wanted to upgrade the k3s version for large-scale applications, or for use cases that have tight SLAs.

* use a managed k8s offering from a cloud provider: I've been doing this one the most, and I find it the easiest way to run my applications in a standardized way. I have experience running applications on this setup for mid-scale as well as multi-national, large-scale consumer-facing applications. Admittedly, even though the scale has been big, the applications themselves have been mostly CRUD APIs, RabbitMQ / Kafka consumers and some scheduled jobs most of the time.

The trick seems to lie in the word "standardized" here: it is probably possible to run any application on any sort of hardware/orchestration combination, and MRSK could be a really nice solution for that as well. However, in my personal experience I have never managed to find an easier way of running multiple full applications, e.g. things that have multiple components such as web APIs, async workers, etc, in a standardized, replicable way.

I run the following components in one of my cloud-managed k8s clusters:
- Vault
- A few Laravel applications
- A few Golang APIs
- Grafana & Loki
- Metabase

Using k8s for situations like this, where the specific requirements from the underlying infra are not very complex, actually enables a lot of experimentation / progress simply thanks to the ecosystem. For all of these components there are either ready-made Helm charts where I can simply run a `helm install` and be 90% there, or it is trivial to build a simple K8s deployment configuration to run them. In my experience, I couldn't find anything that comes closer to this without having a large engineering team dedicated to solving a very specific problem. In fact, it has been pretty chill to rely on the managed k8s offerings and just focus on my applications.

It's a spectrum: there are a billion cases that don't need k8s, and there are probably a similar amount that could actually benefit from it. There's no absolute truth to it other than the fact that k8s is actually useful for certain cases and it is for sure not always "resume driven development". This doesn't mean that we shouldn't be looking out for better approaches, there's probably a lot of accidental complexity around it as well, but we could also acknowledge that it is actually a useful piece of software.

I don't know, I feel like I have to pick sides every time this sort of stuff is discussed, as if there is an objective truth, but I am fairly convinced these days that there is a middle ground that doesn't involve fanaticism in either direction.


mrsk introduction screencast by DHH explains the process well, and is pretty impressive as usual: https://m.youtube.com/watch?v=LL1cV2FXZ5I


I've never read such a ridiculous article. I really wanted to give them the benefit of the doubt but good lord. How is any of this simpler or better? It's like they prefer the pain of 2004 mixed with the worst parts of modern infrastructure. The dream of the 2000s really is alive in DHH's head, isn't it?

TFA links to their cloud spend for 2022[0], wherein lies the rub:

> In total, we spent $3,201,564 on all these cloud services in 2022. That comes out to $266,797 per month. Whew!

> For HEY, the yearly bill was $1,066,150 ($88,846/month) for production workloads only. That one service breaks down into big buckets as follows:

What the actual fuck? THREE MILLION DOLLARS? A million for their email service?? I have seen bills much larger, but for what 37signals does I am shocked. There is surely a ton of low hanging fruit to drop the bill despite the claim that it's as optimized as it can get. No way.

Even then, Hey is $99/year, and they claimed to have 25k users in the first month or so as of 2020, that's nearly $2.5MM. I presume they've grown since then. Another 2020 article[2] mentions 3/4 of their users have the iOS app, and the Android app currently shows "50k+ installs" so let's assume we're talking 200-400k users as a ceiling, ignoring attrition, which would pull $20-40MM. Even if it's half that, the cost doesn't seem unreasonable.

They're spending nearly $90k/mo on Hey. Of that, the majority is RDS and OpenSearch. TFA makes it clear they know how to run MySQL, so why on earth don't they stop using RDS? Both of these could easily be halved if they ran the services themselves.

EKS is practically free so whatever. They state they have two deployments for ~$23k/mo total -- production is likely larger than staging but let's assume they're equal -- or ~$12k/mo each. A middle of the road EC2 instance like m4.2xlarge is less than $215/mo which gets more than enough cores and memory to run a rails app or two per node. That works out to around 55 nodes per environment. This benchmark[3] shows an m4.2xlarge can serve 172req/s via modern Ruby on Rails. At 500k users that works out to over 1600 request/user/day which seems excessive but likely within an order of magnitude of reality. These are the folks who wrote RoR so I would hope they can optimize this down further. <10000req/s for $12k/mo is pretty awful, and I'm being conservative.
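
Back-of-envelope check on those numbers, using the same assumptions stated above (~$12k/mo per environment, ~$215/mo and 172 req/s per m4.2xlarge, 500k users):

    monthly_budget = 12_000        # $/mo per environment, from the EKS figure
    node_cost = 215                # $/mo for an m4.2xlarge, roughly
    req_per_node = 172             # req/s from the linked Rails benchmark
    users = 500_000

    nodes = monthly_budget // node_cost              # ~55 nodes
    peak_rps = nodes * req_per_node                  # ~9,500 req/s
    req_per_user_day = peak_rps * 86_400 / users     # ~1,600 requests/user/day

    print(nodes, peak_rps, round(req_per_user_day))

Which lines up with the ~55 nodes, <10k req/s and ~1600 requests/user/day figures above.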

Then let's talk about the roughly $1MM/year S3 bill. I'm not sure how to make 8PB cost that much, but even the lightest touch at optimizing storage or compression or caching knocks the cost down.

This is all just nuts. There's no reason this all shouldn't be running on AWS or GKE with a much smaller bill. Their apps are predominantly CRUD, some email. Instead they replaced kubernetes with an in-house monstrosity.

I am bewildered.

[0] https://dev.37signals.com/our-cloud-spend-in-2022/

[1] https://www.theverge.com/2020/6/22/21298552/apple-hey-email-...

[2] https://m.signalvnoise.com/on-apples-monopoly-power-to-destr...

[3] https://www.fastruby.io/blog/rails/ruby/performance/how-fast...


I think RoR is the core problem here.



