Regardless of the merits or drawbacks of "de-clouding" for this particular company, it seems to me that their ops team is just really bored or permanently unsatisfied with any solution.
They say that they've tried deploying their apps in all of:
* Their own Datacenter
* ECS
* GKE
* EKS
* and now back to their own Datacenter
Even with their new "de-clouded" deployment, they seem to have created an absolutely immense amount of complexity to deploy what appears to be a collection of fairly generic Ruby CRUD apps (I might be wrong, but I didn't see anything suggesting otherwise in the post).
They have a huge list of tools and integrations that they've tried out with crazy names: Capistrano, Chef, mrsk, Filebeat, traefik... Complexity-wise it seems on par with a full K8s deploy with all the bells and whistles (logging, monitoring, networking, etc.)
Google says that this company, 37signals, has 34 employees. That seems like a monumental amount of orchestration and infrastructure work unless they're deploying something far more complex than they're letting on.
Idk what the lesson is here, if there is one, but this seems like a poor example to follow.
We're talking about a product that has existed since 2004. They did:
* Their own data center, before Docker existed
* The non-K8s Docker way of deploying in AWS
* The GCP, and then AWS, ways of doing K8s Docker
* Docker in their own data center.
For 20 years of deployment, that doesn't look crazy to me.
The actual components of each environment are pretty standard. They wrote Capistrano, which predates Chef. Filebeat is a tiny part of ELK, which is the de facto standard logging stack. They use a smart reverse proxy, like... everybody else. It's easy to make anything sound complicated if you expand the stack into as many acronyms as you can.
Also, it might be worth calling out: their product launched in 2004, Linode and Xen were launched in 2003, and S3 and EC2 launched in 2006. The cloud as we know it today didn't exist when they started.
Pretty sure they knew the Linode folks and were on there early, if I recall my history correctly. This is from randomly hanging out with one of the Linode owners at a bar in STL back then.
Whether DHH is "right" in some philosophical sense, this is a small company with a lot of technical experience in a variety of technologies and with presumably a lot of technical chops, so generalizing their experience to "cloud is good" or "cloud is bad" isn't really possible.
I mean, I work for a cloud hosting vendor. I'm not saying one side or the other is right, only that people who are dunking on 37signals for this are telling on themselves.
"their own datacenter" both previously and now almost certainly means renting bare metal or colocation space from a provider. I highly doubt they have physically built their own datacenter from scratch
"renting bare metal or colocation space from a provider"
Those are two totally, completely different things. Their own datacenter means their own equipment in a datacenter and could even mean building out their own datacenter. It never, ever means renting bare metal.
Weird, in my company where we are doing the opposite migration (from traditional datacenter where we manage the physical servers to Azure) this is exactly what we mean and say and how we describe it
We talk about "our datacenter" when we really mean racks of servers we rented from Insight, and we say "the cloud" when we refer to Azure. We've never actually had our own datacenter meaning a building we own and manage the entire physical plant of
Almost no one means it that way. Even Twitter is probably leasing colocation space in the "their own datacenter" category vs. GCP and AWS. The evidence is in the fact that Elon was able to just arbitrarily shut down an "entire datacenter". Or that 37signals was able to just arbitrarily move into "their own datacenter" on a whim
Referring to rented servers as colocated servers is flatly wrong, no matter how often people are incorrect about it. Sure, some providers put colocation under the same category as VMs and leased hardware, but that doesn't make them overlap.
OTOH, referring to a datacenter of servers that you lease as a datacenter is one thing, but if you have zero hardware that you own in it, would it really be your datacenter, or would it be "the datacenter"?
A datacenter could be anything from a set of IKEA shelves in a room with Internet and power to a fully built out fancy space with redundant power, fire suppression, a full Internet exchange, et cetera, so it's a bit gatekeepery to try to suggest that only huge companies would ever have their own datacenter or their own space with their own hardware in a datacenter.
The fun part is that they do not understand what it means to have your "own datacenter" vs. renting servers in a colo. It does not matter if you are running on AWS or on Hetzner; it is somebody else's computer.
We were a similar sized company at about the same time - we owned our data centers in the same way we owned our offices - we leased and occupied them. Sure, if the plumbing sprouted a leak the landlord would come in and fix it, but no one would be confused enough to say we didn't have our own office space.
"The fun part is that they do not understand"
YES, 37Signals, a company with a legendary pedigree of pushing technical boundaries and being open-minded about deployment models, totally doesn't know the simple thing that you do.
I don't understand how the first clause in this sentence connects to the second.
With a simple, predictable workload --- what they have --- it can make sense to lean towards static scheduling, rather than dynamic schedulers. K8s and Nomad are both dynamic schedulers.
This is pretty basic stuff; it's super weird how urgently people seem to want to dunk on them for not using K8s. It comes across as people not understanding that there are other kinds of schedulers, as if "scheduling" could only mean what Borg did.
We did! And it did work. And there are def some great things that I (we) love about k8s. Personally, the declarative aspect of it was chef's kiss. "I want 2 of these and 3 of these, please", and it just happens.
Which is the primary reason why we did investigate k8s on-prem. We had already done the work to k8s-ify the apps, let's not throw that away. But running k8s on-prem is different than running your own k8s in the cloud is different than running on managed k8s in the cloud.
Providing all of the bits k8s needs to really work was going to really stretch our team, but we figured with the right support from a vendor, we could make it work. We worked up a spike of harvester + rancher + longhorn and had something that we could use as if it were a cloud. It was pretty slick.
Then we got the pricing on support for all of that, and decided to spend that half million elsewhere.
We own our hardware, we rent cabs and pay for power & network. We've got a pretty simple pxeboot setup to provision hardware with a bare OS that we can use with chef to provide the common bits needed.
It's not 'ultimately flexible in every way', but it's 'flexible enough to meet the needs of our workloads'.
What is your position at 37Signals and how do you like it? I'm really impressed by the innovation that comes out of you guys and the workplace culture you folks have.
Bare vanilla k8s or k3s is nice but it doesn't do much outside of your homelab. Once you want k8s in production in the cloud you have to start thinking about:
- loadbalancing and ingress controller
- storage
- network
- iam and roles
- security groups
- centralized logging
- registry management
- vulnerability scanning
- ci/cd
- gitops
And all this is no less complex with k8s than with nomad, bare docker or whatever they chose. And definitely no less complex because it is on a major cloud provider.
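For a sense of what just the first bullet turns into once you've picked and installed a controller, the routing rule itself is yet another manifest (a minimal sketch; the host and service names are hypothetical), and that's before TLS, the controller's own deployment, and whatever load balancer sits in front of it:

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: app-ingress            # hypothetical name
spec:
  ingressClassName: nginx      # assumes an nginx ingress controller is installed
  rules:
    - host: app.example.com    # hypothetical hostname
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: app      # hypothetical Service fronting the pods
                port:
                  number: 80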
Hey Melingo, I noticed that you responded to a lot of different threads in this post. It seems like you are a bit dismissive of people's experiences using K8s. I have also run K8s at scale, and it is not easy, and it is not out of the box in cloud providers. There are a ton of addons, knobs, and work that has to be done to build a sustainable and "production ready" version of K8s (for my requirements) in AWS.
K8s is NOT easy, and I do not believe that in its current form it is the pinnacle of deployment/orchestration technologies. I am waiting for what is next, because the pain that I have personally experienced around K8s, which I know others are feeling as well, means it is not a perfect solution for everything, and definitely not usable for everyone.
At the end of the day it's a tool, and it is sometimes difficult to work with.
I know you are sharing your experience, and others are as well. Let's not dismiss others' experiences just because they don't match our own; the truth is most likely somewhere in the middle. Especially when so many people are saying they've had pain using K8s.
The initial deployment for EKS requires multiple plugins to get to something that is "functional" for most production workloads. K8s fails in spectacular ways (even using Argo, worse using Argo TBH) that require manual intervention. Local disk support for certain types of workloads is severely depressing. Helm is terrible (templating Yaml... 'nuff said). Security groups, IAM roles, and other cloud provider functions require deep knowledge of K8s and the cloud provider. Autoscaling using Karpenter is difficult to debug. Karpenter doesn't gracefully handle spot instance cost.
I could go on, but these are the things you will experience in the first couple of days of attempting to use K8s. Overall, if you have deep knowledge of K8s, go for it, but it is not the end-all solution to infra/container orchestration in my mind.
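To make the Helm point concrete: charts template a whitespace-sensitive format with text substitution, so even a trivial fragment ends up looking like the sketch below (values and names are illustrative, following the usual `helm create` layout), and getting the `nindent` count wrong just renders invalid YAML:

# values.yaml
replicaCount: 2
resources:
  limits:
    memory: 512Mi

# templates/deployment.yaml (fragment)
spec:
  replicas: {{ .Values.replicaCount }}
  template:
    spec:
      containers:
        - name: app
          resources:
            # the rendered block's indentation is a magic number maintained by hand
            {{- toYaml .Values.resources | nindent 12 }}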
I fought with a workload for over a day with our K8s experts; it took me an hour to deploy it to an EC2 ASG for a temporary release while moving it back to K8s later. K8s IS difficult, and saying it's not has a lot of people questioning the space.
The way I see it is it starts off easy, and quickly ramps up to extremely complex. This should not be the case.
I worked at a company that had their own deployment infra stack and it was 1000x better than K8s. This is going to be the next step in the K8s space I believe and it may use K8s underneath the covers, but the level of abstraction for K8s is all wrong IMO and it is trying to do too much.
The main issues we faced with over 700 VMs were: outdated OS, full disks, full inodes, broken hardware, missing backups or missing backup strategy, OOM.
K8s itself handles much of this: it fixes out-of-memory by restarting the pod, solves storage by shipping logs off the node and killing a pod if it still fills up, and has rollout strategies, health checks, and readiness probes.
It provides an easy deployment mechanism out of the box, adding a domain is easy, and certificates get renewed centrally and automatically.
Scaling is just a replica number, and you have node auto-upgrade features built in.
K8s provides what people build manually out of the box, certified, open sourced and battle tested.
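Concretely, most of that is expressed in a single Deployment manifest. A minimal sketch (the name, image, port, and probe path are placeholders, not anyone's real config):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: web
spec:
  replicas: 3                  # "scaling is just a replica number"
  strategy:
    type: RollingUpdate        # built-in rollout strategy
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
        - name: web
          image: registry.example.com/web:1.2.3   # placeholder image
          ports:
            - containerPort: 8080
          resources:
            limits:
              memory: 512Mi    # exceeding the limit restarts the pod, not the node
          readinessProbe:      # traffic only reaches pods that report ready
            httpGet:
              path: /healthz
              port: 8080
          livenessProbe:       # unhealthy pods get restarted automatically
            httpGet:
              path: /healthz
              port: 8080

The trade-off, as the rest of the thread points out, is everything you have to run underneath to make that manifest mean something.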
> The paradigm shift alone, from doing things step by step vs. describing what you need and then it happens, is a game changer.
I've actually used both in conjunction and it was decent: Ansible for managing accounts, directories, installed packages (the stuff you might actually need to run containers and/or an orchestrator), essentially taking care of the "infrastructure" part for on-prem nodes, so that the actual workloads can then be launched as containers.
In that mode of work, there was very little imperative about Ansible, for example:
- name: Ensure we have a group
  ansible.builtin.group:
    name: somegroup
    gid: 2000
    state: present

- name: Ensure that we have a user that belongs to the group
  ansible.builtin.user:
    name: someuser
    uid: 3000
    shell: /bin/bash
    groups: somegroup
    append: yes
    state: present
This can help you set up some monitoring for the nodes themselves, install updates, mess around with any PKI stuff you need to do and so on, everything that you could achieve either manually or by some Bash scripts running through SSH. Better yet, the people who just want to run the containers won't have to think about any of this, so it ensures separation of concerns as well.
Deploying apps through Ansible directly can work, but most of the container orchestrators might admittedly be better suited for this, if you are okay with containerized workloads. There, they all shine: Docker Swarm, Hashicorp Nomad, Kubernetes (K3s is really great) and so on...
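For the container-host part specifically, the same declarative pattern covers it; a minimal sketch using the standard builtin modules (the package name is an assumption and varies by distro):

- name: Ensure the container engine is installed
  ansible.builtin.package:
    name: docker.io            # e.g. docker-ce on other distros
    state: present

- name: Ensure the engine is running and enabled at boot
  ansible.builtin.service:
    name: docker
    state: started
    enabled: yes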
I'm on GKE. The hosts and control plane are managed for me. All I need to do is build/test/security scan images and then promote/deploy the image (via Helm) when it goes out to prod.
Using config management and introducing config drift and management of the underlying operating system is a lot more to think about, and a lot more that can go wrong.
So you did automation in a broken way. Here's one way to avoid the issues you described on bare metal:
- Only get servers with IPMI so you can remote reboot / power cycle them.
- Have said servers netboot so they always run the newest OS image.
- Make sure said OS image has a config that isn't broken so you don't get full inodes and so it cycles logs.
- Have the OS image include journalbeat to ship logs.
- Have your health checks trigger a recovery script that restarts or moves containers using one of a myriad of tools; monitoring isn't exactly a new discipline.
Yes, it means you have to have a build process for OS images. Yes, it means you need to pick a monitoring system. And yes, it means you need to decide on a scheduling policy.
I wrote an orchestrator pre-K8S that was fewer LOC than the yaml config for my home test K8S cluster. Writing a custom orchestrator is often not hard, depending on your workload; writing a generic one is.
K8S provides one opinionated version of what people build manually, and when it's a good fit, it's great. When it isn't, I all too often see people spend more time trying to figure out how to make it work for them than it would've taken them to do it from scratch.
I ran 1000+ VMs on a self developed orchestration mechanism for many years and it was trivial. This isn't a hard problem to solve, though many of the solutions will end up looking similar to some of the decisions made for K8S. E.g. pre-K8S we ran with an overlay network like K8S, and service discovery, like K8S, and an ingress based on Nginx like many K8S installs. There's certainly a reason why K8S looks the way it does, but K8S also has to be generic where you can often reasonably make other choices when you know your specific workload.
And you don't think k8s made your life much easier?
For me it's now much more about proper platform engineering and giving teams more flexibility again, knowing that the k8s platform is significantly more stable than anything I have seen before.
No, I don't for that setup. Trying to deploy K8S across what was a hybrid deployment across four countries and on-prem, colo, managed services, and VMs would've been far more effort than our custom system was (and the hardware complexity was dictated by cost - running on AWS would've bankrupted that company).
> They have a huge list of tools and integrations that they've tried out with crazy names; Capistrano, Chef, mrsk, Filebeat, traefik
These tools are pretty stock standard in the Systems Engineering world. I think anyone who's been a Systems Engineer and is over 30 has probably deployed every one of these.
One thing I've learned over my mixed SWE and SE career is that infrastructure is expensive and grows regardless of revenue. I didn't truly appreciate this until I launched Kubernetes on Digital Ocean and began running my personal cloud on it. It was costing me over $100/m for very little. That money was gone whether I pushed a ton of VPN traffic over my mesh or not. It didn't care about how much I stored on the disk I reserved, and frankly, that cost was going to grow as time went on. I pulled the plug, set up servers in my house, and wired up Traefik and Docker Compose V2 with a little Tailscale sprinkled on. The servers stay up to date with some scripts, and I deploy new apps on select servers with Docker Compose and Docker profiles.
It's possible for companies to do similar things, but not to the extremes I took it to. A really good infrastructure SWE generally goes for $300k. You can pay people with expertise in these things who can streamline them and create maintainable products out of your infrastructure, or you can pay for Legos and glue from a managed service provider like AWS, GCP, or Azure. At some point the latter's costs will not scale; you'll pivot and cost-reduce many times, maybe even begin rearchitecting. I think there are a lot of companies that are now realizing the cheap money is gone, and the cloud has somewhat relied on cheap money.
This is the company that gave birth to Ruby on Rails. They appear to have a culture of being very opinionated about their tools and unafraid of doing things their own way.
Probably not an example most companies that size ought to follow but I'm glad they were crazy enough to do it!
I think you're right that they're doing it for fun or because they can, primarily. But I am excited to see them pioneer in this area, both because it's more open and hacker friendly, and because they're moving the needle towards healthy competition amongst the providers.
Our big-three cloud hegemony has already shown its ugly sides, both in terms of price (egress, anyone?) and quality (hello, zero interop and opaqueness). I'd argue we've seen significant complexity increases, especially in server-side tech, in the last 5-10 years, with relatively little to show for it, despite massive economic investments. I expect that trend to continue unless we take back the hacker friendliness of infra & ops.
PS. Actually scratch that I’m excited, that’s an understatement. I’m thrilled!
I wonder how much of these movements are them iterating and hunting for ROI in their infrastructure costs. Did GCP and AWS salespeople sell them on the benefits of the cloud, offer discounts, white glove migration help, showed some calculation on how much $$ they will save in the cloud, etc that on paper sounded great, but wasn’t ultimately a good fit?
Their market is probably saturated, and perhaps declining, such that they are reaching for optimizations elsewhere.
There is no such thing as "saving money in cloud".
It is all about convenience and it always costs more than a smart team could achieve elsewhere.
I tend to hear an argument that it is cheaper since you do not have to pay people to maintain those services, but in reality you still need that person to set up and maintain your particular cloud setup. And the services themselves are much much more expensive than maintaining your own servers in a data center.
In my opinion cloud hosting and services are more meant for large corporations where no one wants to take responsibility and is scared of doing anything. Cloud is a nice way to shift the blame if/when things go bad - "but cloud is industry standard, everyone does it".
The Hacker News crowd is drinking its own Kool-Aid on this topic and not recognizing how much cost can be avoided if they just drop EKS from their stack.
Remember that in SRE all the abstractions are leaky and thus having more abstractions means having more complexity not less.
When I read stuff like this it strikes me that probably, by far, their largest operational expense is their staffing cost to orchestrate all of this. I come from a background of running small startups on a shoestring budget. I need to make tough choices when it comes to this stuff. I can either develop features or start spending double digit percentages of my development budget on devops. So, I aim to minimize cost and time (same thing) for all of that. At the same time, I appreciate things like observable software, rapid CI/CD cycles, and generally not having a lot of snowflakes as part of my deployment architecture. I actually have worked with a lot of really competent people over the past two decades and I like to think I'm not a complete noob on this front. In other words, I'm not a naive idiot but actually capable of making some informed choices here.
That has led me down a path of making very consistent choices over the years:
1) no kubernetes and no microservices. Microservices are Conway's Law mapped to your deployment architecture. You don't need that if you do monoliths. And if you have a monolith, kubernetes is a waste of CPU, memory, and development time. Complete overkill with zero added value.
2) the optimal size of a monolith deployment is 2 cheap VMs and a load balancer. You can run that for tens of dollars in most popular cloud environments. Good enough for zero down time deployments and having failover across availability zones. And you can scale it easily if needed (add more vms, bigger vms, etc.).
3) those two VMs must not be snowflakes and must be replaceable without fanfare, ceremony, or any manual intervention. So use Docker and docker-compose on a generic Linux host, preferably of the managed variety. Most developers can write a simple Dockerfile and wing it with docker-compose; it's not that hard, and it makes CI/CD really straightforward: put the thing in the container registry, run the thing. Use something like Github actions to automate. Cheap and easy (a minimal compose sketch follows this list).
4) Use hosted/managed middleware (databases, search clusters, queues, etc). Running that stuff in some DIY setup is rarely worth the development time and operational overhead (devops, monitoring, backups, upgrades, etc). All this overhead rapidly adds up to costing more than years of paying for a managed solution. If you think in hours and market conform rates for people even capable of doing this stuff, that is. Provision the thing, use the thing, and pay tens of dollars per month for it. Absolute no brainer. When you hit thousands per month, you might dedicate some human resources to figuring out something cheaper.
5) Automate things that you do often. Don't automate things that you only do once (like creating a production environment). Congratulations, you just removed the need for having people do anything with terraform, cloudformation, chef, puppet, ansible, etc. Hiring people that can do those things is really expensive. And even though I can do all of those, it's literally not worth my time. Document it, but don't automate it unless you really need to, and spend your money on feature development.
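To make point 3 concrete, the per-VM footprint can be as small as a compose file along these lines (a sketch; the image, port, and health endpoint are placeholders, and it assumes curl exists in the image). CI then just runs `docker compose pull && docker compose up -d` on each VM:

services:
  app:
    image: registry.example.com/myapp:latest   # placeholder image pushed by CI
    restart: unless-stopped
    ports:
      - "8080:8080"
    env_file: .env                             # secrets stay out of the repo
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8080/health"]
      interval: 30s
      timeout: 5s
      retries: 3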
But when I need to choose between hiring 1 extra developer or paying similarly expensive hosting bills, I'll prefer to have the extra developer on my team. Every time. Hosting bills can be an order of magnitude cheaper than a single developer on a monthly basis if you do it properly. For reference, we pay around 400/month for our production environment. That's in Google cloud and with an Elastic Cloud search cluster included.
Other companies make other choices of course for all sorts of valid reasons. But these work fine for me and I feel good about the trade offs.
Agree entirely. I think system design interviews are partly to blame because they select for people who think that the only way to design a system is the cargo cult method that interview prep books and courses preach, which is:
- break everything into microservices
- have a separate horizontally scalable layer for load balancing, caching, stateless application server, database servers, monitoring/metrics, for each microservice.
- use at least two different types of databases because it's haram to store key-value data in an RDBMS
- sprinkle in message-passing queues and dead-letter queues between every layer because every time you break one system into two, there can be a scenario where one part is down but the other is up
- replicate that in 10 different datacenters because I'll be damned if a user in New York needs to talk to a server in Dallas
And all this for a service that will see at most 10k transactions per second. In other words, something that a single high-end laptop can handle.
99.9% of the time your architecture does NOT need to look like Facebook's or Google's. 99% of tech startups (including some unicorns) can run their entire product out of a couple of good baremetals. Stop selecting for people who have no grounding in what normal complexity is for a given scale.
I can't agree more on this. Most products out there with medium to low traffic can be handled just fine like this. The cost of automation is often not worth the financial effort.
There's a dangerous trend in putting microservices everywhere. Then having the same level of quality as a monolith requires an infinite amount of extra work and specialized people. Your product must be very successful to justify such expenses!
My rule of thumb; monolith and PaaS as long as your business can afford to.
I mean it all makes sense if you know nothing of k8s or ansible.
Most companies these days have moved to k8s, so a portion of tech workers have prior knowledge of the k8s model and deployment.
Whether you want to go monolith or not doesn't matter, because you need to replicate the process for at least 2 environments: dev and prod. Not to mention it's good to be prepared in case your prod env gets compromised or nuked.
Where, oh god where, are there more sensibly thinking people like you! This is pragmatic and straight forward. There is very little room for technical make work nonsense in your described strategies. Most places, and many devs I meet cannot imagine how to do their jobs without a cornucopia of oddly named utilities they only know a single path of use.
This is actually a really interesting post to me. I'm currently working at the opposite of a startup with a shoestring budget. We're a medium-sized company with 100 - 150 techies in there. As a unique problem, we're dealing with a bunch of rather sensitive data - financial data, HR data, forecast and planning data. Our customers are big companies, and they are careful with this data. As such, we're forced to self-host a large amount of our infrastructure, because this turns from a stupid decision into a unique selling point in that space.
From there, we have about 7 - 12 of those techies working either in my team, saas operations, our hardware ops team, or a general support team for CI/images/deployment/buildserver things. 5 - 10% of the manpower goes there, pretty much.
The interesting thing is: Your perspective is our dream vision for teams running on our infrastructure.
Like - 1 & 2 & 3: Ideally, you as the development team shouldn't have to care about the infrastructure that much. Grab the container image build templates and guidelines for your language, put them into the template nomad job for your stuff, add the template pipeline into your repository, end up with CD to the test environment. Add 2-3 more pipelines, production deployments works.
These default setups do have a half-life. They will fail eventually with enough load and complexity coming in from a product. But that's a "succeed too hard" kind of issue. "Oh no, my deployment isn't smooth enough for all the customer queries making me money. What a bother." And honestly, for 90% of the products not blazing trails, we have seen most problems before, so we can help them fix their stuff with little effort on their part.
4 - We very much want to standardize and normalize things onto simple shared services, in order to both simplify the stuff teams have to worry about and also to strengthen teams against pushy customers. A maintained, tuned, highly available postgres is just a ticket, documented integrations and a few days of wait away and if customers are being pushy about the nonfunctional requirements, give them our guarantees and then send them to us.
The only point I disagree with is Terraform. It is brilliant for this exact scenario because it's self documenting. When you do need to update those SPF records in two years time, having it committed as a Terraform file is much better than going through (potentially stale) markdown files. It's zero maintenance and really simple. Plus its ability to weave together different services (like configuring Fastly and Route53 from the same place) is handy, too.
What if I do this with Terraform using AWS Serverless and staying in the free tier for this workload that you are referencing instead of VMs and a load-balancer?
I just don't see why people prefer the VM based approach over serverless.
There is usually a sweet spot in terms of size where being on the public cloud makes sense, both from a cost and management perspective. Once you go above that size, managing IAM starts becoming a pain. Usually around the same point public cloud costs start becoming noticeable to your finance team, and so you have to start dealing with questions around that. Usually that's a good point to do a sanity check before things get even bigger and more expensive.
Similarly, k8s works well for certain classes of problem, but doesn't work well for all classes of problem. Any form of k8s has an operational overhead, and you really need to make sure that you are going to get the ROI from the effort of maintaining the stack for it to be worthwhile.
> They have a huge list of tools and integrations that they've tried out with crazy names; Capistrano, Chef, mrsk, Filebeat, traefik
I use a lot of this or similar (terraform instead of Chef, logstash instead of filebeat) and I'm a one person team. If anything these tools make my job a lot easier and less complex.
This is very common in almost all web companies since around 2015.
I've never seen a company with a simple infrastructure, no matter how simple their actual application is.
If you choose a slow dynamic language (Ruby/Python) your deployment has to be massively complicated; you have no choice about it.
For one simple reason: you will need a multitude of separate components to be made to work together.
You need many application instances because there's no way one machine can handle all your traffic: Ruby is just too slow.
A sharded database cluster as a source of truth:
You went through the effort of making several application instances with a load balancer: you don't want a single database server to be a single point of failure.
A distributed redis/memcache index to accelerate queries and lower the pressure on the real database.
You might have several index-type engines for different types of queries. Most people use ElasticSearch in addition to Redis.
You need some system to manage all this complexity: monitor it, deploy new versions, rollback to a previous version, run migrations and monitor their state, etc etc.
This is the bare minimum. Most people have a setup that is way more complicated than this. I don't even know how they come up with these complexities, but not only do they come up with them frequently: they love it! To them it's a feature, not a bug.
You are making a lot of assumptions, and many of those are not universal problems, or even problems at all.
Compiled languages eventually need a complicated setup for the very same reasons. There is no such thing as "scales" and "doesn't scale". Even Go or C++ webapps have to be scaled up.
If you can get away without complexity on Go or whatever, good for you. Most companies don't.
It's way too complicated. But if this is all you have ever seen, and if you've been designing such systems for a decade, this seems normal to you.
Here's an alternative stack that can handle over 99% of websites:
- Self contained executable
- One-file database
- Cache is memory
- Text search is a library function
- Indexing is a library function
- Serving http is a library function
Such a stack can handle > 20k concurrent connections (per second). The code doesn't need to be "optimized"; just non-pessimized.
You can scale "vertically" for a very long time, especially in 2020 and beyond, where you have machines with over 64 CPU cores. That's almost a cluster of 64 machines, except in one single machine.
If you _must_ scale horizontally for some reason - maybe you are Twitter/Facebook/Google - then you can still retain the basic architecture of a single executable but allow it to have "peers" and allow data to be sharded/replicated across peers.
Again all the coordination you need is just a library that you embed in the program; not some external monster tool like k8s.
1) a single panic/exception/segfault in the executable brings down the whole website and so it will be unavailable until the executable restarts
2) entropy *always* increases (RAM usage, memory corruption, hardware issues, OS misconfiguration etc.) so eventually the application will break and stop serving traffic until it's repaired/restarted (which can take time if it's a hardware issue)
3) deployments are tricky if there's nothing before the executable (stop, update, restart => downtime)
4) if cache is in-process, on a restart it will have to be repopulated from scratch, leading to temporary slowdowns (+ and maybe a thundering herd problem) which will happen *every time* you deploy an update
I think much of it is ignoreable if the site is just a personal blog or a static site. But if the site is a real time "web application" which people rely on for work, then you still need:
1) some kind of containerization, to deal with inevitable entropy (when a container is restarted, everything is back to the initial clean state)
2) at least two instances of the application: one instance crashes => the second one picks up traffic; or during rolling updates: while one instance is being killed and replaced with a new version, traffic is routed to another instance
3) persistent data (and sometimes caches) need to be replicated (and backed up) -- we've had many hardware issues corrupting DBs
4) automatic failover to a different machine in case the machine is dead beyond repair
>not some external monster tool like k8s
What can you use instead of k8s for this kind of scenario? (an ultra reliable setup which doesn't need a whole cluster)
It seems to me that people tend to vastly overestimate their uptime requirements. "Real time 'web application'" used by hundreds of millions of people can be down for hours and yet succeed wildly, just look at Twitter, both its old failwhale and new post-Musk fragile state. Complexity, on the other hand, and thus lower iteration speed and higher fixed costs can kill a business much easier than a few seconds of downtime here and there.
You don't need an "ultra reliable setup" or even a "cluster". You can have one nginx as a load balancer pointing at your unicorn/gunicorn/go thing, it's very unlikely to ever go down. You can run a cronjob with pgdump and rsync, in an off chance your server dies irrecoverably corrupting the DB (which is really unlikely for Postgres), chances are your business will survive fifteen minutes old database.
Most "realtime web applications" are not aerospace, even though we like to pretend that's what we work on. It's an interesting confluence of engineering hubris and managerial FOMO that got us here.
> It seems to me that people tend to vastly overestimate their uptime requirements. "Real time 'web application'" used by hundreds of millions of people can be down for hours and yet succeed wildly
That may be true for social media apps where the Terms of Service don't include any SLAs/SLOs, but if you're a SaaS company of any kind, the agreements with clients often include uptime requirements. Their engineers will often consider some form of "x number of nines" industry standard.
In the projects I work on, things go down all the time, for various reasons (hardware issues, networking problems, cascading programming errors). It's the various additional measures we have put in place which prevent us from having frequent outages... Before the current system was adopted, poor stability of our platform was one of the main complaints.
I agree that for many projects it may be an overkill.
Networking issues and even hardware issues are very unlikely if you can fit everything into one box, and you can get a lot in one box nowadays (TB+ RAM, 128+ core servers are now commodity). MTBF on servers is on the order of years, so hardware failure is genuinely rare until you get too many servers into one distributed system. And even then, two identical boxes (instead of binpacking into a cluster, increasing failure probability) go a very long way.
It's a vicious circle. We build distributed multi-node systems, overlay software-configured networks, self-healing clusters, separate distributed control planes, split everything into microservices, but it all makes systems more fragile unless enough effort is spent on supporting all that infrastructure. Google might not have a choice to scale vertically, but the overwhelming majority of companies do. Hell, even StackOverflow still scales vertically after all these years! I know startups with no customers who use more servers than StackOverflow does.
If there's a bug that brings the server down, it will happen in all instances and repeatedly, no matter how many times you restart. Especially when the users keep repeating the action that triggered the crash.
Re: Entropy. Entropy increases with complex setup. The whole point of not having a complex setup is to reduce entropy and make the system as whole more predictable.
Re: caches. There are two types of caches: indices that are persisted with the database, and LRU caches in memory. LRU caches are always built on demand so this is not even a problem.
Plus modern CPUs are incredibly fast and can process several GBs of data per second. Even in the worst cases, you should be able to rebuild all your caches in a second.
>If there's a bug that brings the server down, it will happen in all instances and repeatedly no matter how many times you restart.
Not necessarily so. Many bugs are pretty rare bugs which are triggered only under specific conditions (a user, or the system, must do X, Y, Z at the right moment). So it doesn't happen all the time. But when it happens, the whole server crashes or starts behaving in a funky way and other users are affected. Sure, you may say if it's a rare bug, then users will be rarely affected. But we don't have a single bug like that; there are always N such bugs lurking around (we never know how many of them in a large application); multiply it by N bugs and you have server crashes for different reasons quite often, making your paying customers dissatisfied. It also assumes you can fix such a bug immediately, while that's not always true; there are often Heisenbugs that take weeks to root out and fix, while your customers are affected (sure, the application will restart, but ALL users (not just the one who triggered the bug) can lose work and get random errors when the app is not available -- not a good experience). So having several app instances for backup allows us to soften such blows, because there will always be at least one app instance which is available.
>Entropy increases with complex setup. The whole point of not having a complex setup is to reduce entropy and make the system as whole more predictable
I agree that entropy increases with a complex setup, but there's also base entropy which accumulates simply because of time (which I think is more dangerous). Make a sufficient number of changes to the setup of your application (which you often need if you release often) and eventually someone or something somewhere will make a mistake or expose a bug somewhere, and you will need to repair it, and you won't be able to do it easily because your setup is not containerized, which would allow you to return to a clean state quite easily with no effort. We've had issues like that with our non-containerized deployments and it's a very complex and error-prone undertaking to do it flawlessly (no downtime or regressions) compared to containerized deployments.
>Plus modern CPUs are incredibly fast and can process several GBs of data per second. Even in the worst cases, you should be able to rebuild all your caches in a second
Hm, usually caches are placed in front of disk-based DB's to speed up I/O, i.e. it's not a matter of slow CPU's, it's a matter of slow I/O. Rebuilding everything which is in the caches from DB sources is not super fast.
> and you will need to repair it and you won't be able do it easily because your setup is not containerized which would allow to return to the clean state quite easily with no effort.
Automated deployment including server bringup is orthogonal to using containers or hot failover. For example at $WORK we're deploying Unreal applications to bare metal windows machines without using containers because windows containers aren't as frictionless as linux ones and the required GPU access complicates things further.
Upfront customer requirements often say they want >99.5% uptime (which allows for 3.5h downtime a month anyway) or some such. In practice B2B customers often don't care much if hour-long downtimes happen every week during off-hours. Sometimes they're even ok when it gets taken down over a whole weekend.
Things serving the general public have different requirements but even they have their activity dips during the late night where business impact of maintenance is much lower.
> 2) entropy *always* increases (RAM usage, memory corruption, hardware issues, OS misconfiguration etc.) so eventually the application will break and stop serving traffic until it's repaired/restarted (which can take time if it's a hardware issue)
This is not what entropy means. Even if you constrain it to hardware, there is no reason to think that this will happen eventually, unless your timeline is significantly long.
What text search will provide me with the same features as Elasticsearch? Index time analysis, stemming, synonyms; search time expansions, prefix matching, filtering and (as a separate feature) type ahead autocomplete?
I would love to never touch another Elasticsearch cluster so this is a genuine question.
Lucene is the Java library that ES is based on. Without even having to look at it I can make the following judgement:
It should be easy to port to any language.
It's open source, and it's Java. Java has no special features that makes it impossible or particularly difficult to replicate this functionality in any other compiled language, like C, Rust, Go, or any other language that is not 100x wasteful of system resources.
Based on, but Elasticsearch is not just a server wrapped around the library. Features ES has are not in Lucene, otherwise anyone could release a competitor by wrapping the library.
> It should be easy to port to any language.
You win the "Most Hacker News comment of March 2023" award. This thread is talking about less effort, and you bring up porting Lucene to another programming language.
> Based on, but Elasticsearch is not just a server wrapped around the library. Features ES has are not in Lucene, otherwise anyone could release a competitor by wrapping the library.
Go is not less wasteful than Java; both are garbage collected and their memory pressure depends highly on the given workload and the runtime of the program. But Java allows more GC tuning and even different GCs for different use cases (i.e.: Shenandoah and ZGC favor very low latency workloads, while the default G1GC favors throughput (not that simple, but you get the point))
Regardless, Java/Go tier of performance is good enough for this kind of thing.
Problem is it doesn't support HA. You're stuck on that single server model. Upgrades always = downtime = painful. You're also missing things like self-healing, and your Lucene index can get corrupted.
Real world experience says better to move away from it e.g. lots of self-hosted Atlassian instances over the years. Lucene was a major pain point.
Thanks for the reminder. Manticoresearch is an alternative I haven't tried yet. I tried the hip alternatives (Meilisearch, Typesense) in autumn 2022 and both were severely lacking for CRM workloads compared with ES.
You can always put an LRU cache between you and SQLite.
I personally moved from SQLite to a B-Tree based key-value store, and most requests can be serviced in ~500us (that is microseconds). I don't mean a request to the database: I mean a request from the client side that queries the database and fetches multiple objects from it. In my SQLite setup, the same kind of query would take 10ms (that is 20x the time) even _with_ accelerator structures that minimize hits to the DB.
But you can always scale up vertically. You can pay $240/mo for 8 vCPUs with 32GB of RAM. Much cheaper than you would pay for an elastic cloud cluster.
500us is slow. This kind of performance does not remotely obsolete an LRU cache (main memory access is ~5000X faster).
500us is essentially intra-datacenter latency. Obviously your data is in memory on the B-Tree server as there is no room in this budget for disk IO. Postgres will perform just as well if data is in memory hitting a hash index (even B-Tree probably). I don't think the B-Tree key-value store you mention is adding much. Use Redis or even just Postgres.
When you say text indexing and serving http are library functions, what do you mean? Also, is the language here go or what? Since you said python is too slow and then necessitates all the infra to manage it.
Go or any language that actually gets compiled down to machine code to get executed directly on the hardware, and where libraries are compiled into the final product.
When I say something is a library function, I mean you just compile it into your code. In your code, you just call the function.
This is in contrast to the current de facto practice of making an http request to ask another program (potentially on a different machine) to do the work.
Sometimes I think, maybe our complex cluster which runs PHP software (load balancer, app instances, cache etc.) can be replaced with a single performant machine running something like Rust
It can. You don't even need to go all the way to Rust. I'm doing it with Go, which has a GC and a runtime. A single executable on a single machine can handle millions of users per month.
Each of these "flip flops" probably lasted a good deal longer than the median 20+ person startup, so that seems pretty facile. But the parallel with CoffeeScript seems valid --- people on message boards are really not OK with nonstandard languages, and are never less happy than when a company they've heard something about does actual computer science of any kind. See, for instance, Fog Creek and Wasabi.
Skimming the thread here, it seems like there's some confusion about the goals:
* They've decided to move from EKS to on-prem largely because of cost. That's logical: almost by definition, it costs more to run workloads on cloud machines than on your own hardware. You can't address that problem by moving from EKS back to ECS, like one commenter suggested.
* They've decided to move from K8s to mrsk, a system they developed. They're fuzzier about why they did that, but the two fairly clear claims they made: (1) their deployments under K8s are a lot more complicated, and (2) they slashed their deploy times (because a great deal of their infra is now statically defined).
I feel like there's more productive debating to be done about K8s vs mrsk than there is about EKS vs. mrsk. By all means, make the case that applications like Tadalist are best run on K8s rather than a set of conventions around bare-metal container/VMs (which is all mrsk is).
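For anyone who hasn't looked at mrsk: the whole "set of conventions" is roughly a single YAML file like the sketch below (paraphrased from the project's README at the time; exact keys may have shifted since), plus commands that build, push, and swap containers over SSH:

service: myapp                   # illustrative values throughout
image: myorg/myapp
servers:
  - 192.168.0.1
  - 192.168.0.2
registry:
  username: myorg
  password:
    - MRSK_REGISTRY_PASSWORD     # read from the environment
env:
  secret:
    - RAILS_MASTER_KEY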
Yeah, I would love to hear more about why they decided not to go with on prem k8s... the other arguments made logical sense to me, but they don't explain the reasoning for mrsk very well.
Every company that I have been at that uses k8s at scale ends up having an internal team to manage the complexities and build internal tooling to make it work. It sounds like they left behind a lot of the cruft and just built a tool that does the one thing most people want: put a container on a VM and call it good.
That's the thing. On-prem K8s doesn't mean deploying a vanilla Kubernetes using instructions from kubernetes.io. There is an entire industry of proprietary solutions for running Kubernetes on-prem. RedHat Openshift, Rancher, Pivotal PKS, VMWare Tanzu come to mind.
I don't know when they decided to do that transition, but back when I tried Rancher a few years ago (when they were transitioning from Rancher 1.x to 2.x) it was a real bug festival. I think the only robust solution at the time was OpenShift, which was, well, k8s without being vanilla k8s.
Also, most tools that were built to manage k8s clusters were nice for deploying a new cluster, not so much for upgrading one, so you would have to create new clusters every time you wanted an upgrade. That can scale when clusters and blast radius are small, but can be complicated when it involves contributions from n teams.
For this reason, when we were managing our own k8s cluster on-prem, we were using kubespray, which worked, but upgrades were a multi-hour affair.
That's a real good point you mentioned: k8s ecosystem is super young.
And so so much changed in the last 4 years.
But at least for me, the 'easy to use' threshold happened somewhere around 2-3 years ago.
And Gardener, for example, upgrades quite well.
Rke2 is quite stable for me but rancher integration is still not perfect.
But even doing k8s by hand with Ansible was already doable 3 years ago. That's how I started, and I had it up and running. I switched to rke2 because I realized that this would not be sustainable / was not worth doing myself at this level.
I haven't used k8s in quite a few years, what would you recommend I look at these days to get a good overview and understand all the different pieces in the ecosystem?
> By all means, make the case that applications like Tadalist are best run on K8s rather than a set of conventions around bare-metal container/VMs (which is all mrsk is).
Okay sure I'll bite. An application like Tadalist is best run on k8s.
With any application regardless of how it runs, you generally want at least:
- zero-downtime deploys
- health checks
- load balancing
Google's GKE is like $75/mo, and the free tier is one cluster, which is enough. For nodes, pick something reasonable. We're naive so we pick us-west1 and a cheap SKU with 2 vCPUs 8 GB is ~$30/mo after discounts. We're scrappy so we eschew multiple zones (it's not like the nearby colo is any better) so let's grab two of these at most. Now we're in $60/mo. We could go cheaper if we want.
We've click-opsed our way here in about 25 minutes. The cluster is up and ready for action.
I write a Dockerfile, push my container, install k3d locally, write about 200 lines of painstaking YAML that I mostly cribbed off of stack overflow, and slam that through kubectl into my local cluster. Everything is coming up roses, so I kubectl apply to my GKE cluster. My app is now live and I crack open a beer. Today was a good day.
Later, whilst inebriated from celebration, I make some changes to the app and deploy live because I'm a cowboy. Oops, the app fails to start but that's okay, the deployment rolls back. No one notices.
The next day my app hits the front page of HN and falls over. I edit my YAML and change a 2 to a 10 and everything is good again.
Things I did not need to care about:
- permissions (I just used the defaults and granted everything via clickops)
- ssh keys (what's ssh?)
- Linux distributions or VM images (the Goog takes care of everything for me, I sleep better knowing I'll wake up to patched OS)
- passwords
- networking, VIPs, top of rack switches, hosting contracts, Dell, racking and stacking, parking, using my phone
And all without breaking the bank.
---
Okay so I cheated, you weren't looking for a GKE vs on-prem/Colo case. You asked
> make the case that applications like Tadalist are best run on K8s rather than a set of conventions around bare-metal container/VMs
to which I say: that's all kubernetes is.
Did you even read their blog post? virtio? F5? MySQL replication?? How is this a good use of engineering time? How is this cost efficient? On what planet is running your own metal a good use of time or money or any other godforsaken resource. They're not even 40 people for crying out loud. It's not like they're, say, Fly.io and trying to host arbitrary workloads for customers. They're literally serving rails apps.
Want to start small with k8s? Throw k3s or k3d on a budget VPS of your choosing. Be surprised when you can serve production traffic on a $20 Kubernetes cluster.
If you care about Linux distributions, and care about networking, and care about database replication, and care about KVM, and care about aggregating syslogs, and love to react to CVEs and patch things, and if it's a good use of your time, then sure do what 37signals did here. But I'm not sure what that planet is. It's certainly not the one I live on today. 10-15 years ago? Sure. But no longer.
I can't believe just how ridiculous this entire article is. I want to find quotes to cherry pick but the entire thing is lunacy. You can do so so much on a cloud provider before approaching the cost of even a single 48U in a managed space.
At some scale it makes sense, but not their scale. If I never have to deal with iDRAC again it'll be too soon.
You have a horse in this race: apps like Tadalist are best run on something like Fly or knative/Cloud Run or Heroku rest in peace. But a set of conventions around bare-metal containers/VMs? Give me a break.
I don't think you intended it, but I find it disingenuous to separate cloud hosting and kubernetes. The two are connected. The entire premise is that it should be a set of portable conventions. I can run things on my laptop or desktop or raspberry pi or $10/mo budget VPS or GCP or AWS or Azure or Linode or, god willing, a bunch of bare-metal in a colo. It's fundamentally a powerful abstraction. In isolation it makes little sense, which TFA handily demonstrates. If you eschew the conventions, it's not like the problems go away. You just have to solve them yourself. This is all just NIH syndrome, clear as day.
Forgive the long winded rant, it's been a long day.
Agree, I would never want to go back to the old bad days of managing a real rack at a datacenter, with exactly the same guarantees of a single region deployment inside any cloud.
BUT it is true that all the multi region/AZ guarantees + logs + dashboards + network services @ AWS costs tend to skyrocket in a couple of years.
And here is where k8s really shines, in my opinion: allowing you to abstract your deployment away from a cloud even on cheap hosting.
All the rest outlined in the article is just reinventing the wheel.
Usually the engineers that manage the racked stuff in a datacenter aren't the same ones that deploy the apps.
Last time I was working on-prem, we would just buy a new 2U hypervisor server once in a while. Apps were all running on VMs anyway, so the complexity was not seen by the same people. Storage was a multi-year deal. The biggest issue was storage estimation and paying from day 1 for storage that would only be fully used in year 5. But I don't think it was that expensive, just accounting gymnastics compared to a pay-as-you-go system. And hyperconvergence was kind of meant to solve that, although I didn't really have the chance to experiment with it in virtualized environments on-prem.
Who's gonna do the rearchitecting work? Are you hiring a whole new team or do you not need to keep the lights on while you're transitioning? Depending on the complexity of your application that rearchitecting is gonna eat up a ton of your cost savings.
> almost by definition, it costs more to run workloads on cloud machines than on your own hardware.
Why should that be so? I'd expect the all-in cost of a cloud machine to be less than my own hardware, for the same reason that buying electricity from the grid is usually cheaper than generating it on-prem.
> You can't address that problem by moving from EKS back to ECS, like one commenter suggested.
If EKS is more expensive (because it's something they see as a value-add) whereas ECS is a commodity service at commodity prices, then moving there could well solve the cost issue.
Wouldn't the cost of cooking be higher depending on who you are? If one could spend those few hours doing something with a higher ROI than the money saved over pre-cooked food, then you are actually losing money by cooking your own food.
Beyond a certain scale, sure. But at small scale, you can completely avoid hiring an ops team, or hire a much smaller one, which can more than offset the cloud provider price premium.
My current company works in a niche market with a smallish number of large customers, so our scaling needs are modest. Our total AWS bill is about a third the annual salary of a single ops person.
There's gotta be a very long tail of companies like mine for whom outsourcing to cloud vendors is cheaper than self-hosting.
Depends on the industry and the barrier to entry. If you're in one with a lot of compliance overhead, you're outsourcing a lot more than compute and storage to your cloud provider. Hiring in-house in that same case is extremely expensive unless you are over a certain size.
This article seems written by someone who gets excited by shiny objects / hype trains.
> Why should that be so? I'd expect the all-in cost of a cloud machine to be less than my own hardware
Because cloud hardware doesn't come with all the burdens of physically managing a real server: replacing SSDs, upgrading RAM, logging into an iDRAC to restart a crashed server. None of those things exist in the cloud, and they make you lose so much operational time. That's why cloud will ALWAYS cost more than bare metal. The downside is that with cloud you keep paying for the same servers: there are no assets anymore, only costs.
Not to mention keeping spare parts around for when something breaks, or having to drive out to the DC to fix/replace the thing that broke or won't restart. Hell, even something "simple" like managing the warranties for the gear you have is no fun at all. People tend to forget all those little things when espousing the evils of the cloud, but I'm here to tell you that they all add up and they are all a major pain in the butt. Cloud gets rid of all that.
There are also discussions around CapEx versus OpEx that apply here, and depreciating costs over time. There is a trade-off of agility, cost, and maintenance, but the markup on cloud is quite high.
The major determinant in hosting cost isn't power, it's the cost of the hardware. But I mean, even if you don't buy my axiomatic derivation, you can just work this out from AWS and GCP pricing.
I always saw it being close to 7:3, non-recurring hardware cost to monthly recurring facilities & power, on a 3-year depreciation in major markets (rough back-of-the-envelope below).
That said, all of the big cloud providers SHOULD have a structural advantage on all of those dimensions. None of the small players or self-hosting shops are doing the volume, much less the original R&D, of the big cloud providers. The size of that discount, and how costly it really is to achieve, is another topic.
Disclosure: principal at AWS. Above information is my personal opinion based on general experience of 20 years in the industry doing networking, compute farms, and operations.
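For concreteness, a rough back-of-the-envelope for that 7:3 split in Python (every number here is invented purely for illustration; only the ratio is the point):

    # One-time hardware cost amortized over 3 years vs. monthly recurring
    # facilities & power, per server. All figures are hypothetical.
    server_capex = 14_000        # USD, one-time per server (made up)
    amortization_months = 36     # 3-year depreciation, as in the comment above
    facilities_per_month = 170   # USD/month for space, power, cooling (made up)

    hw_monthly = server_capex / amortization_months
    total = hw_monthly + facilities_per_month
    print(f"hardware share:   {hw_monthly / total:.0%}")            # roughly 70%
    print(f"facilities share: {facilities_per_month / total:.0%}")  # roughly 30%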
Even if [0] cloud does have a structural advantage, it's clear that cloud vendors aren't willing to pass it on to customers, and they tend to nickel-and-dime on other necessities like the infamous bandwidth costs.
[0] I'm really curious how big, if any, the structural advantage of a large cloud vendor is over a small-time colo user, because surely cloud comes with all kinds of overhead? All the fancy features AWS provides can't be free. If a customer doesn't care for those, would colo, or a small "vps" vendor, actually have a structural advantage over AWS?
The comments in this thread are quite eye-opening.
It really shows what a sacred cow k8s and cloud have become.
I’m not much of an ops person so I’m not qualified to comment on what 37 signals has created. But I will say I’m glad to see honest discussion around the costs of choosing k8s for everything even if it has significant mindshare.
Perhaps this is the endgame of resume-driven development: cargo culted complexity and everyone using the same tech for similar-ish problems and then wondering why it’s so hard to stand out from both a product and an employee perspective.
Some people are really good at writing software, other people are really good at running systems. k8s/cloud allowed the former to pretend to be good at the latter.
k8s is misunderstood. Everyone focuses on the complexity/over-engineering/etc arguments when those really don't matter in the grand scheme of things.
It's not about any of that, it's about having a consistent API and deployment target that properly delineates responsibilities.
The value of that then depends on how many things you are running and how many stakeholders you have taking part in that dance. If the answers to both of those are small, then k8s's value is small; if the answer to either of those is high, then the value is high.
i.e. k8s is about organisational value; its technical merits are mostly secondary.
The "it's too complex" argument usually reflects more on the commenter than on kubernetes itself. It's actually one of the most very straight forward and thoughtfully designed platforms I've ever worked with.
What I've found in my experience is that applications in general are complex -- more complex than people assume -- but the imperative style of provisioning seems to hide it away, and not in a good way. The inherent complexity hides behind layers of iterative, mutating actions where any one step seems "simple", but the whole increasingly gets lost in the entropic background, and in the end the system gets more and more difficult to _actually_ understand and reproduce.
Tools like ansible and terraform and kubernetes have been attempts to get towards more definition, better consistency, _away_ from the imperative. Even though an individual step under the hood may be imperative, the goal is always toward eventual consistency, which, really, only kubernetes truly achieves. By contrast, MRSK feels like it's subtly turning that arrow back around in the wrong direction.
I'm sure it was fun to build, but one could have spent 1% of that time getting to understand the "complexity" of kubernetes - which, by the way, quickly disappears once it's understood. Understandably, though, that would feel like a defeat to someone who truly enjoys building new systems from scratch (and we need those people).
You've hit the nail on the head. Ten thousand simple, bespoke, hand-crafted tools have the same complexity as one tool with ten thousand facets. The real velocity gained is that this one tool with ten thousand facets is mass produced, and in use widely, with a large set of diverse users.
I don't know a single person who's been responsible for infra-as-code in chef/terraform/ansible who isn't more or less in love with Kubernetes (once they get over the learning curve). Everyone who says "it's too complex" bears a striking resemblance to those developers who happily throw code over the wall into production, where it's someone else's issue.
> Understandably, though, that would feel like a defeat to someone who truly enjoys building new systems from scratch (and we need those people).
Exactly. Building new systems from scratch is tons of fun! It's just not necessarily the right business move, unless the goal was to get the front-page of HN, that is.
I've been using Nomad for about 5 months now, and couldn't disagree more. K8s is better documented, with far less glue, and far more new-hire developers are familiar with K8s compared to Nomad. Nomad-autoscaler alone is becoming a decent reason not to use Nomad. The number of abandoned issues on the various githubs is another. That Vault is a first-class citizen of K8s and a red-headed-stepchild of Nomad is another.
I do agree about Helm tho, I avoid it as much as possible.
I hate kubernetes as much as anyone, but building your own container orchestration platform so that you can deploy a handful of CRUD webapps sounds a lot more like resume-driven development than using a well-known and standard (if somewhat overengineered) solution.
I don't think the authors care about their resumes at this point. There are rational reasons to use a static scheduling regime and a set of conventions around deployment and support services rather than a dynamic scheduler. If it were me, I'd build this with Nomad, but I can imagine not wanting to manage a dynamic scheduler when your workloads are as predictable as theirs are --- you make their case for them when you point out that they just have a "handful of CRUD apps".
What is there really? There is docker swarm, which doesn't seem to be really further developed, and... what else?
This whole space seems to be neglected since cloud providers are trying to sell k8s to big-company "devops" guys, but old-school sysadmins don't even know what docker is. Any development in this area is very welcome.
> Perhaps this is the endgame of resume-driven development: cargo culted complexity and everyone using the same tech for similar-ish problems and then wondering why it’s so hard to stand out from both a product and an employee perspective.
Spot on. Tech is a fashion industry and most people just follow trends. I still sometimes wonder if people are playing the elaborate long-term resume-optimisation game, or if they don't value simplicity highly enough to optimise for it, because the downsides are externalised.
k8s folks get paid big money to keep it running. Not surprised by the comments here at all. As the saying goes, "in complexity, there is opportunity." and the k8s devops team is milking it hard.
Only one sentence about why they chose to abandon K8s:
> It all sounded like a win-win situation, but [on-prem kubernetes] turned out to be a very expensive and operationally complicated idea, so we had to get back to the drawing board pretty soon.
It was very expensive and operationally complicated to self-host k8s, so they decided to build their own orchestration tooling? The fact that this bit isn't even remotely fleshed out sort of undercuts their main argument.
We are talking about 37Signals here. This is the company that, when faced with the problem of making a shared to-do list application, created Ruby on Rails. And when they decided to write up their remote working policy, published a New York Times bestselling business book.
This is not a company that merely shaves its Yaks. It offers a full menu of Yak barber services, and then launches a line of successful Yak grooming products.
The article seems to provide evidence for the claim that a dispute within the company over the messaging from leadership led to 1/3 of the staff leaving. I provided it without comment.
Do you believe that a significant proportion of the staff did not quit? Do you have an alternative source that provides evidence for that version of events?
announced their intention to leave... to the company... in response to the company making an open offer to people of terms for them to leave.
That seems like a slightly different prior, in terms of our Bayesian assessment of the probability that those people remained employed at the company afterwards, than your hypothetical engagement to Ms Johannsen.
So strange to white-knight a company and attempt to deny something that happened pretty publicly...
> As a result of the recent changes at Basecamp, today is my last day at the company. I joined over 15 years ago as a junior programmer and I’ve been involved with nearly every product launch there since 2006.
> So strange to white-knight a company and attempt to deny something that happened pretty publicly...
It was just skepticism from seeing these sorts of claims over the years. Half of Hollywood would be in Canada if people really followed up on those. At some point it became acceptable to make these sorts of claims with no intention of following up.
I guess quitting your job in the hottest tech market of all time is a little different than moving to a different country.
> Last week was terrible. We started with policy changes that felt simple, reasonable, and principled, and it blew things up internally in ways we never anticipated. David and I completely own the consequences, and we're sorry. We have a lot to learn and reflect on, and we will. The new policies stand, but we have some refining and clarifying to do.
They seem to have lost their touch though. I think they peaked with Remote.
After typing that I found that they renamed from Basecamp Inc. back to 37signals and their website is trying to hearken to their past. https://en.wikipedia.org/wiki/37signals
You could just look this up. They renamed to Basecamp because they decided to be a single-product company (at the same time, they divested Highrise and Campfire). Six years later, they launched HEY, their email product, so "Basecamp" stopped making sense as a name. They wrote a post about this last year.
later
I added "six years later", but I don't think it changes the meaning of what I wrote originally.
Sometimes there's value in building bespoke solutions. If you don't need many of the features of the off-the-shelf solution, and find the complexity overwhelming and the knowledge and operational costs too high, then building a purpose-built solution to fit your use case exactly can be very beneficial.
You do need lots of expertise and relatively simple applications to replace something like k8s, but 37signals seems up to the task, and judging by the article, they picked their least critical apps to start with. It sounds like a success story so far. Kudos to them for releasing MRSK, it definitely looks interesting.
As a side note, I've become disgruntled at k8s becoming the defacto standard for deploying services at scale. We need different approaches to container orchestration, that do things differently (perhaps even rethinking containers!), and focus on simplicity and usability instead of just hyper scalability, which many projects don't need.
I was a fan of Docker Swarm for a long time, and still use it at home, but I wouldn't dare recommend it professionally anymore. Especially with the current way Docker Inc. is managed.
I think people overindex on thinking that Kubernetes is about scalability.
Honestly, its inbuilt horizontal scaling systems are pretty lacking. Scaling is not actually K8s's strong suit - sure, you can make it scale, but that takes effort and customization.
But what K8s, at base, is actually useful for is availability.
You tell K8s how many instances of a thing to run; it runs them; if any of them stop running, it detects that and tries to fix it.
When you want to deploy a new version, it replaces the old instances with new ones, while ensuring traffic still gets served.
And it does all of this over a substrate of shared underlying server nodes, in such a way that if any of those servers goes down, it will redistribute workloads to compensate.
All of that is useful even if you don't care about scale.
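To make that concrete, here's a toy Python sketch of the control-loop idea described above. It is not Kubernetes code (a real controller watches the API server and drives a container runtime), just the shape of the logic: compare desired state to observed state and converge.

    import time

    desired = {"web": 3, "worker": 2}   # what we asked for
    running = {"web": 1, "worker": 2}   # what the "cluster" currently has (simulated)

    def reconcile_once() -> None:
        for app, want in desired.items():
            have = running.get(app, 0)
            if have < want:
                print(f"{app}: {have}/{want} running, starting {want - have}")
                running[app] = want      # stand-in for launching containers
            elif have > want:
                print(f"{app}: {have}/{want} running, stopping {have - want}")
                running[app] = want      # stand-in for stopping containers

    if __name__ == "__main__":
        for _ in range(3):               # a real control loop runs forever
            reconcile_once()
            time.sleep(1)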
> simplicity and usability instead of just hyper scalability
This is such a key phrase here.
If I'm starting a small SaaS company tomorrow, my ideal for setting up infrastructure would be a stack which can for now look similar to what this article sets up (especially with the tremendously lower bills), but with an easy migration path to k8s, should I hit the jackpot and have that 'very good problem to have' of too many customer requests to handle.
My big issue with k8s, and honestly with other big fancy toolsets, is that getting started with it requires you to choose between:
- Hire several seasoned cloud orchestration experts, preferably with the platform you've chosen (AWS, GCP, Azure) who will know how to troubleshoot this beast when you have a mysterious issue, or:
- YOLO it! Just follow the basic tutorials to set k8s up, and hope you don't end up sitting up all night with a site that's refusing connections while your customers flee.
The first one is the only responsible choice but it's going to add another half million to your cash burn, and that's on top of the high-margin "managed" service cloud bills like RDS.
So I can see why people are drawn to a system where instead of paying for k8s and "Postgres in a box" they can pay for a simple server and have simple tooling to deploy, back up, etc.
That's not a great comparison, but it works in a sense. Not all languages and applications benefit from pointers.
The issue is not about k8s being hard. Yes, it has a steep learning curve, but many technologies do. The issue is that learning all of its intricacies, and maintaining it and the services that run on top of it, requires valuable resources many companies don't have, especially early on. And the benefits it provides are, for the most part, not needed early on in a project's lifecycle, and often never. In financial terms, it's very high risk, with low ROI.
If there's a solution that lowers the investment and maintenance costs, while being valuable in the short and long term, then that's generally a more favorable solution for most projects that don't operate at Google's scale.
There is the learning curve, which can be challenging for organizations that aren’t experienced or exposed to scale and performance expectations. When a company moves away from being insular & proprietary to using open source there is a period of churn that ripples through the deployment, implementation and day to day operations aspect of products that live either on customer premises or a cloud platform new to everyone.
There, what YOU know from experience and have evolved and worked through is unknown—because it is all new. And “training” (such as it is) is left as an exercise for each individual.
I’d expect that is the norm for the traditional non startup firms, globally.
There is a very big difference from being a user of K8s and being someone maintaining a K8s cluster.
If you are a user of K8s, then yeah, deploying apps is pretty simple most of the time.
Maintaining a K8s cluster on the other hand becomes very complex and hard the moment you have use cases that are a few steps off the happy path. The K8s documentation is not sufficient for operating a K8s cluster on your own hardware, you end up having to go spelunking in the code to see how things work (this is from experience).
Pointers are hard though, for the average programmer as is memory management.
When you transition an IT team or a customer facing product support team to DevOps, most everything appears complex if the implementation has been done by engineers new to DevOps and cloud itself. Engineers with zero background in scale out or performance for larger customers. It is a cultural/experience change that faces issues at actual deployment time.
I'm happy with my usage of k8s, but I think it's unfortunate that current container abstractions are so oriented around imperative assembly in "layers". I want a way to run NixOS in a container and have it feel first class— existing approaches either require installing everything every time with no caching, or pre-building and manually lifecycling your container (streamLayeredImage), or knowing upfront what you're going to need (Nixery).
> Especially with the current way Docker Inc. is managed.
I was reviewing GCP's Free Tier today, they have the same approach, if they need to change or drop services they agree to give 30 days notice, same as Docker did. It's probably common for other cloud companies offering free stuff as well. All the negative attention Docker received was fully and wholly undeserved.
> I was reviewing GCP's Free Tier today, they have the same approach
Google is notoriously bad about this and gets negative attention from it, so the comparison isn't favorable, and the publicity is still wholly deserved.
>> I was reviewing GCP…
> Google is notoriously bad at this…
Do you mean Google or GCP? We don't see complaints about AWS because Amazon closes Dash buttons or Spark, and Azure isn't seen in any worse light because Microsoft discontinues Skype and whatnot.
Can we name one remotely popular service of GCP that has been shut down at all?
I can't think of a single incident where GCP actually dropped a free tier; I've actually seen new free tier stuff added since the last time I looked. If you can provide some links reflecting your view that I've somehow missed along the way, it would be interesting to compare.
Until then, I maintain the Docker publicity is undeserved and if I had to guess, was brought on by podman astroturfers who have been polluting the web the past 2 years claiming how great podman is.
Yeah I'm a bit surprised to hear that. I had only heard a lot of teams giving up swarm when it was deprecated. Didn't know they just restructured the project.
> It was very expensive and operationally complicated to self-host k8s, so they decided to build their own orchestration tooling?
You are deeply misunderstanding Kubernetes if you think it's some sort of a turnkey solution that solves all your infrastructure problems. Virtually everything of value in Kubernetes isn't Kubernetes -- you have to add it on later, and manage it yourself. Container runtime? -- that's not Kubernetes. Database to store deployment info? -- that's not Kubernetes. Network setup and management? -- that's not Kubernetes. Storage setup and management? -- still not Kubernetes.
When you start using Kubernetes for real, you will end up replacing almost every component it brings by default with something else. CoreDNS? -- sucks for big systems. Volumes? You aren't going to be using volumes from local filesystem... that's insane! You'll probably set up Ceph or something like that and add some operators to help you use it. Permission management? -- Well, you are out of luck in a major way here... you have, basically, Kyverno, but it really, really sucks (and it's still not Kubernetes!).
Real-life Kubernetes deployments end up being very poorly stitched together piles of different components. So much so that you start wishing you'd never touched that thing because a huge fraction of the stuff you now need to manage is integration with Kubernetes on top of the functionality provided by these components.
> You are deeply misunderstanding Kubernetes if you think it's some sort of a turnkey solution that solves all your infrastructure problems. Virtually everything of value in Kubernetes isn't Kubernetes -- you have to add it on later, and manage it yourself. Container runtime? -- that's not Kubernetes. Database to store deployment info? -- that's not Kubernetes. Network setup and management? -- that's not Kubernetes. Storage setup and management? -- still not Kubernetes.
When you install Kubernetes, you get a container runtime. That's a distribution I guess. Part of this seems like GNU/Linux.
The other stuff you're listing isn't solved by MRSK either...
I don't know, for small scale, K8S rocks: I just fired up Kubespray and had a 20-node cluster up and running in maybe an hour, and CoreDNS hasn't given me any problems so far.
Using local volumes is actually not an insane idea if your stateful services can handle data replication themselves: many modern databases can.
Local volumes don't have a concept of quota. You cannot limit them to X bytes. So, if you give a single service a volume, it might just take the whole disk. Well, technically, it might just take the whole filesystem, which, if you have multiple disks backing a single filesystem, means it'll take all of them.
Obviously, you cannot move local volumes around.
And if you are setting up a database in Kubernetes... oh, you are in such a pit of troubles that dealing with local volumes isn't really even worth mentioning. Surprisingly, your problems don't even start with storage, they start with memory. Databases really like memory, but use it very opportunistically, and scale with load. So, when you configure your database, you tend to give it all the memory you have, but how much it actually uses depends on the load, the kinds of queries, and how well it optimizes them. Since the Kubernetes scheduler doesn't really do well with reservations, you may run into situations where your database OOMs or just slows everything down, or doesn't perform well at all...
Next comes fsync. Unlike many unsophisticated applications, databases don't like losing data. That's why they want to use fsync in some capacity. But this creates problems sharing resources, again, well beyond anything Kubernetes can help with.
Next comes provisioning of high-quality storage for databases... and storage likes to come in the form of a device, not a filesystem, but Kubernetes doesn't know how to deal with devices, so it needs help from CSIs of all sorts to do that, and depending on the technology you choose, you'll have a very immersive journey into the world of hacks and multi-thousand-page protocol descriptions telling you how to connect your storage and Kubernetes.
It might appear, at first glance, that things work well without much intervention, and there's a Helm chart for this or that provider, and it's all at the tips of your fingers... but, as it often is in the world of storage, things get extremely complicated extremely quickly in case of errors. In such situations, Kubernetes will only obscure the problem. Oh, and errors in storage don't usually happen in the next hour or day or even year after you've set it up. It hits you a few years later, once you've accumulated a ton of useful data, you've entirely forgotten how things were set up, and the folks behind Kubernetes have moved on and broken stuff.
---
So, not only do you need small scale, you also need a very short temporal scale: don't expect your Kubernetes cluster to work well after about a year of being deployed. Probably not at all after five years.
But then... if it only works at small scale and for a short time, is it really worth the trouble? I mean, Kubernetes isn't a small thing; it takes away a big constant share of your resources, which it promises to amortize with scale. You are essentially preaching the same idea as Electron-based desktop applications or Docker containers that duplicate the entire Linux user-space plus a bunch of common libraries if you aren't extremely careful with that. Doesn't it become an argument for producing hot garbage as fast as possible, so that someone else who could do a better job won't get a chance to sell their goods because they didn't have time to deliver?
Man, you really like to complicate stuff just to take a dig at K8S.
>> Local volumes don't have a concept of quota. You cannot limit them to X bytes. So, if you give a single service a volume, it might just take the whole disk.
That's why we monitor our server disk for usage.
>> Obviously, you cannot move local volumes around.
Most of the time, this is not a requirement for databases.
>> Since Kubernetes scheduler doesn't really do well with reservations, you may run into situations where your database OOMs or just slows everything down, or doesn't perform well at all
Unless it's a test cluster with constrained resources, no other services will run on database nodes, through the use of taints and tolerations. We can let the database use all the CPU and memory it wants.
>> fsync
Doesn't matter with local volume, since it's just a directory on the host system.
>> Next comes provisioning of high-quality storage for databases... and storage likes to come in the form of a device, not filesystem
We didn't deploy our databases with raw block devices, even before K8S. Using regular filesystems makes everything much simpler and we did not see any performance difference.
>> You are essentially preaching the same idea as Electron-based desktop applications or Docker containers that create a lot of duplication of entire Linux user-space + a bunch of common libraries
Yeah, no. If that's how you read it, be my guest, but don't put words into my mouth.
To be fair that served them well in the past: the reason why anyone knows about 37signals is because they reinvented the wheel back in 2004 with Rails, but what a great reinvention it was. Who knows what can come next.
Which wheel did they reinvent? Rails literally set a bunch of standards used by just about every framework today… app generators, conventions over configuration, asset pipelines, you name it.
Well, as with all homebrewed solutions, you don't know if you are reinventing the wheel until you're done. At first, it always starts with "the current solutions that are available do not fit me, but I still could use them to achieve what I want". There was nothing forcing 37signals back in 2004 to roll their own framework in order to support developing their apps, but they did anyway.
And for every Rails out there, there are thousands of internal frameworks with big ambitions that just turned out to be inferior to what's already available. You just can't know it when you start developing. It takes a bit of ego and ambition to go that path, but sometimes it pays off. And my guess is that if it paid off in the past, you're more likely to try it again.
I think what they wanted to convey was not the redundancy of 'reinventing the wheel', but the ambitious scope and from-scratch approach associated with the phrase.
Maybe 'rolled their own' or 'first invented the universe' would have been slightly better.
I dunno. I was a kid when Rails took over the world, so I couldn't even begin to tell you why it succeeded in the way that it did.
But I do feel like they probably know what they're doing enough to have a more modest version of success with this other project, i.e., meeting their own needs well without burning up too much money or time. They're still a really small, focused company, and they have a lot of relevant experience.
Well to be fair Kubernetes doesn't always pluralize the names of collections, since you can run "kubectl get deployment/myapp". You don't want to do the equivalent of "select * from user" do you? That doesn't make any sense!!! And don't translate that to "get all the records from the user table"! That's "get all the records from the users table". (Rails defaults to plural, Django to singular for table names. Not sure about the equivalent for Kubernetes but in the CLI surprisingly you can use either)
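For what it's worth, the Django side of that claim looks like this (a minimal sketch; the app and model names are made up). A model named User in an app called "accounts" gets the singular table name "accounts_user" by default, and you can opt into a Rails-style plural name explicitly:

    from django.db import models

    class User(models.Model):
        email = models.EmailField(unique=True)

        class Meta:
            # Default table name would be "accounts_user" (singular).
            # Uncomment to use a Rails-style plural name instead:
            # db_table = "users"
            pass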
To be fair, the article says that they built the bulk of the tool and did the first migration in a 6-week cycle. mrsk looks fairly straightforward, and feels like Capistrano but for containers. The first commit of mrsk is only on January 7th of this year.
> In less than a six-week cycle, we built those operational foundations, shaped mrsk to its functional form and had Tadalist running in production on our own hardware.
They spent a month and a half building tooling _capable of handling their smallest application_, representing an extremely tiny fraction of their cloud usage.
k8s is an industry standard currently, but it is not great. The lack of free/open tooling to set up and manage the cluster properly seems to indicate that it is also a way of selling cloud: if you want to use k8s, you have to go with the large cloud providers, otherwise your life will be painful.
I for one am patiently waiting for more innovation in this area and seeing that there are companies that try to disrupt/improve it makes me hopeful and I appreciate it.
k3s is lightweight; even I have clusters running, and I can easily sync them if I wish. I agree, it seems odd they didn't go with some kube design on-prem.
I'm not sure how well 37Signals is doing these days - Hey didn't make as big an impact as they had hoped and Basecamp probably has a core of loyal users but I don't think it's getting a ton of new customers. They're small and could probably keep going until their founders decide to retire though.
It does seem like they just moved all of their infra components, and got rid of autoscaling.
Load balancing, logging, and other associated components are all still there. Almost nothing changed in the actual architecture, just how it was hosted.
I have a hard time seeing why this was beneficial.
That answers my question, they can afford it if they wanted to. Obviously they don't want to. I'm in their camp when it comes to the cloud vs own hardware.
Zero, which is why we're not using k8s on-prem. Our team is already handling the on-prem hardware/software environment, and this will consolidate our apps on a single platform methodology, allowing us to keep the same team size. Using mrsk allows us to reduce the complexity of our servers, moving that into the Dockerfile.
If we had gone down the k8s on-prem rabbit-hole, I suspect we would have required more folks to manage those components and complexity.
I don't understand how having k8s means you need significantly more people.
It's just concepts put into a strict system. Now you're just shimming the same concepts with less-supported hacks. Now you have to train your team on a less-used technology that isn't transferable to other roles. Sounds like technical debt to me.
We're arguing about generic approaches and the 37Signals folks are making specific decisions about their very specific situation (their app, their staff having time or not, their budget, etc).
To be fair, they don't seem to be saying their strategy is for everybody but the audience thinks so? I think we're talking past each other, tbh.
This company invented Ruby on Rails and was in business before ‘cloud’ was a thing. Many things can be said about 37signals and DHH in particular, but lacking proper experience is definitely not one of those.
The answer to your question is because people's experiences differ, wildly in some cases.
Your account was created 18 hours ago, so I can't really see what support there is for this specific throwaway account to be declared an expert in anything. Are you a self-proclaimed expert or a world-renowned expert? Since they are a world-renowned bunch… :)
I only create my accounts adhoc because I spend too much time on discussions otherwise.
But my argument was more in sense of contradicting the original argument. No one is an expert just because.
As for myself, I'm a cloud architect at a very big company and have introduced a k8s-based platform in two projects: one internal on GKE and one in an open-source project.
Both are used by 15-20 teams.
I also run k8s at home for fun and in a small startup.
I've been doing primarily k8s for probably 5 years and was a software engineer before that.
I have been saying this to my customers for a long time, most projects do not really benefit from K8s but on the contrary, it is a huge operational/cost compromise to use K8s for a monolith app that does simple CRUD operations where occasional downtime is actually acceptable.
In my last project, I removed the unnecessary complexity that K8s was bringing and went back to ansible scripts, which has worked nicely.
With another customer, we inherited a frontend application that was being deployed with K8s while vercel is a considerably simpler/faster alternative.
K8s certainly has its advantages but I'd bet that many projects using it do not gain much.
My impression is that it makes deploying your 100th server much easier, at the cost of making your first several much harder. If you're going to have 100+ servers, that's probably worth it. If you're not (and most companies aren't), then it's like getting your CDL so that you can go to the grocery store in a semi-tractor trailer, when you should have driven there in a compact car.
This seems like an application/stack that didn't have a valid need for k8s in the first place. Don't just use K8s because it's what people say you should do. Evaluate the pros and the VERY real cons and make an informed decision.
That's why we've had good results with ECS. Feels like 80% of the result for 20% of the effort, and I haven't found our use cases needing that missing 20%.
On the Google cloud side, using Google Cloud Build with Cloud Run with automatic CI/CD is very straightforward. I set up automated builds and deploys for staging in 2 hours. For production I set it up to track tagged branches matching a regex.
We use Fargate, and what we launch is tightly coupled to our application (background jobs spin down and spin up tasks via the SDK) so for now, we aren't doing anything with IaC, other than CI deployment.
When I had to set up ECS with Fargate using CloudFormation the documentation was certainly lacking (in late 2019 I think it was).
Now that it's working it's been pretty low maintenance.
It has definitely gotten better over time, but we tend to do a lot of stuff ad-hoc that finds its way into production lol, so we aren't yet relying on any infra as code.
“Need”
Eh, I do it because it’s awesome for a single box or thousands. Single sign on, mTLS everywhere, cert-manager, BGP or L2 VIPs to any pod, etc and I can expand horizontally as needed. It’s the best for an at home lab. I pity the people who only use Proxmox.
Throughout my company’s pursuit of moving everything under the sun into AWS I have done my best to keep everything able to be migrated, we have some systems which are just, simply going to have to be completely rebuilt if we ever needed to move them off of AWS, because there is not a single component of the system that doesn’t rely on some kind of vendor lock-in system AWS provides.
I aim to keep everything I’m working on using the simplest services possible, essentially treating AWS like it’s Digital Ocean or Linode with a stupidly complex control panel. This way if we need to migrate, as long as someone can hand me a Linux VM and maybe an S3 interface we can do it.
I really just have trouble believing that everyone using Kubernetes and a bunch of infrastructure as code is truly benefiting from it. Linux sysadmin isn’t hard. Get a big server with an AMD Epyc or two and a bunch of RAM, put it in a datacenter colo, and maybe do that twice for redundancy and I almost guarantee you it can take you at least close to 9 figures revenue.
If at that point it’s not enough, congratulations you have the money to figure it out. If it’s not enough to get you to that point, perhaps you need to re-think your engineering philosophy(for example, stop putting 100 data constraints per endpoint in your python API when you have zero Postgres utilization beyond basic tables and indexes).
If you still really genuinely can’t make that setup work, then congratulations you are in the 10%(maybe) of companies that actually need everything k8s or “cloud native” solutions offer.
I would like to note that given these opinions, I do realize there are problems that need the flexibility of a platform like AWS, one that comes to mind is video game servers needing to serve very close to a high number of geographic areas for latency concerns.
> I aim to keep everything I’m working on using the simplest services possible, essentially treating AWS like it’s Digital Ocean or Linode with a stupidly complex control panel.
What's the benefit of AWS then, if you're not using any of the managed services AWS offers, and are instead treating AWS as an (overly expensive) Digital Ocean or Linode?
Wow.
"K8s is simple", it has the same vibes as Linux user vs Dropbox:
'...you can already build such a system yourself quite trivially by getting an FTP account, mounting it locally with curlftpfs, and then using SVN or CVS on the mounted filesystem'
https://news.ycombinator.com/item?id=8863
It's not that Kubernetes is simple (it's not), but Kubernetes is relatively simple compared to the task it accomplishes.
If you have containers that need to be scheduled and networked and supplied with storage and exposed to the internet across a large set of machines past the scale where you can easily do so with tools like docker-compose, Kubernetes (might) be for you. There's a good chance it will be simpler to understand and reason about than the homegrown kludge you could make to do the same thing, especially once you understand the core design around reconciliation loops.
That said, you might not need all that, and then you probably shouldn't use Kubernetes.
Tell that to the myriad of folks making their money off of peddling it. You'd swear it were the only tool available based on the hype circles (and how many hiring manager strictly look for experience with it).
I gotta say, from a dev perspective it is a very convenient solution. But I wouldn't recommend it to anyone that runs anything less complex than "a few services and a database". The tens of minutes you save writing deploy scripts will be replaced by hours of figuring out how to do it the k8s way.
From an ops perspective: let's say I ran it from scratch (as in "writing systemd units to run k8s daemons and setting up a CA to feed them", because back then there was not much reliable automation around deploying it), and the complexity tax is insane. Yeah, you can install some automation to do that, but if it ever breaks (and I've seen some breaking), good fucking luck; a non-veteran will have a better chance reinstalling it from scratch.
Except it was created to model virtually every solution to every compute need. It’s not about the compute itself, it’s about the taxonomy, composability, and verifiability of specifications which makes Kubernetes excellent substrate for nearly any computing model from the most static to the most dynamic. You find kubernetes everywhere because of how flexible it is to meet different domains. It’s the next major revolution in systems computing since Unix.
I (roughly) believe this as well[0], but more flexibility generally means more complexity. Right now, if you don't need the flexibility that k8s offers, it's probably better to use a solution with less flexibility and therefore less complexity. Maybe in a decade if k8s has eaten the world there'll be simple k8s-based solutions to most problems, but right now that's not always the case
[0] I think that in the same way that operating systems abstract physical hardware, memory management, process management, etc, k8s abstracts storage, network, compute resources, etc
Always two extremes to any debate. I've personally enjoyed my journey with it. I've even been in an anti-k8s company running bare metal on the Hashi stack (won't be running back to that anytime soon). I think the two categories I've seen work best are either something like ECS or serverless, and Kubernetes.
De-clouding is going to be a huge trend as companies are pressured to save costs, and they realize on-prem is still a fraction of the cost of comparable cloud services.
This whole cloud shift has been one of the most mind-blowing shared delusions in the industry, and I'm glad I've mostly avoided working with it outright.
The thing that gets me about it is the very real physical cost of all this cloud waste.
The big cloud providers have clear cut thousands of acres in Ohio, Northern VA, and elsewhere to build their huge windowless concrete bunkers in support of this delusion of unlimited scale.
Hopefully as the monetary costs become clear their growth will be reversed and these bunkers can be torn down
Much more than efficient. You think AWS is getting the same CPU normal civilians get? No way dude. Those guys are big enough that they can get custom hardware just for their specific needs. Their cooling systems, power systems, everything is way more efficient. And they are big enough they can afford to measure every single metric that matters and optimize every one.
For what it's worth, large providers will always need datacenters. But perhaps datacenters run by public cloud providers today will be sold off to larger businesses running their own infrastructure someday at a discount. Most of the infrastructure itself all will age out in five or ten years, and would've been replaced either way.
Heck, datacenters in Virginia are likely to end up being sold directly to the federal government.
Our firm started the big cloud initiative last year. We have our own datacenters already, but all the cool startups used cloud. Our managers figure it'll make us cool too.
This sort of thing is absolutely insane. Like, sure, small office, no existing datacenter infrastructure, it might make sense to bootstrap your business on someone else's cloud. But if you literally have a cooled room and an existing network infrastructure, it's absolutely silly to spend money on using someone else's.
Something I feel like these conversations seem to miss is that it is not binary; you don't have to host hardware on-prem if you don't want to be in AWS. There are other clouds. There are Sungards of the world where you can pay for racks of managed hardware. There are a lot of options between buying and managing your own hardware and AWS.
Good for them. Now they have a one-off to manage themselves. It’s pretty easy to de-cloud using something like k3s. So much value added in Kubernetes to leverage. But they have Chef and they’re a Ruby shop, I guess they’ll be good.
TBH, Kubernetes has some really rough edges. Helm charts aren’t that great and Kustomize gets real messy real fast.
The scope of their self-developed tool doesn't seem very large; it looks like it could be a wrapper around SSH. I've done similar things using an SSH library with Python to deploy and run docker-compose YAMLs on multiple servers.
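A minimal sketch of that pattern, assuming Fabric as the SSH library; the hostnames, user, and paths are hypothetical:

    from fabric import Connection

    HOSTS = ["app1.example.com", "app2.example.com"]   # hypothetical hosts
    REMOTE_DIR = "/srv/myapp"                          # hypothetical path

    def deploy(host: str) -> None:
        c = Connection(host, user="deploy")
        c.run(f"mkdir -p {REMOTE_DIR}")
        # Push the compose file, then pull images and (re)start the stack.
        c.put("docker-compose.yml", f"{REMOTE_DIR}/docker-compose.yml")
        c.run(f"cd {REMOTE_DIR} && docker compose pull && docker compose up -d")

    if __name__ == "__main__":
        for host in HOSTS:
            deploy(host)

Which is more or less the core of what a Capistrano-style tool does, minus rollbacks, health checks, and secrets handling.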
There are many of these tools out there. When I was working for Technicolor Virdata some years ago, we were heavily invested in https://github.com/infochimps-labs/ironfan. It was extensible, we had support for SoftLayer and IBM SCE, and we had some patches to make the bootstrap and the kick command perform faster. But it was still slow and people didn't like Ruby (I don't mind it).
Even back then I wasn’t a fan of doing a proactive ssh connection to the node. I always leaned towards the machine pulling artefacts and deploying them. Like Flux CD does.
> It also misses the entire sphere around identity and access management for those resources that also needs to be maintained
Well, how is this all solved with their new tooling? Like they describe a whole huge complicated problem space and then write a tool for the simplest part of it: deploying an app. :shrug:
We use k8s to run the app both on AWS, on our own hardware in a few datacenters (in countries with strict personal data laws) and on clients' own servers as well (something like the banking sector or a jewelry company, i.e. companies which don't trust the cloud).
From what I heard, AWS is the most stable and easiest to work with of all; the servers which run on our own hardware have more outages and our SRE team often needs to make trips in person to the datacenters to replace hardware etc. Clients' hardware is the faultiest (unsurprisingly). Ideally we'd rather host everything on AWS :)
The thing I noticed is that they are not using any other AWS services. No S3, Elasticache, DynamoDB, etc. They are just running applications and databases.
This will not be the case with many people using cloud, and a migration to bare metal will be much harder. Each of those services needs an equivalent to be deployed and managed, and its features might not be up to what the AWS equivalent has.
Even the stuff that they are moving (databases, load balancers, etc) is significant operational overhead. In AWS database fail-over is an option you tick. Self hosting has whole books written about how to do database high availability.
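To the "option you tick" point, a hedged boto3 sketch; the identifier, instance size, and credentials below are placeholders, and in practice you'd pull the password from a secrets store:

    import boto3

    rds = boto3.client("rds")
    rds.create_db_instance(
        DBInstanceIdentifier="example-db",      # placeholder name
        Engine="mysql",
        DBInstanceClass="db.m6g.large",         # placeholder size
        AllocatedStorage=100,
        MasterUsername="admin",
        MasterUserPassword="change-me",         # placeholder; use a secrets manager
        MultiAZ=True,   # the "tick": synchronous standby + automatic failover
    )

Self-hosting the equivalent means owning replication setup, failover orchestration, backups, and monitoring end to end.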
The whole kubernetes section of this writeup is two sentences. They went with a vendor provided kube & it was expensive & didn't go great.
It just sounds like it was poorly executed, mostly? There are enough blogs & YouTube videos of folk setting up HA k8s on a couple of RPis, & even the 2GB model works fine if you accept not-quite-half the RAM as overhead on the apiserver/etcd nodes.
It's not like 37signals has hundreds of teams & thousands of services to juggle, so it's not like they need a beefy control plane. I don't know what went wrong & there's no real info to guess by, but 37s seems like a semi-ideal, easy lock for k8s on-prem.
It seems like a lot of effort to do less. Hopefully it helps others too, I guess. But it feels like a problem space with a lot of inherent complexity that's liable to expand over time, & I'd be very skeptical of folks who opt to greenfield it all.
Sure, there is some inherent complexity, but by writing their own tool, they get to choose exactly how to handle the complexity for their particular use case, instead of having it dictated by a general-purpose tool developed by a consortium of US corporations. I consider that a win.
If they have the manpower and expertise to do that, more power to them!
Wow, uh, this is just such a sad short statement. It's just so woefully out of touch, so baselessly derogatory.
Kube is mostly a pretty generic idea, that greatly empowers folks to write their own stuff. There are dozens of gitops systems. There are hundreds of event-based systems. They almost all have some Custom Resources registered in API Server, but that's because it's good & doesn't encumber anyone. Beyond that it feels like the sky is the limit.
There are some deeper kube things. There's a Scheduler Framework with a big focus on extensibility and modular plugins, creating a lot of flexibility to keep this general.
This zeal, this desire to feel oppressed, this righteousness of rebellion: I wish it could also reflect on & understand options & cooperation & possibility, and see how a lot of the terrifying forces out there don't want us all consigned to narrow fixed paths. More people than you acknowledge want to potentiate & enrich. The goal of these efforts is anything but to dictate to us how we do things, and it's so easy, so simple to see that, to explore how flexible & varied & different these world-class cluster operating systems we're working on together are, how they help us accomplish many different ends, and how they help us explore new potential ends.
On one hand, yes, in theory k8s is pretty extensible. In practice, though, you always end up being forced to do things you do not want or need to do, or being prevented from doing things you want to do, because of vendor specifics. Sometimes that is an acceptable tradeoff, sometimes not.
Plus, it is always good to take a step back and appreciate that monoculture is a bad thing in computing. We always need more different approaches, viewpoints, solutions to the same problems. Should everyone roll their own? Of course not - that's why I mentioned having sufficient manpower and expertise to do that.
We should be applauding having more choices and cheering, not scolding those who strive to provide them.
As for your last paragraph, I completely agree, we need to share the knowledge and cooperate. But expecting corporations to "potentiate & enrich" us is rather naive. They will play nice only as long as they need to, and the minute their financial incentives do not align with sharing, they will do their best to pull the rug from underneath everybody else. Even their sharing phase is only to build levers to use in the future. We've seen it over and over and over for the past several decades, with Oracle, SCO, Microsoft, Apple, Google, ... heck, I could pretty much list all big companies.
So as an industry we've been having some version of this debate (at FB we were having it at least as far back as 2014, my org was IIRC the first big one to test-drive our Borg-alike container solution).
These days I think maybe it's just that classic dilemma: over-design and over-build to be ready for contingencies, or build just what we know we need and maybe get caught with our slacks down. This goes by a zillion names: WET vs DRY, YAGNI, microservice vs monolith; there are countless variations on the same core idea.
If you start with PHP and MySQL and a chain-smoking sysadmin, and you get hit with hyper-growth then you adapt or die, and you have a mountain of data to figure it out. This is paradoxically an easier decision tree (IMHO) even if maybe some of the engineering is harder or at least higher-stress.
But by far the more common case is that we're building something that isn't huge yet, and while we hope it goes huge we don't actually know if it will: should we build more features and kinda wing it on the operability/economical/automated can of worms, or should we build for the big time from day one?
I think it's a legitimately hard set of questions and reasonable people can disagree. These days I think the only way to fully screw it up is to get ideological rather than pragmatic about it.
A lot of people are kind of missing the forest for the trees here. Ignore the fact that what they're doing is probably a terrible idea for most other people. If it works for them, that's fine. It might only work for them, and that's fine.
Don't paint your bike shed orange just because somebody famous painted theirs orange. They have their reasons. Paint yours whatever color works best for you, for your own reasons.
It's anecdotal but the sentiment I have is that the Kubernetes ecosystem drains an even bigger part of the collective effort required to provide business value. I believe many engineers have a disconnect on what it means to provide real business value.
Solutions like Kubernetes are designed to be able to accommodate an endless number of scenarios out of which you probably only need 1 to provide value for your business. The consequence is that there's a disproportionate ratio of Kubernetes possibilities hence complexities vs. the simplicity of your requirements. Once your workload runs on Kubernetes, you cannot afford to ignore the complexities of Kubernetes so you are automatically sucked into the rabbit hole.
They could, but instead they're doing something closer to static scheduling. They have a small set of applications and a lot of visibility into what their needs are going to be, so the complexity of a dynamic scheduler might not pay its own freight in their environment.
I like Nomad a lot and it's what I would use if I were migrating a "halfheartedly" K8s application to on-prem metal, but I couldn't blame someone who felt burned by K8s complexity for not investing in another dynamic scheduler.
K8s, Docker and AWS/GCP/Azure are to ops what React is to web development, ie. rarely the appropriate tool for the job. Trouble is you now have a generation of devs who have no experience with anything else.
At one of my former workplaces we ran Kubernetes on premises and it worked like a charm. I still think that Kubernetes can be a good fit for microservices even if you use your own hardware.
I think it is cool they are developing new tooling. I don’t understand all the negativity. Isn’t it good that people keep innovating in this space?
Also, how is this different from deploying before k8s and TF etc. were a thing? We would write our own scripts to manage and deploy our servers. This is the same, no? Just a bit more structured, and it has a name.
37signal folks love to put a spin on anything they do as if it's ground breaking or super innovative... but it rarely is. In particular, they love to take a contrarian position. Like their books, there really isn't anything interesting written here.
I'm not going to put this down, because it sounds like they're quite happy with the results. But they haven't written about a few things that I find to be important details:
First, one of the promises of a standardized platform (be it k8s or something else) is that you don't reinvent the wheel for each application. You have one way of doing logging, one way of doing builds/deployments, etc. Now, they have two ways of doing everything (one for their k8s stuff that remains in the cloud, one for what they have migrated). And the stuff in the cloud is the mature, been-using-it-for-years stuff, and the new stuff seemingly hasn't been battle-tested beyond a couple small services.
Now that's fine, and migrating a small service and hanging the Mission Accomplished banner is a win. But it's not a win that says "we're ready to move our big, money-making services off of k8s". My suspicion is that handling the most intensive services means replacing all of the moving parts of k8s with lots of k8s-shaped things, and things which are probably less-easily glued together than k8s things are.
Another thing that strikes me is that if you look at their cloud spend [0], three of their four top services are _managed_ services. You simply will not take RDS and swap it out 1:1 for Percona MySQL, it is not the same for clusters of substance. You will not simply throw Elasticsearch at some linux boxes and get the same result as managed OpenSearch. You will not simply install redis/memcached on some servers and get elasticache. The managed services have substantial margin, but unless you have Elasticsearch experts, memcached/redis experts, and DBAs on-hand to make the thing do the stuff, you're also going to likely end up spending more than you expect to run those things on hardware you control. I don't think about SSDs or NVMe or how I'll provision new servers for a sudden traffic spike when I set up an Aurora cluster, but you can't not think about it when you're running it yourself.
Said another way, I'm curious as to how they will reduce costs AND still have equally performant/maintainable/reliable services while replacing some unit of infrastructure N with N+M (where M is the currently-managed bits). And also while not being able to just magically make more computers (or computers of a different shape) appear in their datacenter at the click of a button.
I'm also curious how they'll handle scaling. Is scaling your k8s clusters up and down in the cloud really more expensive than keeping enough machines to handle unexpected load on standby? I guess their load must be pretty consistent.
> First, one of the promises of a standardized platform (be it k8s or something else) is that you don't reinvent the wheel for each application. You have one way of doing logging, one way of doing builds/deployments, etc.
You can also hire people with direct relevant experience with these tools. You have to ramp up new developers to use the bespoke in house tooling instead.
Yes and no. Different types of memory management essentially accomplish the same thing. The way you build for them and their performance characteristics vary. In that way, scaling is the same.
But scaling is different in that your physical ability to scale up with on-prem is bounded by physically procuring/installing/running servers, whereas in the cloud that's already been done by someone else weeks or months ago. When you shut off on-prem hardware, you don't get a refund on the capex cost (you're only saving on power/cooling, maybe some wear and tear).
It's not just that you need to plan differently, it's that you need to design your systems to be less elastic. You have fixed finite resources that you cannot exceed, which means even if you have money to throw at a problem, it doesn't matter: you cannot buy your way out of a scaling problem in the short-medium term. If you run out of disk space, you're out of disk space. If you run out of servers with enough RAM for caching, you're evicting data from your cache. The systems you build need to work predictably weeks or months out, and that is a fundamentally different way of building large systems.
This is it, and what so many anti-cloud people are missing. For startups, how can you possibly take a gamble on trying to predict what your traffic is going to be and paying upfront for dedicated servers? It puts you in a lose-lose situation: if your product is not the right fit, you've got a dedicated server you're not using; if your product is a success, well, now you need to go and order another server and hope you can get it spun up before everything falls over. I worked at a startup where we saw a 1000x increase in load in a day due to a customer's app going viral. On-prem would have killed us; the cloud saved us.
And you are bang on about managed services. RDS is expensive, no doubt, but having your 4-person dev team burn through your seed round messing around with database backups and failover is a far higher cost.
Of course some companies grow out of the cloud, they have full time ops engineers and can predict traffic ahead of time - for sure, go back to on prem. But for people to hold up articles like this and say "I always said cloud was pointless!" is just absurd.
OK, if you don't want to get good at planning as a company, that's fine. It's OK, just please don't pretend that it's impossible.
I worked at a startup that did the crazy scaling with physical servers just fine. No problem. The marketing department knew ahead of time when something was likely to go viral, IT/Dev knew how much capacity was needed per user and procurement knew lead time on hardware + could keep the vendors in the loop so that hardware would be ready on short notice.
With good internal communication it really is possible to be good at capacity management and get hardware on short notice if required.
Normally we would have servers racked and ready about 2 weeks after ordering, but it could be done in under half a day if required.
Edit: (we had our own datacentre and the suppliers were in a different state)
> The marketing department knew ahead of time when something was likely to go viral
That's fine when it's your product. The situation I'm talking about was a SaaS product providing backend services for customers' apps. Our customers didn't know if their app was going to go viral, so there is no way we could have known. I maintain on-prem would have been totally inappropriate in this situation.
Also, "the marketing department knew ahead of time when something was likely to go viral"... that is quite a statement. They must have been some marketing department.
Depending on your business use-case, sharing a pool of IPs can have a detrimental impact on access. For example, you may find the prior users were doing unauthorized security scans, spamming email, or hosting proxies.
i.e. if you get an IP block with a bad reputation, then you may find running a mail or VoIP server problematic.
If you are running purely user-centric web services, then it doesn't matter as long as you are serving under around 30TiB/month.
There is also the issue of future offline decryption of sensitive records without using quantum resistant storage cryptography.
Rule #4: The first mistake in losing a marathon is moving the finish line. =)
Sounds to me like 37signals uses the risk aversion paradigm typical for stagnating businesses — instead of building and refining their strengths they're fixated on mitigating their weaknesses.
I've been following their move to on premise with interest and this was a great read. I'm curious how they are wiring up GitHub actions with their on premise deployment. How are they doing this?
The best I can think of for my own project is to run one of the self hosted GitHub actions runners on the same machine which could then run an action to trigger running the latest docker image.
Without something like that you miss the nice instant push model cloud gives you and you have to use the pull model of polling some service regularly for newer versions.
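For what it's worth, the pull model can be pretty small: a cron job on the target box that re-pulls and recreates containers when something changed. A rough sketch, assuming a docker-compose based app (paths and names are made up, and this is not what 37signals describes doing):

```sh
#!/bin/sh
# naive pull-model deploy, run from cron every few minutes.
# avoids exposing SSH or running a self-hosted Actions runner,
# at the cost of a polling delay.
cd /srv/myapp || exit 1
docker compose pull --quiet            # fetch newer images for the tags in the compose file, if any
docker compose up -d --remove-orphans  # recreate only containers whose image/config changed
```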
What do you do then, if you don't mind me asking? I see this problem time and time again for self-hosting and using CI/CD, and every time it seems to come down to exposing SSH, polling for new versions, or running the GitHub Actions runner on the same machine as the app or service.
K8s has some cognitive overhead.
For a simple deploy, a Docker client-server setup with docker-compose is a winner; see misterio[1], which basically leverages docker-compose + SSH.
But when you need to guarantee that the system will auto-restart, run healthchecks, and so on, K8s is the de facto standard.
Helm's template language (based on Go templates) is not ideal, but it is difficult to replace K8s with simpler systems nowadays.
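To illustrate the simple-deploy end of that spectrum: plain Docker over SSH (roughly what a misterio-style setup drives) already gives you restart policies and healthchecks. A minimal sketch with made-up host, image, and endpoint names:

```sh
# run the app on a remote host over SSH; restart it on crash and track health.
# note: Docker only *marks* the container unhealthy, it won't reschedule it
# the way K8s would, which is exactly the gap mentioned above.
DOCKER_HOST=ssh://deploy@app-host docker run -d \
  --name web \
  --restart unless-stopped \
  --health-cmd 'curl -fsS http://localhost:3000/up || exit 1' \
  --health-interval 30s --health-retries 3 \
  myorg/myapp:latest
```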
The timing is perfect: several of my clients fell through at Series B and are now looking to cut cloud costs (all way over-provisioned for their traffic and customer numbers).
They're saying a VM takes seconds to boot up. Yeah, only because they run static dedicated servers; of course in the cloud, if you wait for the VM to come online, it's going to take longer. Now, how long does it take them to add a new dedicated server to that pool of servers? Days?
The other main issue I see is that they use Chef and mrsk to set up applications, so how is Filebeat set up? Is it Chef that sets it up, or mrsk?
I started my career at a company that was excellent at capacity management and prediction. Using physical hardware they never hit a capacity problem, ever, despite growing like crazy. This did require the Marketing department being in close communication with the IT department about upcoming campaigns.
Everywhere else, though, has been terrible at predicting future capacity needs. As far as I can tell that's because they just use tools that give a prediction based only on historical growth.
I guess my point is that it's entirely possible to be good at capacity management, and if you are then the lead time disadvantage of physical hardware can be completely negated.
It's easier to massively over provision or use the cloud than it is to get good at capacity planning. Same as how it's easier to use a GC than it is to do manual memory management.
They are all valid strategies, the key is picking the one that suits your situation.
If you need a small to medium amount of resources then the cloud is likely the cheapest option.
If you need a medium to high amount of resources then massively over provisioning can still be cheaper than using the cloud.
The cheapest option for anything medium size and above is physical servers with good capacity management.
Good capacity management requires good internal communication between business units. And making predictions based on expected/planned events not just historical data.
We have 7 racks, 3 people, and the actual hardware work is a minuscule part of that. A few hundred VMs, anything from "just software running on a server" to k8s stacks (the biggest one is 30 nodes), 2 Ceph clusters (ours and a client's), and a bunch of other shit.
The stuff you mentioned is, amortized, around 20% (automation ftw). The rest of it is stuff that we would do in the cloud anyway, and the cloud is in general harder to debug too (we have a few smaller projects managed in the cloud for customers).
We did the calculation to move to the cloud a few times now; it was never even close to profitable, and we wouldn't save on manpower anyway, as 24/7 on-call is still required.
So I call bullshit on that.
If you are a startup, by all means go cloud.
If you are small, same thing: going on-prem is not worth it.
If you have spiky load, cloud or hybrid will most likely be cheaper.
But if you have constant load (by that I mean the difference between peak and lowest traffic is "only" like 50-60%) and need a bunch of servers to run it (say 3+ racks), it might actually be cheaper on-site.
Or a bunch of dedicated servers. Then you don't need to bother managing hardware, and in case of a boom you can even scale relatively quickly.
Every one of your examples in the second list is relevant to both on-prem and cloud. The cloud also has on-call, just not for the hardware issues (you'll still likely get a page for reduced availability of your software).
The problem here is “cloud” can mean different things.
If you’re talking about virtual machines running in a classical networking configuration then you’re not really leveraging “the cloud”: all you’ve done is shifted the location of your CPUs.
However if you’re using things like serverless, managed databases, SaaS, then most of the problems in the second list are either solved or much easier to solve in the cloud.
The problem with “the cloud” is you either need highly variable on-demand compute requirements or a complete re-architecture of your applications for cloud computing to make sense. And this is something that so many organisations miss.
I’ve lost count of the number of people who have tried to replicate their on-prem experience to cloud deployments and then came to the same conclusions as yourself. But that’s a little like trying to row a boat on land and then saying roads are a rubbish way to filter traffic. You just have to approach roads and rivers (or cloud and on-prem) deployments with a different mindset because they solve different problems.
This is simply not true unless you build in the cloud the same way you build on prem and just have a bunch of VMs. PaaS services get you away from server / network / driver maintenance and handle disaster recovery and replication out of the box. If you're primarily using IaaS, you likely shouldn't be in the cloud unless you're really leveraging the bursting capabilities.
“Just not for the hardware issues” is a huge deal though. That’s an entire skillset you can eliminate from your requirements if you’re only in the cloud. Depending on the scale of your team this might be a massive amount of savings.
At my last job, I would have happily gone into the office at 3am to swap a hard drive if it meant I didn't have to pay my AWS bill anymore. Computers are cheap. Backups are annoying, but you have to do them in the cloud too. (Accidentally deleting your Cloud SQL instance deletes all the automatic backups with it, so you have to roll your own if you care at all. Things like that; cloud providers remove some annoyances, and then add their own. If you operate software in production, you have to tolerate annoyance!)
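(If you do roll your own there, the usual workaround is a periodic export to a bucket you control; roughly something like this on a cron, with hypothetical instance/bucket/database names:)

```sh
# nightly export so a deleted instance doesn't take its backups with it;
# the instance, bucket, and database names below are placeholders.
gcloud sql export sql my-instance \
  "gs://my-offsite-backups/app-$(date +%F).sql" \
  --database=app
```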
Self-managed Kubernetes is no picnic, but nothing operational is ever a picnic. If it's not debugging a weird networking issue with tcpdump while sitting on the datacenter floor, it's begging your account rep for an update on your ticket twice a day for 3 weeks. Pick your poison.
The flip side is there is an entirely new skillset required to successfully leverage the cloud.
I suspect those cloud skills are also higher demand and therefore more expensive than hiring for people to handle hardware issues.
Personally, I appreciate the contrarian view because I think many businesses have been naive in their decision to move some of their workloads into the cloud. I'd like to see a broader industry study that shows what benefits are actually realized in the cloud.
Right. The skillset to pull the right drive from the server and put the replacement one in.
That says you know nothing at all about actually running hardware, because the bigger problems are by far "the DC might be a 1-5 hour drive away" or "we have no spare parts at hand", not "fiddling with the server is super hard".
Kubernetes is an amazing tool. Cloud computing is a powerful way to leverage a small team and prototype stuff quickly.
Ocean going ships are impressive pieces of kit. CNC machine tools are a powerful way to leverage small teams and manufacture high quality stuff quickly.
Now, telling every repair business in town they need robotic lathes and a fleet of major cargo ships is nonsense.
Why this kind of discourse thrives in software is beyond me.
Because I have some light sensitivity issues, I use browser extensions including Dark Reader and Midnight Lizard to enforce my own 'dark mode' across the web.
You can also use extensions like that to set the contrast to a more comfortable level on websites that are already dark.
I highly recommend this if you have light sensitivity issues like me.
Also note that when the contrast on a page is higher, you can generally get away with lower brightness. This is pretty convenient on phones, and probably more necessary as well since on a phone you're more likely to have an OLED screen that really surfaces extreme contrast like white on black.
There are some great web extensions for a lot of things. I don't use any of them because most require permission to read data across all sites; that makes sense for them to work, but it's still a dealbreaker for me.
Fair enough. I only use long-lived, open-source browser extensions for that kind of global restyling. But of course there's still a risk that they could be compromised somehow.
37signals has many technical people and can afford to de-k8s. But K8s is designed for a totally different use case: large corps with mostly non-IT staff, where IT resources and standards need to be managed in a more central way. Most standard banks or large companies do not want to roll this stuff by hand; they care about STANDARDS!
That's the thing with technology though, it goes mainstream as adoption grows. RoR started small at 37signals and eventually became a standard. MRSK might yet be one, there's no telling right now.
We ran a fast-growing startup (sometimes 100% MoM jumps), 5M active users with 50k concurrent users (not visitors) with DB writes, on 6 machines + 2 DB servers and $100M ARR, 10 years back. If you're this size, MRSK makes total sense.
If you're much larger or growing >50% MoM continuously, K8s in the cloud makes more sense.
Neither cloud nor on-prem fits everyone's requirements; in the end you need to know your environment well. One thing I like about ECS and Fargate is that you can use projects like my_init and get a container to behave closer to a VM (running ssh and other daemons at the same time).
Online deployments are discussed with some frequency. Tooling is talked about. Always as a "cluster". Why do we need clusters anymore? Scaling containers, scaling functions, scaling, scaling, cluster, cluster. We suffer so much tunnel vision about horizontal scaling when it's just unnecessary for most applications these days. The cloud products are all about horizontal.
Do you really need more than the 400 threads, >12TB of RAM, and PBs of storage found in a reasonable high-end server?
Well... in my book, k8s has always been in the "dinosaur" category: somewhat useful, somewhat versatile, perhaps even good. A quick glance at the documentation eradicates any desire to learn the tech.
Looks like an apples vs. oranges comparison. They seem to have a low number of distinct services, so there isn't a real need for k3s/k8s (i.e. orchestration); on the other hand, they do need config management.
Have they thought about just running OpenStack on their own servers? Everything I saw leads me to SaltStack + OpenStack if they don't want to be in the cloud.
I have to imagine part of the reason they need to run so many servers is because they are running Ruby. The same application on say, Elixir, probably would require less hardware, reducing the cost of ECS or similar.
If I was Netflix I would de-cloud, but if I was a small team like 37signals then de-clouding is just insanity. I think DHH is either very stupid or extremely naive in his cost calculations or probably a mix of both. Hey and Basecamp customers will see many issues in the next few years and hackers will feast off their on-premise infrastructure.
They’ve had non-cloud infrastructure for a very long time. Their new orchestration methods notwithstanding, reliability and security are unlikely to suffer.
I find it very interesting that every conversation around k8s turns into a flame war between "just use k8s" and "no you don't need k8s at all". In reality, it is probably more of a spectrum than a boolean value. Also, it seems like people have different definitions of "using kubernetes":
* manage your own k8s cluster on your own hardware: probably pretty hardcore; I've never done this. I'd imagine it'd require me to know about the underlying hardware, diagnose issues, and make sure the computer itself is running before managing k8s itself. Only when the hardware is running properly can I focus on running k8s, which is also operationally expensive. Tbh I don't see a reason for a small/mid scale product to go this route unless they have a very specific reason.
* manage your own k8s cluster on cloud hardware: this seems a bit simpler, meaning that I don't actually need to know much about running/managing hardware; that's what the provider does for me. I have done this before with k3s for some small applications (the standard k3s install one-liner is sketched after this list), and I have 2 small-scale applications running like this for ~2 years now on Oracle's free ARM instances. I don't really do any active work/maintenance on them and they are running just fine. I'd probably have a lot of trouble if I wanted to upgrade the k3s version for large-scale applications, or use cases that have tight SLAs.
* use a managed k8s offering from a cloud provider: I've been doing this one the most, and I find it the easiest way to run my applications in a standardized way. I have experience running applications on this setup for mid-scale as well as multi-national, large-scale consumer-facing applications. Admittedly, even though the scale has been big, the applications themselves have mostly been CRUD APIs, RabbitMQ / Kafka consumers, and some scheduled jobs.
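(For reference, the k3s route in the second bullet really is about this small to get started; this is the upstream installer's documented one-liner, nothing specific to my setup:)

```sh
# installs k3s as a single-node cluster with bundled containerd;
# plenty for the kind of small self-managed setups described above.
curl -sfL https://get.k3s.io | sh -
```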
The trick seems to lie in the word "standardized" here: it is probably possible to run any application on any sort of hardware/orchestration combination, and MRSK could be a really nice solution for that as well. However, in my personal experience I have never managed to find an easier way of running multiple full applications, e.g. things that have multiple components such as web APIs, async workers, etc, in a standardized, replicable way.
I run the following components in one of my cloud-managed k8s clusters:
- Vault
- A few Laravel applications
- A few Golang APIs
- Grafana & Loki
- Metabase
Using k8s for situations like this, where the specific requirements on the underlying infra are not very complex, actually enables a lot of experimentation / progress simply thanks to the ecosystem. For all of these components there are either ready-made Helm charts where I can simply run a `helm install` and be 90% there, or it is trivial to build a simple K8s deployment configuration to run them. In my experience, I couldn't find anything that comes closer to this without having a large engineering team dedicated to solving a very specific problem. In fact, it has been pretty chill to rely on the managed k8s offerings and just focus on my applications.
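For instance, standing up Grafana that way is roughly the following (chart repo URL and release/namespace names are from memory, so double-check before copying):

```sh
# add the upstream chart repo and install with default values;
# tweaking values.yaml comes later, but this is the "90% there" starting point.
helm repo add grafana https://grafana.github.io/helm-charts
helm repo update
helm install grafana grafana/grafana --namespace monitoring --create-namespace
```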
It's a spectrum: there are a billion cases that don't need k8s, and probably a similar number that could actually benefit from it. There's no absolute truth to it other than the fact that k8s is actually useful for certain cases and is for sure not always "resume driven development". This doesn't mean we shouldn't be looking out for better approaches, and there's probably a lot of accidental complexity around it as well, but we could also acknowledge that it is actually a useful piece of software.
I don't know, I feel like I have to pick sides every time this sort of stuff is discussed, as if there is an objective truth, but I am fairly convinced these days that there is a middle ground that doesn't involve fanaticism in either direction.
I've never read such a ridiculous article. I really wanted to give them the benefit of the doubt but good lord. How is any of this simpler or better? It's like they prefer the pain of 2004 mixed with the worst parts of modern infrastructure. The dream of the 2000s really is alive in DHH's head, isn't it?
TFA links to their cloud spend for 2022[0], wherein lies the rub:
> In total, we spent $3,201,564 on all these cloud services in 2022. That comes out to $266,797 per month. Whew!
> For HEY, the yearly bill was $1,066,150 ($88,846/month) for production workloads only. That one service breaks down into big buckets as follows:
What the actual fuck? THREE MILLION DOLLARS? A million for their email service?? I have seen bills much larger, but for what 37signals does I am shocked. There is surely a ton of low hanging fruit to drop the bill despite the claim that it's as optimized as it can get. No way.
Even then, Hey is $99/year, and they claimed to have 25k users in the first month or so as of 2020, that's nearly $2.5MM. I presume they've grown since then. Another 2020 article[2] mentions 3/4 of their users have the iOS app, and the Android app currently shows "50k+ installs" so let's assume we're talking 200-400k users as a ceiling, ignoring attrition, which would pull $20-40MM. Even if it's half that, the cost doesn't seem unreasonable.
They're spending nearly $90k/mo on Hey. Of that the majority is RDS and OpenSearch. TFA makes it clear they know how to run MySQL, why on earth don't they stop running RDS? Both of these can easily be halved if they ran the services manually.
EKS is practically free so whatever. They state they have two deployments for ~$23k/mo total -- production is likely larger than staging but let's assume they're equal -- or ~$12k/mo each. A middle of the road EC2 instance like m4.2xlarge is less than $215/mo which gets more than enough cores and memory to run a rails app or two per node. That works out to around 55 nodes per environment. This benchmark[3] shows an m4.2xlarge can serve 172req/s via modern Ruby on Rails. At 500k users that works out to over 1600 request/user/day which seems excessive but likely within an order of magnitude of reality. These are the folks who wrote RoR so I would hope they can optimize this down further. <10000req/s for $12k/mo is pretty awful, and I'm being conservative.
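Rough numbers behind that paragraph, for anyone who wants to check the arithmetic (all inputs are my own estimates above, not 37signals' actual figures):

```sh
# back-of-the-envelope for the EKS paragraph above; every input is an estimate.
monthly_per_env=12000       # ~$23k/mo split across two environments
instance_cost=215           # rough m4.2xlarge monthly cost
rps_per_instance=172        # benchmark figure cited above
users=500000                # generous user-count ceiling

nodes=$(( monthly_per_env / instance_cost ))      # ~55 nodes per environment
total_rps=$(( nodes * rps_per_instance ))         # ~9,460 req/s
per_user_day=$(( total_rps * 86400 / users ))     # ~1,634 requests/user/day
echo "$nodes nodes, $total_rps req/s, $per_user_day req/user/day"
```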
Then let's talk about the $1MM/mo S3 bill. I'm not sure how to make 8PB cost that much but even the lightest touch at optimizing storage or compression or caching knocks the cost down.
This is all just nuts. There's no reason this all shouldn't be running on AWS or GKE with a much smaller bill. Their apps are predominantly CRUD, some email. Instead they replaced kubernetes with an in-house monstrosity.