Regardless of the merits or drawbacks of "de-clouding" for this particular company, it seems to me that their ops team is just really bored or permanently unsatisfied with any solution.
They say that they've tried deploying their apps in all of:
* Their own Datacenter
* ECS
* GKE
* EKS
* and now back to their own Datacenter
Even with their new "de-clouded" deployment, they seem to have created an absolutely immense amount of complexity to deploy what appears to be a collection of fairly generic Ruby CRUD apps (I might be wrong, but I didn't see anything suggesting otherwise in the post).
They have a huge list of tools and integrations that they've tried out with crazy names: Capistrano, Chef, mrsk, Filebeat, traefik... Complexity-wise it seems on par with a full K8s deploy with all the bells and whistles (logging, monitoring, networking, etc.)
Google says that this company, 37signals, has 34 employees. That seems like a monumental amount of orchestration and infrastructure work unless they're deploying something far more complex than they're letting on.
Idk what the lesson is here, if there is one, but this seems like a poor example to follow.
We're talking about a product that has existed since 2004. They did:
* Their own data center, before Docker existed
* The non-K8s Docker way of deploying in AWS
* The GCP, and then AWS, ways of doing K8s Docker
* Docker in their own data center.
For 20 years of deployment, that doesn't look crazy to me.
The actual components of each environment are pretty standard. They wrote Capistrano, which predates Chef. Filebeat is a tiny part of ELK, which is the de facto standard logging stack. They use a smart reverse proxy, like... everybody else. It's easy to make anything sound complicated if you expand the stack into as many acronyms as you can.
Also, it might be worth calling out: their product launched in 2004, Linode and Xen were launched in 2003, and S3 and EC2 launched in 2006. The cloud as we know it today didn't exist when they started.
Pretty sure they knew the Linode folks and were on there early, if I recall my history correctly. This is from randomly hanging out with one of the Linode owners at a bar in STL back then.
Whether DHH is "right" in some philosophical sense, this is a small company with a lot of technical experience in a variety of technologies and with presumably a lot of technical chops, so generalizing their experience to "cloud is good" or "cloud is bad" isn't really possible.
I mean, I work for a cloud hosting vendor. I'm not saying one side or the other is right, only that people who are dunking on 37signals for this are telling on themselves.
"their own datacenter" both previously and now almost certainly means renting bare metal or colocation space from a provider. I highly doubt they have physically built their own datacenter from scratch
"renting bare metal or colocation space from a provider"
Those are two totally, completely different things. Their own datacenter means their own equipment in a datacenter and could even mean building out their own datacenter. It never, ever means renting bare metal.
Weird, in my company where we are doing the opposite migration (from traditional datacenter where we manage the physical servers to Azure) this is exactly what we mean and say and how we describe it
We talk about "our datacenter" when we really mean racks of servers we rented from Insight, and we say "the cloud" when we refer to Azure. We've never actually had our own datacenter meaning a building we own and manage the entire physical plant of
Almost no one means it that way. Even Twitter is probably leasing colocation space in the "their own datacenter" category vs. GCP and AWS. The evidence is in the fact that Elon was able to just arbitrarily shut down an "entire datacenter". Or that 37signals was able to just arbitrarily move into "their own datacenter" on a whim
Referring to rented servers as colocated servers is flatly wrong, no matter how often people are incorrect about it. Sure, some providers put colocation under the same category as VMs and leased hardware, but that doesn't make them overlap.
OTOH, referring to a datacenter of servers that you lease as a datacenter is one thing, but if you have zero hardware that you own in it, would it really be your datacenter, or would it be "the datacenter"?
A datacenter could be anything from a set of IKEA shelves in a room with Internet and power to a fully built out fancy space with redundant power, fire suppression, a full Internet exchange, et cetera, so it's a bit gatekeepery to try to suggest that only huge companies would ever have their own datacenter or their own space with their own hardware in a datacenter.
The fun part is that they do not understand what it means to have your "own datacenter" vs. renting servers in a colo. It does not matter if you are running on AWS or on Hetzner; it is somebody else's computer.
We were a similar sized company at about the same time - we owned our data centers in the same way we owned our offices - we leased and occupied them. Sure, if the plumbing sprouted a leak the landlord would come in and fix it, but no one would be confused enough to say we didn't have our own office space.
"The fun part is that they do not understand"
YES, 37Signals, a company with a legendary pedigree of pushing technical boundaries and being open-minded about deployment models, totally doesn't know the simple thing that you do.
I don't understand how the first clause in this sentence connects to the second.
With a simple, predictable workload --- what they have --- it can make sense to lean towards static scheduling, rather than dynamic schedulers. K8s and Nomad are both dynamic schedulers.
This is pretty basic stuff; it's super weird how urgently people seem to want to dunk on them for not using K8s. It comes across as people not understanding that there are other kinds of schedulers, as if "scheduling" could only mean what Borg did.
We did! And it did work. And there are def some great things that I (we) love about k8s. Personally, the declarative aspect of it was chef's kiss. "I want 2 of these and 3 of these, please", and it just happens.
Which is the primary reason why we did investigate k8s on-prem. We had already done the work to k8s-ify the apps, let's not throw that away. But running k8s on-prem is different than running your own k8s in the cloud is different than running on managed k8s in the cloud.
Providing all of the bits k8s needs to really work was going to really stretch our team, but we figured with the right support from a vendor, we could make it work. We worked up a spike of harvester + rancher + longhorn and had something that we could use as if it were a cloud. It was pretty slick.
Then we got the pricing on support for all of that, and decided to spend that half million elsewhere.
We own our hardware, we rent cabs and pay for power & network. We've got a pretty simple pxeboot setup to provision hardware with a bare OS that we can use with chef to provide the common bits needed.
It's not 'ultimately flexible in every way', but it's 'flexible enough to meet the needs of our workloads'.
What is your position at 37Signals and how do you like it? I'm really impressed by the innovation that comes out of you guys and the workplace culture you folks have.
Bare vanilla k8s or k3s is nice but it doesn't do much outside of your homelab. Once you want k8s in production in the cloud you have to start thinking about:
- loadbalancing and ingress controller
- storage
- network
- iam and roles
- security groups
- centralized logging
- registry management
- vulnerability scanning
- ci/cd
- gitops
And all this is no less complex with k8s than with nomad, bare docker or whatever they chose. And definitely no less complex because it is on a major cloud provider.
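For a sense of what just the first bullet turns into once you've picked and installed a controller, the routing rule itself is yet another manifest (a minimal sketch; the host and service names are hypothetical), and that's before TLS, the controller's own deployment, and whatever load balancer sits in front of it:

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: app-ingress            # hypothetical name
spec:
  ingressClassName: nginx      # assumes an nginx ingress controller is installed
  rules:
    - host: app.example.com    # hypothetical hostname
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: app      # hypothetical Service fronting the pods
                port:
                  number: 80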
Hey Melingo, I noticed that you responded to a lot of different threads in this post. It seems like you are a bit dismissive of people's experiences using K8s. I have also run K8s at scale, and it is not easy, and it is not out of the box in cloud providers. There are a ton of addons, knobs, and work that has to be done to build a sustainable and "production ready" version of K8s (for my requirements) in AWS.
K8s is NOT easy, and I do not believe that in its current form it is the pinnacle of deployment/orchestration technologies. I am waiting for what is next, because the pain that I have personally experienced around K8s, which I know others are feeling as well, means it is not a perfect solution for everything, and definitely not usable for everyone.
At the end of the day it's a tool, and it is sometimes difficult to work with.
I know you are sharing your experience, and others are as well. Let's not dismiss others' experiences just because they don't match our own; the truth is most likely somewhere in the middle. Especially when so many people are saying they've had pain using K8s.
The initial deployment for EKS requires multiple plugins to get to something that is "functional" for most production workloads. K8s fails in spectacular ways (even using Argo, worse using Argo TBH) that require manual intervention. Local disk support for certain types of workloads is severely depressing. Helm is terrible (templating Yaml... 'nuff said). Security groups, IAM roles, and other cloud provider functions require deep knowledge of K8s and the cloud provider. Autoscaling using Karpenter is difficult to debug. Karpenter doesn't gracefully handle spot instance cost.
I could go on, but these are the things you will experience in the first couple of days of attempting to use K8s. Overall, if you have deep knowledge of K8s, go for it, but it is not the end-all solution to infra/container orchestration in my mind.
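To make the Helm point concrete: charts template a whitespace-sensitive format with text substitution, so even a trivial fragment ends up looking like the sketch below (values and names are illustrative, following the usual `helm create` layout), and getting the `nindent` count wrong just renders invalid YAML:

# values.yaml
replicaCount: 2
resources:
  limits:
    memory: 512Mi

# templates/deployment.yaml (fragment)
spec:
  replicas: {{ .Values.replicaCount }}
  template:
    spec:
      containers:
        - name: app
          resources:
            # the rendered block's indentation is a magic number maintained by hand
            {{- toYaml .Values.resources | nindent 12 }}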
I fought with a workload for over a day with our K8s experts; it took me an hour to deploy it to an EC2 ASG for a temporary release while moving it back to K8s later. K8s IS difficult, and saying it's not has a lot of people questioning the space.
The way I see it is it starts off easy, and quickly ramps up to extremely complex. This should not be the case.
I worked at a company that had their own deployment infra stack and it was 1000x better than K8s. This is going to be the next step in the K8s space I believe and it may use K8s underneath the covers, but the level of abstraction for K8s is all wrong IMO and it is trying to do too much.
The main issues we faced with over 700 VMs were: outdated OS, full disks, full inodes, broken hardware, missing backups or missing backup strategy, OOM.
K8s itself handles much of this: it fixes out-of-memory by restarting the pod, solves storage by shipping logs off the node and killing a pod if it still fills up, and has rollout strategies, health checks, and readiness probes.
It provides an easy deployment mechanism out of the box, adding a domain is easy, and certificates get renewed centrally and automatically.
Scaling is just a replica number, and you have node auto-upgrade features built in.
K8s provides what people build manually out of the box, certified, open sourced and battle tested.
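Concretely, most of that is expressed in a single Deployment manifest. A minimal sketch (the name, image, port, and probe path are placeholders, not anyone's real config):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: web
spec:
  replicas: 3                  # "scaling is just a replica number"
  strategy:
    type: RollingUpdate        # built-in rollout strategy
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
        - name: web
          image: registry.example.com/web:1.2.3   # placeholder image
          ports:
            - containerPort: 8080
          resources:
            limits:
              memory: 512Mi    # exceeding the limit restarts the pod, not the node
          readinessProbe:      # traffic only reaches pods that report ready
            httpGet:
              path: /healthz
              port: 8080
          livenessProbe:       # unhealthy pods get restarted automatically
            httpGet:
              path: /healthz
              port: 8080

The trade-off, as the rest of the thread points out, is everything you have to run underneath to make that manifest mean something.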
> The paradigm shift alone, from doing things step by step vs. describing what you need and then it happens, is a game changer.
I've actually used both in conjunction and it was decent: Ansible for managing accounts, directories, installed packages (the stuff you might actually need to run containers and/or an orchestrator), essentially taking care of the "infrastructure" part for on-prem nodes, so that the actual workloads can then be launched as containers.
In that mode of work, there was very little imperative about Ansible, for example:
- name: Ensure we have a group
  ansible.builtin.group:
    name: somegroup
    gid: 2000
    state: present

- name: Ensure that we have a user that belongs to the group
  ansible.builtin.user:
    name: someuser
    uid: 3000
    shell: /bin/bash
    groups: somegroup
    append: yes
    state: present
This can help you set up some monitoring for the nodes themselves, install updates, mess around with any PKI stuff you need to do and so on, everything that you could achieve either manually or by some Bash scripts running through SSH. Better yet, the people who just want to run the containers won't have to think about any of this, so it ensures separation of concerns as well.
Deploying apps through Ansible directly can work, but most of the container orchestrators might admittedly be better suited for this, if you are okay with containerized workloads. There, they all shine: Docker Swarm, Hashicorp Nomad, Kubernetes (K3s is really great) and so on...
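For the container-host part specifically, the same declarative pattern covers it; a minimal sketch using the standard builtin modules (the package name is an assumption and varies by distro):

- name: Ensure the container engine is installed
  ansible.builtin.package:
    name: docker.io            # e.g. docker-ce on other distros
    state: present

- name: Ensure the engine is running and enabled at boot
  ansible.builtin.service:
    name: docker
    state: started
    enabled: yes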
I'm on GKE. The hosts and control plane are managed for me. All I need to do is build/test/security scan images and then promote/deploy the image (via Helm) when it goes out to prod.
Using config management and introducing config drift and management of the underlying operating system is a lot more to think about, and a lot more that can go wrong.
So you did automation in a broken way. Here's one way to avoid the issues you described on bare metal:
- Only get servers with IPMI so you can remote reboot / power cycle them.
- Have said servers netboot so they always run the newest OS image.
- Make sure said OS image has a config that isn't broken so you don't get full inodes and so it cycles logs.
- Have the OS image include journalbeat to ship logs.
- Have your health checks trigger a recovery script that restarts or moves containers using one of a myriad of tools; monitoring isn't exactly a new discipline.
Yes, it means you have to have a build process for OS images. Yes, it means you need to pick a monitoring system. And yes, it means you need to decide on a scheduling policy.
I wrote an orchestrator pre-K8S that was fewer LOC than the yaml config for my home test K8S cluster. Writing a custom orchestrator is often not hard, depending on your workload; writing a generic one is.
K8S provides one opinionated version of what people build manually, and when it's a good fit, it's great. When it isn't, I all too often see people spend more time trying to figure out how to make it work for them than it would've taken them to do it from scratch.
I ran 1000+ VMs on a self developed orchestration mechanism for many years and it was trivial. This isn't a hard problem to solve, though many of the solutions will end up looking similar to some of the decisions made for K8S. E.g. pre-K8S we ran with an overlay network like K8S, and service discovery, like K8S, and an ingress based on Nginx like many K8S installs. There's certainly a reason why K8S looks the way it does, but K8S also has to be generic where you can often reasonably make other choices when you know your specific workload.
And you don't think k8s made your life much easier?
For me it's now much more about proper platform engineering and giving teams more flexibility again, knowing that the k8s platform is significantly more stable than anything I have seen before.
No, I don't for that setup. Trying to deploy K8S across what was a hybrid deployment across four countries and on-prem, colo, managed services, and VMs would've been far more effort than our custom system was (and the hardware complexity was dictated by cost - running on AWS would've bankrupted that company).
> They have a huge list of tools and integrations that they've tried out with crazy names; Capistrano, Chef, mrsk, Filebeat, traefik
These tools are pretty stock standard in the Systems Engineering world. I think anyone who's been a Systems Engineer and is over 30 has probably deployed every one of these.
One thing I've learned over my mixed SWE and SE career is that infrastructure is expensive and grows regardless of revenue. I didn't truly appreciate this until I launched Kubernetes on Digital Ocean and began running my personal cloud on it. It was costing me over $100/m for very little. That money was gone whether I pushed a ton of VPN traffic over my mesh or not. It didn't care about how much I stored on the disk I reserved, and frankly, that cost was going to grow as time went on. I pulled the plug, set up servers in my house, and wired up Traefik and Docker Compose V2 with a little Tailscale sprinkled on. The servers stay up to date with some scripts, and I deploy new apps on select servers with Docker Compose and Docker profiles.
It's possible for companies to do similar things, but not to the extremes I took it to. A really good infrastructure SWE generally goes for $300k. You can pay people with expertise in these things who can streamline them and create maintainable products out of your infrastructure, or you can pay for Legos and glue from a managed service provider like AWS, GCP, or Azure. At some point the latter's costs will not scale; you'll pivot and cost-reduce many times, maybe even begin rearchitecting. I think there are a lot of companies that are now realizing the cheap money is gone, and the cloud has somewhat relied on cheap money.
This is the company that gave birth to Ruby on Rails. They appear to have a culture of being very opinionated about their tools and unafraid of doing things their own way.
Probably not an example most companies that size ought to follow but I'm glad they were crazy enough to do it!
I think you're right that they're doing it for fun or because they can, primarily. But I am excited to see them pioneer in this area, both because it's more open and hacker friendly, and because they're moving the needle towards healthy competition amongst the providers.
Our big-three cloud hegemony has already shown its ugly sides, both in terms of price (egress, anyone?) and quality (hello, zero interop and opaqueness). I'd argue we've seen significant complexity increases, especially in server-side tech, in the last 5-10 years, with relatively little to show for it, despite massive economic investments. I expect that trend to continue unless we take back the hacker friendliness of infra & ops.
PS. Actually scratch that I’m excited, that’s an understatement. I’m thrilled!
I wonder how much of these movements are them iterating and hunting for ROI in their infrastructure costs. Did GCP and AWS salespeople sell them on the benefits of the cloud, offer discounts, white glove migration help, showed some calculation on how much $$ they will save in the cloud, etc that on paper sounded great, but wasn’t ultimately a good fit?
Their market is probably saturated, and perhaps declining, such that they are reaching for optimizations elsewhere.
There is no such thing as "saving money in cloud".
It is all about convenience and it always costs more than a smart team could achieve elsewhere.
I tend to hear an argument that it is cheaper since you do not have to pay people to maintain those services, but in reality you still need that person to set up and maintain your particular cloud setup. And the services themselves are much much more expensive than maintaining your own servers in a data center.
In my opinion cloud hosting and services are more meant for large corporations where no one wants to take responsibility and is scared of doing anything. Cloud is a nice way to shift the blame if/when things go bad - "but cloud is industry standard, everyone does it".
The Hacker News crowd is drinking its own Kool-Aid on this topic and not recognizing how much cost can be avoided if they just drop EKS from their stack.
Remember that in SRE all the abstractions are leaky and thus having more abstractions means having more complexity not less.
When I read stuff like this it strikes me that probably, by far, their largest operational expense is their staffing cost to orchestrate all of this. I come from a background of running small startups on a shoestring budget. I need to make tough choices when it comes to this stuff. I can either develop features or start spending double digit percentages of my development budget on devops. So, I aim to minimize cost and time (same thing) for all of that. At the same time, I appreciate things like observable software, rapid CI/CD cycles, and generally not having a lot of snowflakes as part of my deployment architecture. I actually have worked with a lot of really competent people over the past two decades and I like to think I'm not a complete noob on this front. In other words, I'm not a naive idiot but actually capable of making some informed choices here.
That has led me down a path of making very consistent choices over the years:
1) no kubernetes and no microservices. Microservices are Conway's Law mapped to your deployment architecture. You don't need that if you do monoliths. And if you have a monolith, kubernetes is a waste of CPU, memory, and development time. Complete overkill with zero added value.
2) the optimal size of a monolith deployment is 2 cheap VMs and a load balancer. You can run that for tens of dollars in most popular cloud environments. Good enough for zero down time deployments and having failover across availability zones. And you can scale it easily if needed (add more vms, bigger vms, etc.).
3) those two VMs must not be snowflakes and must be replaceable without fanfare, ceremony, or any manual intervention. So use Docker and docker-compose on a generic Linux host, preferably of the managed variety. Most developers can write a simple Dockerfile and wing it with docker-compose; it's not that hard, and it makes CI/CD really straightforward: put the thing in the container registry, run the thing. Use something like Github actions to automate. Cheap and easy (a minimal compose sketch follows this list).
4) Use hosted/managed middleware (databases, search clusters, queues, etc). Running that stuff in some DIY setup is rarely worth the development time and operational overhead (devops, monitoring, backups, upgrades, etc). All this overhead rapidly adds up to costing more than years of paying for a managed solution. If you think in hours and market conform rates for people even capable of doing this stuff, that is. Provision the thing, use the thing, and pay tens of dollars per month for it. Absolute no brainer. When you hit thousands per month, you might dedicate some human resources to figuring out something cheaper.
5) Automate things that you do often. Don't automate things that you only do once (like creating a production environment). Congratulations, you just removed the need for having people do anything with terraform, cloudformation, chef, puppet, ansible, etc. Hiring people that can do those things is really expensive. And even though I can do all of those, it's literally not worth my time. Document it, but don't automate it unless you really need to, and spend your money on feature development.
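To make point 3 concrete, the per-VM footprint can be as small as a compose file along these lines (a sketch; the image, port, and health endpoint are placeholders, and it assumes curl exists in the image). CI then just runs `docker compose pull && docker compose up -d` on each VM:

services:
  app:
    image: registry.example.com/myapp:latest   # placeholder image pushed by CI
    restart: unless-stopped
    ports:
      - "8080:8080"
    env_file: .env                             # secrets stay out of the repo
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8080/health"]
      interval: 30s
      timeout: 5s
      retries: 3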
But when I need to choose between hiring 1 extra developer or paying similarly expensive hosting bills, I'll prefer to have the extra developer on my team. Every time. Hosting bills can be an order of magnitude cheaper than a single developer on a monthly basis if you do it properly. For reference, we pay around 400/month for our production environment. That's in Google cloud and with an Elastic Cloud search cluster included.
Other companies make other choices of course for all sorts of valid reasons. But these work fine for me and I feel good about the trade offs.
Agree entirely. I think system design interviews are partly to blame because they select for people who think that the only way to design a system is the cargo cult method that interview prep books and courses preach, which is:
- break everything into microservices
- have a separate horizontally scalable layer for load balancing, caching, stateless application server, database servers, monitoring/metrics, for each microservice.
- use at least two different types of databases because it's haram to store key-value data in an RDBMS
- sprinkle in message-passing queues and dead-letter queues between every layer because every time you break one system into two, there can be a scenario where one part is down but the other is up
- replicate that in 10 different datacenters because I'll be damned if a user in New York needs to talk to a server in Dallas
And all this for a service that will see at most 10k transactions per second. In other words, something that a single high-end laptop can handle.
99.9% of the time your architecture does NOT need to look like Facebook's or Google's. 99% of tech startups (including some unicorns) can run their entire product out of a couple of good baremetals. Stop selecting for people who have no grounding in what normal complexity is for a given scale.
I can't agree more on this. Most products out there with medium to low traffic can be handled just fine like this. The cost of automation is often not worth the financial effort.
There's a dangerous trend in putting microservices everywhere. Then having the same level of quality as a monolith requires an infinite amount of extra work and specialized people. Your product must be very successful to justify such expenses!
My rule of thumb; monolith and PaaS as long as your business can afford to.
I mean it all makes sense if you know nothing of k8s or ansible.
Most companies these days have moved to k8s, so a portion of tech workers have prior knowledge of the k8s model and deployment.
Whether you want to go monolith or not doesn't matter, because you need to replicate the process for at least 2 environments: dev and prod. Not to mention it's good to be prepared in case your prod env gets compromised or nuked.
Where, oh god where, are there more sensibly thinking people like you! This is pragmatic and straight forward. There is very little room for technical make work nonsense in your described strategies. Most places, and many devs I meet cannot imagine how to do their jobs without a cornucopia of oddly named utilities they only know a single path of use.
This is actually a really interesting post to me. I'm currently working at the opposite of a startup with a shoestring budget. We're a medium-sized company with 100 - 150 techies in there. As a unique problem, we're dealing with a bunch of rather sensitive data - financial data, HR data, forecast and planning data. Our customers are big companies, and they are careful with this data. As such, we're forced to self-host a large amount of our infrastructure, because this turns from a stupid decision into a unique selling point in that space.
From there, we have about 7 - 12 of those techies working either in my team, saas operations, our hardware ops team, or a general support team for CI/images/deployment/buildserver things. 5 - 10% of the manpower goes there, pretty much.
The interesting thing is: Your perspective is our dream vision for teams running on our infrastructure.
Like - 1 & 2 & 3: Ideally, you as the development team shouldn't have to care about the infrastructure that much. Grab the container image build templates and guidelines for your language, put them into the template nomad job for your stuff, add the template pipeline into your repository, end up with CD to the test environment. Add 2-3 more pipelines, production deployments works.
These default setups do have a half-life. They will fail eventually with enough load and complexity coming in from a product. But that's a "succeed too hard" kind of issue. "Oh no, my deployment isn't smooth enough for all the customer queries making me money. What a bother." And honestly, for 90% of the products not blazing trails, we have seen most problems before, so we can help them fix their stuff with little effort on their part.
4 - We very much want to standardize and normalize things onto simple shared services, in order to both simplify the stuff teams have to worry about and also to strengthen teams against pushy customers. A maintained, tuned, highly available postgres is just a ticket, documented integrations and a few days of wait away and if customers are being pushy about the nonfunctional requirements, give them our guarantees and then send them to us.
The only point I disagree with is Terraform. It is brilliant for this exact scenario because it's self documenting. When you do need to update those SPF records in two years time, having it committed as a Terraform file is much better than going through (potentially stale) markdown files. It's zero maintenance and really simple. Plus its ability to weave together different services (like configuring Fastly and Route53 from the same place) is handy, too.
What if I do this with Terraform using AWS Serverless and staying in the free tier for this workload that you are referencing instead of VMs and a load-balancer?
I just don't see why people prefer the VM based approach over serverless.
There is usually a sweet spot in terms of size where being on the public cloud makes sense, both from a cost and management perspective. Once you go above that size, managing IAM starts becoming a pain. Usually around the same point public cloud costs start becoming noticeable to your finance team, and so you have to start dealing with questions around that. Usually that's a good point to do a sanity check before things get even bigger and more expensive.
Similarly, k8s works well for certain classes of problem, but doesn't work well for all classes of problem. Any form of k8s has an operational overhead, and you really need to make sure that you are going to get the ROI from the effort of maintaining the stack for it to be worthwhile.
> They have a huge list of tools and integrations that they've tried out with crazy names; Capistrano, Chef, mrsk, Filebeat, traefik
I use a lot of this or similar (terraform instead of Chef, logstash instead of filebeat) and I'm a one person team. If anything these tools make my job a lot easier and less complex.
This is very common in almost all web companies since around 2015.
I've never seen a company with a simple infrastructure, no matter how simple their actual application is.
If you choose a slow dynamic language (Ruby/Python) your deployment has to be massively complicated; you have no choice about it.
For one simple reason: you will need a multitude of separate components to be made to work together.
You need many application instances because there's no way one machine can handle all your traffic: Ruby is just too slow.
A sharded database cluster as a source of truth:
You went through the effort of making several application instances with a load balancer: you don't want a single database server to be a single point of failure.
A distributed redis/memcache index to accelerate queries and lower the pressure on the real database.
You might have several index-type engines for different types of queries. Most people use ElasticSearch in addition to Redis.
You need some system to manage all this complexity: monitor it, deploy new versions, rollback to a previous version, run migrations and monitor their state, etc etc.
This is the bare minimum. Most people have a setup that is way more complicated than this. I don't even know how they come up with these complexities, but not only do they come up with them frequently: they love it! To them it's a feature, not a bug.
You are making a lot of assumptions, and many of those are not universal problems, or even problems at all.
Compiled languages eventually need a complicated setup for the very same reasons. There is no such thing as "scales" and "doesn't scale". Even Go or C++ webapps have to be scaled up.
If you can get away without complexity on Go or whatever, good for you. Most companies don't.
It's way too complicated. But if this is all you have ever seen, and if you've been designing such systems for a decade, this seems normal to you.
Here's an alternative stack that can handle over 99% of websites:
- Self contained executable
- One-file database
- Cache is memory
- Text search is a library function
- Indexing is a library function
- Serving http is a library function
Such a stack can handle > 20k concurrent connections (per second). The code doesn't need to be "optimized"; just non-pessimized.
You can scale "vertically" for a very long time, especially in 2020 and beyond, where you have machines with over 64 CPU cores. That's almost a cluster of 64 machines, except in one single machine.
If you _must_ scale horizontally for some reason - maybe you are Twitter/Facebook/Google - then you can still retain the basic architecture of a single executable but allow it to have "peers" and allow data to be sharded/replicated across peers.
Again all the coordination you need is just a library that you embed in the program; not some external monster tool like k8s.
1) a single panic/exception/segfault in the executable brings down the whole website and so it will be unavailable until the executable restarts
2) entropy *always* increases (RAM usage, memory corruption, hardware issues, OS misconfiguration etc.) so eventually the application will break and stop serving traffic until it's repaired/restarted (which can take time if it's a hardware issue)
3) deployments are tricky if there's nothing before the executable (stop, update, restart => downtime)
4) if cache is in-process, on a restart it will have to be repopulated from scratch, leading to temporary slowdowns (+ and maybe a thundering herd problem) which will happen *every time* you deploy an update
I think much of it is ignoreable if the site is just a personal blog or a static site. But if the site is a real time "web application" which people rely on for work, then you still need:
1) some kind of containerization, to deal with inevitable entropy (when a container is restarted, everything is back to the initial clean state)
2) at least two instances of the application: one instance crashes => the second one picks up traffic; or during rolling updates: while one instance is being killed and replaced with a new version, traffic is routed to another instance
3) persistent data (and sometimes caches) need to be replicated (and backed up) -- we've had many hardware issues corrupting DBs
4) automatic failover to a different machine in case the machine is dead beyond repair
>not some external monster tool like k8s
What can you use instead of k8s for this kind of scenario? (an ultra reliable setup which doesn't need a whole cluster)
It seems to me that people tend to vastly overestimate their uptime requirements. "Real time 'web application'" used by hundreds of millions of people can be down for hours and yet succeed wildly, just look at Twitter, both its old failwhale and new post-Musk fragile state. Complexity, on the other hand, and thus lower iteration speed and higher fixed costs can kill a business much easier than a few seconds of downtime here and there.
You don't need an "ultra reliable setup" or even a "cluster". You can have one nginx as a load balancer pointing at your unicorn/gunicorn/go thing, it's very unlikely to ever go down. You can run a cronjob with pgdump and rsync, in an off chance your server dies irrecoverably corrupting the DB (which is really unlikely for Postgres), chances are your business will survive fifteen minutes old database.
Most "realtime web applications" are not aerospace, even though we like to pretend that's what we work on. It's an interesting confluence of engineering hubris and managerial FOMO that got us here.
> It seems to me that people tend to vastly overestimate their uptime requirements. "Real time 'web application'" used by hundreds of millions of people can be down for hours and yet succeed wildly
That may be true for social media apps where the Terms of Service don't include any SLAs/SLOs, but if you're a SaaS company of any kind, the agreements with clients often include uptime requirements. Their engineers will often consider some form of "x number of nines" industry standard.
In the projects I work on, things go down all the time, for various reasons (hardware issues, networking problems, cascading programming errors). It's the various additional measures we have put in place which prevent us from having frequent outages... Before the current system was adopted, poor stability of our platform was one of the main complaints.
I agree that for many projects it may be an overkill.
Networking issues and even hardware issues are very unlikely if you can fit everything into one box, and you can get a lot in one box nowadays (TB+ RAM, 128+ core servers are now commodity). MTBF on servers is on the order of years, so hardware failure is genuinely rare until you get too many servers into one distributed system. And even then, two identical boxes (instead of binpacking into a cluster, increasing failure probability) go a very long way.
It's a vicious circle. We build distributed multi-node systems, overlay software-configured networks, self-healing clusters, separate distributed control planes, split everything into microservices, but it all makes systems more fragile unless enough effort is spent on supporting all that infrastructure. Google might not have a choice to scale vertically, but the overwhelming majority of companies do. Hell, even StackOverflow still scales vertically after all these years! I know startups with no customers who use more servers than StackOverflow does.
If there's a bug that brings the server down, it will happen in all instances and repeatedly, no matter how many times you restart. Especially when the users keep repeating the action that triggered the crash.
Re: Entropy. Entropy increases with complex setup. The whole point of not having a complex setup is to reduce entropy and make the system as whole more predictable.
Re: caches. There are two types of caches: indices that are persisted with the database, and LRU caches in memory. LRU caches are always built on demand so this is not even a problem.
Plus modern CPUs are incredibly fast and can process several GBs of data per second. Even in the worst cases, you should be able to rebuild all your caches in a second.
>If there's a bug that brings the server down, it will happen in all instances and repeatedly no matter how many times you restart.
Not necessarily so. Many bugs are pretty rare bugs which are triggered only under specific conditions (a user, or the system, must do X, Y, Z at the right moment). So it doesn't happen all the time. But when it happens, the whole server crashes or starts behaving in a funky way and other users are affected. Sure, you may say if it's a rare bug, then users will be rarely affected. But we don't have a single bug like that; there are always N such bugs lurking around (we never know how many of them in a large application); multiply it by N bugs and you have server crashes for different reasons quite often, making your paying customers dissatisfied. It also assumes you can fix such a bug immediately, while that's not always true; there are often Heisenbugs that take weeks to root out and fix, while your customers are affected (sure, the application will restart, but ALL users (not just the one who triggered the bug) can lose work and get random errors when the app is not available -- not a good experience). So having several app instances for backup allows us to soften such blows, because there will always be at least one app instance which is available.
>Entropy increases with complex setup. The whole point of not having a complex setup is to reduce entropy and make the system as whole more predictable
I agree that entropy increases with a complex setup, but there's also base entropy which accumulates simply because of time (which I think is more dangerous). Make a sufficient number of changes to the setup of your application (which you often need if you release often) and eventually someone or something somewhere will make a mistake or expose a bug somewhere, and you will need to repair it, and you won't be able to do it easily because your setup is not containerized, which would allow you to return to a clean state quite easily with no effort. We've had issues like that with our non-containerized deployments and it's a very complex and error-prone undertaking to do it flawlessly (no downtime or regressions) compared to containerized deployments.
>Plus modern CPUs are incredibly fast and can process several GBs of data per second. Even in the worst cases, you should be able to rebuild all your caches in a second
Hm, usually caches are placed in front of disk-based DB's to speed up I/O, i.e. it's not a matter of slow CPU's, it's a matter of slow I/O. Rebuilding everything which is in the caches from DB sources is not super fast.
> and you will need to repair it and you won't be able do it easily because your setup is not containerized which would allow to return to the clean state quite easily with no effort.
Automated deployment including server bringup is orthogonal to using containers or hot failover. For example at $WORK we're deploying Unreal applications to bare metal windows machines without using containers because windows containers aren't as frictionless as linux ones and the required GPU access complicates things further.
Upfront customer requirements often say they want >99.5% uptime (which allows for 3.5h downtime a month anyway) or some such. In practice B2B customers often don't care much if hour-long downtimes happen every week during off-hours. Sometimes they're even ok when it gets taken down over a whole weekend.
Things serving the general public have different requirements but even they have their activity dips during the late night where business impact of maintenance is much lower.
> 2) entropy *always* increases (RAM usage, memory corruption, hardware issues, OS misconfiguration etc.) so eventually the application will break and stop serving traffic until it's repaired/restarted (which can take time if it's a hardware issue)
This is not what entropy means. Even if you constrain it to hardware, there is no reason to think that this will happen eventually, unless your timeline is significantly long.
What text search will provide me with the same features as Elasticsearch? Index time analysis, stemming, synonyms; search time expansions, prefix matching, filtering and (as a separate feature) type ahead autocomplete?
I would love to never touch another Elasticsearch cluster so this is a genuine question.
Lucene is the Java library that ES is based on. Without even having to look at it I can make the following judgement:
It should be easy to port to any language.
It's open source, and it's Java. Java has no special features that makes it impossible or particularly difficult to replicate this functionality in any other compiled language, like C, Rust, Go, or any other language that is not 100x wasteful of system resources.
Based on, but Elasticsearch is not just a server wrapped around the library. Features ES has are not in Lucene, otherwise anyone could release a competitor by wrapping the library.
> It should be easy to port to any language.
You win the "Most Hacker News comment of March 2023" award. This thread is talking about less effort, and you bring up porting Lucene to another programming language.
> Based on, but Elasticsearch is not just a server wrapped around the library. Features ES has are not in Lucene, otherwise anyone could release a competitor by wrapping the library.
Go is not less wasteful than Java; both are garbage collected and their memory pressure depends highly on the given workload and the runtime of the program. But Java allows more GC tuning and even different GCs for different use cases (i.e.: Shenandoah and ZGC favor very low latency workloads, while the default G1GC favors throughput (not that simple, but you get the point))
Regardless, Java/Go tier of performance is good enough for this kind of thing.
Problem is it doesn't support HA. You're stuck on that single server model. Upgrades always = downtime = painful. You're also missing things like self-healing, and your Lucene index can get corrupted.
Real world experience says better to move away from it e.g. lots of self-hosted Atlassian instances over the years. Lucene was a major pain point.
Thanks for the reminder. Manticoresearch is an alternative I haven't tried yet. I tried the hip alternatives (Meilisearch, Typesense) in autumn 2022 and both were severely lacking for CRM workloads compared with ES.
You can always put an LRU cache between you and SQLite.
I personally moved from SQLite to a B-Tree based key-value store, and most requests can be serviced in ~500us (that is microseconds). I don't mean a request to the database: I mean a request from the client side that queries the database and fetches multiple objects from it. In my SQLite setup, the same kind of query would take 10ms (that is 20x the time) even _with_ accelerator structures that minimize hits to the DB.
But you can always scale up vertically. You can pay $240/mo for 8 vCPUs with 32GB of RAM. Much cheaper than you would pay for an elastic cloud cluster.
500us is slow. This kind of performance does not remotely obsolete an LRU cache (main memory access is ~5000X faster).
500us is essentially intra-datacenter latency. Obviously your data is in memory on the B-Tree server as there is no room in this budget for disk IO. Postgres will perform just as well if data is in memory hitting a hash index (even B-Tree probably). I don't think the B-Tree key-value store you mention is adding much. Use Redis or even just Postgres.
When you say text indexing and serving http are library functions, what do you mean? Also, is the language here go or what? Since you said python is too slow and then necessitates all the infra to manage it.
Go or any language that actually gets compiled down to machine code to get executed directly on the hardware, and where libraries are compiled into the final product.
When I say something is a library function, I mean you just compile it into your code. In your code, you just call the function.
This is in contrast to the current de facto practice of making an http request to ask another program (potentially on a different machine) to do the work.
Sometimes I think, maybe our complex cluster which runs PHP software (load balancer, app instances, cache etc.) can be replaced with a single performant machine running something like Rust
It can. You don't even need to go all the way to Rust. I'm doing it with Go, which has a GC and a runtime. A single executable on a single machine can handle millions of users per month.
Each of these "flip flops" probably lasted a good deal longer than the median 20+ person startup, so that seems pretty facile. But the parallel with CoffeeScript seems valid --- people on message boards are really not OK with nonstandard languages, and are never less happy than when a company they've heard something about does actual computer science of any kind. See, for instance, Fog Creek and Wasabi.
Skimming the thread here, it seems like there's some confusion about the goals:
* They've decided to move from EKS to on-prem largely because of cost. That's logical: almost by definition, it costs more to run workloads on cloud machines than on your own hardware. You can't address that problem by moving from EKS back to ECS, like one commenter suggested.
* They've decided to move from K8s to mrsk, a system they developed. They're fuzzier about why they did that, but the two fairly clear claims they made: (1) their deployments under K8s are a lot more complicated, and (2) they slashed their deploy times (because a great deal of their infra is now statically defined).
I feel like there's more productive debating to be done about K8s vs mrsk than there is about EKS vs. mrsk. By all means, make the case that applications like Tadalist are best run on K8s rather than a set of conventions around bare-metal container/VMs (which is all mrsk is).
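For anyone who hasn't looked at mrsk: the whole "set of conventions" is roughly a single YAML file like the sketch below (paraphrased from the project's README at the time; exact keys may have shifted since), plus commands that build, push, and swap containers over SSH:

service: myapp                   # illustrative values throughout
image: myorg/myapp
servers:
  - 192.168.0.1
  - 192.168.0.2
registry:
  username: myorg
  password:
    - MRSK_REGISTRY_PASSWORD     # read from the environment
env:
  secret:
    - RAILS_MASTER_KEY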
Yeah, I would love to hear more about why they decided not to go with on prem k8s... the other arguments made logical sense to me, but they don't explain the reasoning for mrsk very well.
Every company that I have been at that uses k8s at scale ends up having an internal team to manage the complexities and build internal tooling to make it work. It sounds like they left behind a lot of the cruft and just built a tool that does the one thing most people want: put a container on a VM and call it good.
That's the thing. On-prem K8s doesn't mean deploying a vanilla Kubernetes using instructions from kubernetes.io. There is an entire industry of proprietary solutions for running Kubernetes on-prem. RedHat Openshift, Rancher, Pivotal PKS, VMWare Tanzu come to mind.
I don't know when they decided to do that transition, but back when I tried Rancher a few years ago (when they were transitioning from Rancher 1.x to 2.x) it was a real bug festival. I think the only robust solution at the time was OpenShift, which was, well, k8s without being vanilla k8s.
Also, most tools that were built to manage k8s clusters were nice for deploying a new cluster, not so much for upgrading one, so you would have to create new clusters every time you wanted an upgrade. That can scale when clusters and blast radius are small, but can be complicated when it involves contributions from n teams.
For this reason, when we were managing our own k8s cluster on-prem, we were using kubespray, which worked, but upgrades were a multi-hour affair.
That's a real good point you mentioned: k8s ecosystem is super young.
And so so much changed in the last 4 years.
But at least for me, the 'easy to use' threshold happened somewhere around 2-3 years ago.
And Gardener, for example, upgrades quite well.
Rke2 is quite stable for me but rancher integration is still not perfect.
But even doing k8s by hand with Ansible was already doable 3 years ago. That's how I started, and I had it up and running. I switched to rke2 because I realized that this would not be sustainable / was not worth doing myself at this level.
I haven't used k8s in quite a few years, what would you recommend I look at these days to get a good overview and understand all the different pieces in the ecosystem?
> By all means, make the case that applications like Tadalist are best run on K8s rather than a set of conventions around bare-metal container/VMs (which is all mrsk is).
Okay sure I'll bite. An application like Tadalist is best run on k8s.
With any application regardless of how it runs, you generally want at least:
- zero-downtime deploys
- health checks
- load balancing
Google's GKE is like $75/mo, and the free tier is one cluster, which is enough. For nodes, pick something reasonable. We're naive so we pick us-west1 and a cheap SKU with 2 vCPUs 8 GB is ~$30/mo after discounts. We're scrappy so we eschew multiple zones (it's not like the nearby colo is any better) so let's grab two of these at most. Now we're in $60/mo. We could go cheaper if we want.
We've click-opsed our way here in about 25 minutes. The cluster is up and ready for action.
I write a Dockerfile, push my container, install k3d locally, write about 200 lines of painstaking YAML that I mostly cribbed off of stack overflow, and slam that through kubectl into my local cluster. Everything is coming up roses, so I kubectl apply to my GKE cluster. My app is now live and I crack open a beer. Today was a good day.
Later, whilst inebriated from celebration, I make some changes to the app and deploy live because I'm a cowboy. Oops, the app fails to start but that's okay, the deployment rolls back. No one notices.
The next day my app hits the front page of HN and falls over. I edit my YAML and change a 2 to a 10 and everything is good again.
Things I did not need to care about:
- permissions (I just used the defaults and granted everything via clickops)
- ssh keys (what's ssh?)
- Linux distributions or VM images (the Goog takes care of everything for me, I sleep better knowing I'll wake up to patched OS)
- passwords
- networking, VIPs, top of rack switches, hosting contracts, Dell, racking and stacking, parking, using my phone
And all without breaking the bank.
---
Okay so I cheated, you weren't looking for a GKE vs on-prem/Colo case. You asked
> make the case that applications like Tadalist are best run on K8s rather than a set of conventions around bare-metal container/VMs
to which I say: that's all kubernetes is.
Did you even read their blog post? virtio? F5? MySQL replication?? How is this a good use of engineering time? How is this cost efficient? On what planet is running your own metal a good use of time or money or any other godforsaken resource. They're not even 40 people for crying out loud. It's not like they're, say, Fly.io and trying to host arbitrary workloads for customers. They're literally serving rails apps.
Want to start small with k8s? Throw k3s or k3d on a budget VPS of your choosing. Be surprised when you can serve production traffic on a $20 Kubernetes cluster.
If you care about Linux distributions, and care about networking, and care about database replication, and care about KVM, and care about aggregating syslogs, and love to react to CVEs and patch things, and if it's a good use of your time, then sure do what 37signals did here. But I'm not sure what that planet is. It's certainly not the one I live on today. 10-15 years ago? Sure. But no longer.
I can't believe just how ridiculous this entire article is. I want to find quotes to cherry pick but the entire thing is lunacy. You can do so so much on a cloud provider before approaching the cost of even a single 48U in a managed space.
At some scale it makes sense, but not their scale. If I never have to deal with iDRAC again it'll be too soon.
You have a horse in this race: apps like Tadalist are best run on something like Fly or knative/Cloud Run or Heroku rest in peace. But a set of conventions around bare-metal containers/VMs? Give me a break.
I don't think you intended it, but I find it disingenuous to separate cloud hosting and kubernetes. The two are connected. The entire premise is that it should be a set of portable conventions. I can run things on my laptop or desktop or raspberry pi or $10/mo budget VPS or GCP or AWS or Azure or Linode or, god willing, a bunch of bare-metal in a colo. It's fundamentally a powerful abstraction. In isolation it makes little sense, which TFA handily demonstrates. If you eschew the conventions, it's not like the problems go away. You just have to solve them yourself. This is all just NIH syndrome, clear as day.
Forgive the long winded rant, it's been a long day.
Agree, I would never want to go back to the old bad days of managing a real rack at a datacenter, with exactly the same guarantees of a single region deployment inside any cloud.
BUT it is true that all the multi region/AZ guarantees + logs + dashboards + network services @ AWS costs tend to skyrocket in a couple of years.
And here is where k8s really shines, in my opinion: allowing you to abstract your deployment away from a cloud even on cheap hosting.
All the rest outlined in the article is just reinventing the wheel.
Usually the engineers that manage the racked stuff in a datacenter aren't the same ones that deploy the apps.
Last time I was working on-prem, we would just buy a new 2U hypervisor server once in a while. Apps were all running on VMs anyway, so the complexity was not seen by the same people. Storage was a multi-year deal. The biggest issue was storage estimation and paying from day 1 for storage that would only be fully used in year 5. But I don't think it was that expensive, just accounting gymnastics compared to a pay-as-you-go system. And hyperconvergence was kind of meant to solve that, although I didn't really have the chance to experiment with it in virtualized environments on-prem.
Who's gonna do the rearchitecting work? Are you hiring a whole new team or do you not need to keep the lights on while you're transitioning? Depending on the complexity of your application that rearchitecting is gonna eat up a ton of your cost savings.
> almost by definition, it costs more to run workloads on cloud machines than on your own hardware.
Why should that be so? I'd expect the all-in cost of a cloud machine to be less than my own hardware, for the same reason that buying electricity from the grid is usually cheaper than generating it on-prem.
> You can't address that problem by moving from EKS back to ECS, like one commenter suggested.
If EKS is more expensive (because it's something they see as a value-add) whereas ECS is a commodity service at commodity prices, then moving there could well solve the cost issue.
Wouldn't the cost of cooking be higher depending on who you are? If one could spend those few hours doing something with a higher ROI than the money saved over pre-cooked food, then you are actually losing money by cooking your own food.
Beyond a certain scale, sure. But at small scale, you can completely avoid hiring an ops team, or hire a much smaller one, which can more than offset the cloud provider price premium.
My current company works in a niche market with a smallish number of large customers, so our scaling needs are modest. Our total AWS bill is about a third the annual salary of a single ops person.
There's gotta be a very long tail of companies like mine for whom outsourcing to cloud vendors is cheaper than self-hosting.
Depends on the industry and the barrier to entry. If you're in one with a lot of compliance overhead, you're outsourcing a lot more than compute and storage to your cloud provider. Hiring in-house in that same case is extremely expensive unless you are over a certain size.
This article seems written by someone who gets excited by shiny objects / hype trains.
> Why should that be so? I'd expect the all-in cost of a cloud machine to be less than my own hardware
Because cloud hardware doesn't come with all the burdens of physically managing a real server: replacing SSDs, upgrading RAM, logging into an iDRAC to restart a crashed server. None of those things exist in the cloud, and they make you lose so much operational time. That's why cloud will ALWAYS cost more than bare metal. The downside is that with cloud you keep paying for the same servers: there are no assets anymore, only costs.
Not to mention keeping spare parts around for when something breaks, or having to drive out to the DC to fix/replace the thing that broke or won't restart. Hell, even something "simple" like managing the warranties for the gear you have is no fun at all. People tend to forget all those little things when espousing the evils of the cloud, but I'm here to tell you that they all add up and they are all a major pain in the butt. Cloud gets rid of all that.
There are also discussions around CapEx versus OpEx that apply here, and depreciating costs over time. There is a trade-off of agility, cost, and maintenance, but the markup on cloud is quite high.
The major determinant in hosting cost isn't power, it's the cost of the hardware. But I mean, even if you don't buy my axiomatic derivation, you can just work this out from AWS and GCP pricing.
I always saw it being close to 7:3, non-recurring hardware cost to monthly recurring facilities & power, on a 3-year depreciation in major markets (rough back-of-the-envelope below).
That said, all of the big cloud providers SHOULD have a structural advantage on all of those dimensions. None of the small players or self-hosting shops are doing the volume, much less the original R&D, of the big cloud providers. The size of that discount, and how costly it really is to achieve, is another topic.
Disclosure: principal at AWS. Above information is my personal opinion based on general experience of 20 years in the industry doing networking, compute farms, and operations.
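For concreteness, a rough back-of-the-envelope for that 7:3 split in Python (every number here is invented purely for illustration; only the ratio is the point):

    # One-time hardware cost amortized over 3 years vs. monthly recurring
    # facilities & power, per server. All figures are hypothetical.
    server_capex = 14_000        # USD, one-time per server (made up)
    amortization_months = 36     # 3-year depreciation, as in the comment above
    facilities_per_month = 170   # USD/month for space, power, cooling (made up)

    hw_monthly = server_capex / amortization_months
    total = hw_monthly + facilities_per_month
    print(f"hardware share:   {hw_monthly / total:.0%}")            # roughly 70%
    print(f"facilities share: {facilities_per_month / total:.0%}")  # roughly 30%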
Even if [0] cloud does have a structural advantage, it's clear that cloud vendors aren't willing to pass it on to customers, and they tend to nickel-and-dime on other necessities like the infamous bandwidth costs.
[0] I'm really curious how big, if any, the structural advantage of a large cloud vendor is over a small-time colo user, because surely cloud comes with all kinds of overhead? All the fancy features AWS provides can't be free. If a customer doesn't care for those, would colo, or a small "vps" vendor, actually have a structural advantage over AWS?
The comments in this thread are quite eye-opening.
It really shows what a sacred cow k8s and cloud have become.
I’m not much of an ops person so I’m not qualified to comment on what 37 signals has created. But I will say I’m glad to see honest discussion around the costs of choosing k8s for everything even if it has significant mindshare.
Perhaps this is the endgame of resume-driven development: cargo culted complexity and everyone using the same tech for similar-ish problems and then wondering why it’s so hard to stand out from both a product and an employee perspective.
Some people are really good at writing software, other people are really good at running systems. k8s/cloud allowed the former to pretend to be good at the latter.
k8s is misunderstood. Everyone focuses on the complexity/over-engineering/etc arguments when those really don't matter in the grand scheme of things.
It's not about any of that, it's about having a consistent API and deployment target that properly delineates responsibilities.
The value of that then depends on how many things you are running and how many stakeholders you have taking part in that dance. If the answers to both of those are small, then k8s's value is small; if the answer to either of those is high, then the value is high.
i.e. k8s is about organisational value; its technical merits are mostly secondary.
The "it's too complex" argument usually reflects more on the commenter than on kubernetes itself. It's actually one of the most very straight forward and thoughtfully designed platforms I've ever worked with.
What I've found in my experience is that applications in general are complex -- more complex than people assume -- but the imperative style of provisioning seems to hide it away, and not in a good way. The inherent complexity hides behind layers of iterative, mutating actions where any one step seems "simple", but the whole increasingly gets lost in the entropic background, and in the end the system gets more and more difficult to _actually_ understand and reproduce.
Tools like ansible and terraform and kubernetes have been attempts to get towards more definition, better consistency, _away_ from the imperative. Even though an individual step under the hood may be imperative, the goal is always toward eventual consistency, which, really, only kubernetes truly achieves. By contrast, MRSK feels like it's subtly turning that arrow back around in the wrong direction.
I'm sure it was fun to build, but one could have spent 1% of that time getting to understand the "complexity" of kubernetes - which, by the way, quickly disappears once it's understood. Understandably, though, that would feel like a defeat to someone who truly enjoys building new systems from scratch (and we need those people).
You've hit the nail on the head. Ten thousand simple, bespoke, hand-crafted tools have the same complexity as one tool with ten thousand facets. The real velocity gained is that this one tool with ten thousand facets is mass produced, and in use widely, with a large set of diverse users.
I don't know a single person who's been responsible for infra-as-code in chef/terraform/ansible who isn't more or less in love with Kubernetes (once they get over the learning curve). Everyone who says "it's too complex" bears a striking resemblance to those developers who happily throw code over the wall into production, where it's someone else's issue.
> Understandably, though, that would feel like a defeat to someone who truly enjoys building new systems from scratch (and we need those people).
Exactly. Building new systems from scratch is tons of fun! It's just not necessarily the right business move, unless the goal was to get the front-page of HN, that is.
I've been using Nomad for about 5 months now, and couldn't disagree more. K8s is better documented, with far less glue, and far more new-hire developers are familiar with K8s compared to Nomad. Nomad-autoscaler alone is becoming a decent reason not to use Nomad. The number of abandoned issues on the various githubs is another. That Vault is a first-class citizen of K8s and a red-headed-stepchild of Nomad is another.
I do agree about Helm tho, I avoid it as much as possible.
I hate kubernetes as much as anyone, but building your own container orchestration platform so that you can deploy a handful of CRUD webapps sounds a lot more like resume-driven development than using a well-known and standard (if somewhat overengineered) solution.
I don't think the authors care about their resumes at this point. There are rational reasons to use a static scheduling regime and a set of conventions around deployment and support services rather than a dynamic scheduler. If it were me, I'd build this with Nomad, but I can imagine not wanting to manage a dynamic scheduler when your workloads are as predictable as theirs are --- you make their case for them when you point out that they just have a "handful of CRUD apps".
What is there really? There is docker swarm, which doesn't seem to be really further developed, and... what else?
This whole space seems to be neglected since cloud providers are trying to sell k8s to big-company "devops" guys, but old-school sysadmins don't even know what docker is. Any development in this area is very welcome.
> Perhaps this is the endgame of resume-driven development: cargo culted complexity and everyone using the same tech for similar-ish problems and then wondering why it’s so hard to stand out from both a product and an employee perspective.
Spot on. Tech is a fashion industry and most people just follow trends. I still sometimes wonder if people are playing the elaborate long-term resume-optimisation game, or if they don't value simplicity highly enough to optimise for it, because the downsides are externalised.
k8s folks get paid big money to keep it running. Not surprised by the comments here at all. As the saying goes, "in complexity, there is opportunity." and the k8s devops team is milking it hard.
Only one sentence about why they chose to abandon K8s:
> It all sounded like a win-win situation, but [on-prem kubernetes] turned out to be a very expensive and operationally complicated idea, so we had to get back to the drawing board pretty soon.
It was very expensive and operationally complicated to self-host k8s, so they decided to build their own orchestration tooling? The fact that this bit isn't even remotely fleshed out sort of undercuts their main argument.
We are talking about 37Signals here. This is the company that, when faced with the problem of making a shared to-do list application, created Ruby on Rails. And when they decided to write up their remote working policy, published a New York Times bestselling business book.
This is not a company that merely shaves its Yaks. It offers a full menu of Yak barber services, and then launches a line of successful Yak grooming products.
The article seems to provide evidence for the claim that a dispute within the company over the messaging from leadership led to 1/3 of the staff leaving. I provided it without comment.
Do you believe that a significant proportion of the staff did not quit? Do you have an alternative source that provides evidence for that version of events?
announced their intention to leave... to the company... in response to the company making an open offer to people of terms for them to leave.
That seems like a slightly different prior, in terms of our Bayesian assessment of the probability that those people remained employed at the company afterwards, than your hypothetical engagement to Ms Johannsen.
So strange to white-knight a company and attempt to deny something that happened pretty publicly...
> As a result of the recent changes at Basecamp, today is my last day at the company. I joined over 15 years ago as a junior programmer and I’ve been involved with nearly every product launch there since 2006.
> So strange to white-knight a company and attempt to deny something that happened pretty publicly...
It was just skepticism from seeing these sorts of claims over the years. Half of Hollywood would be in Canada if people really followed up on those. At some point it became acceptable to make these sorts of claims with no intention of following up.
I guess quitting your job in the hottest tech market of all time is a little different than moving to a different country.
> Last week was terrible. We started with policy changes that felt simple, reasonable, and principled, and it blew things up internally in ways we never anticipated. David and I completely own the consequences, and we're sorry. We have a lot to learn and reflect on, and we will. The new policies stand, but we have some refining and clarifying to do.
They seem to have lost their touch though. I think they peaked with Remote.
After typing that I found that they renamed from Basecamp Inc. back to 37signals and their website is trying to hearken to their past. https://en.wikipedia.org/wiki/37signals
You could just look this up. They renamed to Basecamp because they decided to be a single-product company (at the same time, they divested Highrise and Campfire). Six years later, they launched HEY, their email product, so "Basecamp" stopped making sense as a name. They wrote a post about this last year.
later
I added "six years later", but I don't think it changes the meaning of what I wrote originally.
Sometimes there's value in building bespoke solutions. If you don't need many of the features of the off-the-shelf solution, and find the complexity overwhelming and the knowledge and operational costs too high, then building a purpose-built solution to fit your use case exactly can be very beneficial.
You do need lots of expertise and relatively simple applications to replace something like k8s, but 37signals seems up to the task, and judging by the article, they picked their least critical apps to start with. It sounds like a success story so far. Kudos to them for releasing MRSK, it definitely looks interesting.
As a side note, I've become disgruntled at k8s becoming the defacto standard for deploying services at scale. We need different approaches to container orchestration, that do things differently (perhaps even rethinking containers!), and focus on simplicity and usability instead of just hyper scalability, which many projects don't need.
I was a fan of Docker Swarm for a long time, and still use it at home, but I wouldn't dare recommend it professionally anymore. Especially with the current way Docker Inc. is managed.
I think people overindex on thinking that Kubernetes is about scalability.
Honestly, its inbuilt horizontal scaling systems are pretty lacking. Scaling is not actually K8s's strong suit - sure, you can make it scale, but that takes effort and customization.
But what K8s, at base, is actually useful for is availability.
You tell K8s how many instances of a thing to run; it runs them; if any of them stop running, it detects that and tries to fix it.
When you want to deploy a new version, it replaces the old instances with new ones, while ensuring traffic still gets served.
And it does all of this over a substrate of shared underlying server nodes, in such a way that if any of those servers goes down, it will redistribute workloads to compensate.
All of that is useful even if you don't care about scale.
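To make that concrete, here's a toy Python sketch of the control-loop idea described above. It is not Kubernetes code (a real controller watches the API server and drives a container runtime), just the shape of the logic: compare desired state to observed state and converge.

    import time

    desired = {"web": 3, "worker": 2}   # what we asked for
    running = {"web": 1, "worker": 2}   # what the "cluster" currently has (simulated)

    def reconcile_once() -> None:
        for app, want in desired.items():
            have = running.get(app, 0)
            if have < want:
                print(f"{app}: {have}/{want} running, starting {want - have}")
                running[app] = want      # stand-in for launching containers
            elif have > want:
                print(f"{app}: {have}/{want} running, stopping {have - want}")
                running[app] = want      # stand-in for stopping containers

    if __name__ == "__main__":
        for _ in range(3):               # a real control loop runs forever
            reconcile_once()
            time.sleep(1)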
> simplicity and usability instead of just hyper scalability
This is such a key phrase here.
If I'm starting a small SaaS company tomorrow, my ideal for setting up infrastructure would be a stack which can for now look similar to what this article sets up (especially with the tremendously lower bills), but with an easy migration path to k8s, should I hit the jackpot and have that 'very good problem to have' of too many customer requests to handle.
My big issue with k8s, and honestly with other big fancy toolsets, is that getting started with it requires you to choose between:
- Hire several seasoned cloud orchestration experts, preferably with the platform you've chosen (AWS, GCP, Azure) who will know how to troubleshoot this beast when you have a mysterious issue, or:
- YOLO it! Just follow the basic tutorials to set k8s up, and hope you don't end up sitting up all night with a site that's refusing connections while your customers flee.
The first one is the only responsible choice but it's going to add another half million to your cash burn, and that's on top of the high-margin "managed" service cloud bills like RDS.
So I can see why people are drawn to a system where instead of paying for k8s and "Postgres in a box" they can pay for a simple server and have simple tooling to deploy, back up, etc.
That's not a great comparison, but it works in a sense. Not all languages and applications benefit from pointers.
The issue is not about k8s being hard. Yes, it has a steep learning curve, but many technologies do. The issue is that learning all of its intricacies, and maintaining it and the services that run on top of it, requires valuable resources many companies don't have, especially early on. And the benefits it provides are, for the most part, not needed early on in a project's lifecycle, and often never. In financial terms, it's very high risk, with low ROI.
If there's a solution that lowers the investment and maintenance costs, while being valuable in the short and long term, then that's generally a more favorable solution for most projects that don't operate at Google's scale.
There is the learning curve, which can be challenging for organizations that aren’t experienced or exposed to scale and performance expectations. When a company moves away from being insular & proprietary to using open source there is a period of churn that ripples through the deployment, implementation and day to day operations aspect of products that live either on customer premises or a cloud platform new to everyone.
There, what YOU know from experience and have evolved and worked through is unknown—because it is all new. And “training” (such as it is) is left as an exercise for each individual.
I’d expect that is the norm for the traditional non startup firms, globally.
There is a very big difference from being a user of K8s and being someone maintaining a K8s cluster.
If you are a user of K8s, then yeah, deploying apps is pretty simple most of the time.
Maintaining a K8s cluster on the other hand becomes very complex and hard the moment you have use cases that are a few steps off the happy path. The K8s documentation is not sufficient for operating a K8s cluster on your own hardware, you end up having to go spelunking in the code to see how things work (this is from experience).
Pointers are hard though, for the average programmer as is memory management.
When you transition an IT team or a customer facing product support team to DevOps, most everything appears complex if the implementation has been done by engineers new to DevOps and cloud itself. Engineers with zero background in scale out or performance for larger customers. It is a cultural/experience change that faces issues at actual deployment time.
I'm happy with my usage of k8s, but I think it's unfortunate that current container abstractions are so oriented around imperative assembly in "layers". I want a way to run NixOS in a container and have it feel first class— existing approaches either require installing everything every time with no caching, or pre-building and manually lifecycling your container (streamLayeredImage), or knowing upfront what you're going to need (Nixery).
> Especially with the current way Docker Inc. is managed.
I was reviewing GCP's Free Tier today, they have the same approach, if they need to change or drop services they agree to give 30 days notice, same as Docker did. It's probably common for other cloud companies offering free stuff as well. All the negative attention Docker received was fully and wholly undeserved.
> I was reviewing GCP's Free Tier today, they have the same approach
Google is notoriously bad about this and gets negative attention from it, so the comparison isn't favorable, and the publicity is still wholly deserved.
>> I was reviewing GCP…
> Google is notoriously bad at this…
Do you mean Google or GCP? We don't see complaints about AWS because Amazon closes Dash buttons or Spark, and Azure isn't seen in any worse light because Microsoft discontinues Skype and whatnot.
Can we name one remotely popular service of GCP that has been shut down at all?
I can't think of a single incident where GCP actually dropped a free tier; I've actually seen new free tier stuff added since the last time I looked. If you can provide some links reflecting your view that I've somehow missed along the way, it would be interesting to compare.
Until then, I maintain the Docker publicity is undeserved and if I had to guess, was brought on by podman astroturfers who have been polluting the web the past 2 years claiming how great podman is.
Yeah I'm a bit surprised to hear that. I had only heard a lot of teams giving up swarm when it was deprecated. Didn't know they just restructured the project.
> It was very expensive and operationally complicated to self-host k8s, so they decided to build their own orchestration tooling?
You are deeply misunderstanding Kubernetes if you think it's some sort of a turnkey solution that solves all your infrastructure problems. Virtually everything of value in Kubernetes isn't Kubernetes -- you have to add it on later, and manage it yourself. Container runtime? -- that's not Kubernetes. Database to store deployment info? -- that's not Kubernetes. Network setup and management? -- that's not Kubernetes. Storage setup and management? -- still not Kubernetes.
When you start using Kubernetes for real, you will end up replacing almost every component it brings by default with something else. CoreDNS? -- sucks for big systems. Volumes? You aren't going to be using volumes from local filesystem... that's insane! You'll probably set up Ceph or something like that and add some operators to help you use it. Permission management? -- Well, you are out of luck in a major way here... you have, basically, Kyverno, but it really, really sucks (and it's still not Kubernetes!).
Real-life Kubernetes deployments end up being very poorly stitched together piles of different components. So much so that you start wishing you'd never touched that thing because a huge fraction of the stuff you now need to manage is integration with Kubernetes on top of the functionality provided by these components.
> You are deeply misunderstanding Kubernetes if you think it's some sort of a turnkey solution that solves all your infrastructure problems. Virtually everything of value in Kubernetes isn't Kubernetes -- you have to add it on later, and manage it yourself. Container runtime? -- that's not Kubernetes. Database to store deployment info? -- that's not Kubernetes. Network setup and management? -- that's not Kubernetes. Storage setup and management? -- still not Kubernetes.
When you install Kubernetes, you get a container runtime. That's a distribution I guess. Part of this seems like GNU/Linux.
The other stuff you're listing isn't solved by MRSK either...
I don't know, for small scale, K8S rocks: I just fired up Kubespray and had a 20-node cluster up and running in maybe an hour, and CoreDNS hasn't given me any problems so far.
Using local volumes is actually not an insane idea if your stateful services can handle data replication themselves: many modern databases can.
Local volumes don't have a concept of quota. You cannot limit them to X bytes. So, if you give a single service a volume, it might just take the whole disk. Well, technically, it might just take the whole filesystem, which, if you have multiple disks backing a single filesystem, means it'll take all of them.
Obviously, you cannot move local volumes around.
And if you are setting up a database in Kubernetes... oh, you are in such a pit of troubles that dealing with local volumes isn't really even worth mentioning. Surprisingly, your problems don't even start with storage, they start with memory. Databases really like memory, but use it very opportunistically, and scale with load. So, when you configure your database, you tend to give it all the memory you have, but how much it actually uses depends on the load, the kinds of queries, and how well it optimizes them. Since the Kubernetes scheduler doesn't really do well with reservations, you may run into situations where your database OOMs or just slows everything down, or doesn't perform well at all...
Next comes fsync. Unlike many unsophisticated applications, databases don't like losing data. That's why they want to use fsync in some capacity. But this creates problems sharing resources, again, well beyond anything Kubernetes can help with.
Next comes provisioning of high-quality storage for databases... and storage likes to come in the form of a device, not a filesystem, but Kubernetes doesn't know how to deal with devices, so it needs help from CSIs of all sorts to do that, and depending on the technology you choose, you'll have a very immersive journey into the world of hacks and multi-thousand-page protocol descriptions telling you how to connect your storage and Kubernetes.
It might appear, at first glance, that things work well without much intervention, and there's a Helm chart for this or that provider, and it's all at the tips of your fingers... but, as it often is in the world of storage, things get extremely complicated extremely quickly in case of errors. In such situations, Kubernetes will only obscure the problem. Oh, and errors in storage don't usually happen in the next hour or day or even year after you've set it up. It hits you a few years later, once you've accumulated a ton of useful data, you've entirely forgotten how things were set up, and the folks behind Kubernetes have moved on and broken stuff.
---
So, not only do you need small scale, you also need a very short temporal scale: don't expect your Kubernetes cluster to work well after about a year of being deployed. Probably not at all after five years.
But then... if it only works at small scale and for a short time, is it really worth the trouble? I mean, Kubernetes isn't a small thing; it takes away a big constant share of your resources, which it promises to amortize with scale. You are essentially preaching the same idea as Electron-based desktop applications or Docker containers that duplicate the entire Linux user-space plus a bunch of common libraries if you aren't extremely careful with that. Doesn't it become an argument for producing hot garbage as fast as possible, so that someone else who could do a better job won't get a chance to sell their goods because they didn't have time to deliver?
Man, you really like to complicate stuff just to take a dig at K8S.
>> Local volumes don't have a concept of quota. You cannot limit them to X bytes. So, if you give a single service a volume, it might just take the whole disk.
That's why we monitor our server disk for usage.
>> Obviously, you cannot move local volumes around.
Most of the time, this is not a requirement for databases.
>> Since Kubernetes scheduler doesn't really do well with reservations, you may run into situations where your database OOMs or just slows everything down, or doesn't perform well at all
Unless it's a test cluster with constrained resources, no other services will run on database nodes, through the use of taints and tolerations. We can let the database use all the CPU and memory it wants.
>> fsync
Doesn't matter with local volume, since it's just a directory on the host system.
>> Next comes provisioning of high-quality storage for databases... and storage likes to come in the form of a device, not filesystem
We didn't deploy our databases with raw block devices, even before K8S. Using regular filesystems makes everything much simpler and we did not see any performance difference.
>> You are essentially preaching the same idea as Electron-based desktop applications or Docker containers that create a lot of duplication of entire Linux user-space + a bunch of common libraries
Yeah, no. If that's how you read it, be my guest, but don't put words into my mouth.
To be fair that served them well in the past: the reason why anyone knows about 37signals is because they reinvented the wheel back in 2004 with Rails, but what a great reinvention it was. Who knows what can come next.
Which wheel did they reinvent? Rails literally set a bunch of standards used by just about every framework today… app generators, conventions over configuration, asset pipelines, you name it.
Well, as with all homebrewed solutions, you don't know if you are reinventing the wheel until you're done. At first, it always starts with "the current solutions that are available do not fit me, but I still could use them to achieve what I want". There was nothing forcing 37signals back in 2004 to roll their own framework in order to support developing their apps, but they did anyway.
And for every Rails out there, there are thousands of internal frameworks with big ambitions that just turned out to be inferior to what's already available. You just can't know it when you start developing. It takes a bit of ego and ambition to go that path, but sometimes it pays off. And my guess is that if it paid off in the past, you're more likely to try it again.
I think what they wanted to convey was not the redundancy of 'reinventing the wheel', but the ambitious scope and from-scratch approach associated with the phrase.
Maybe 'rolled their own' or 'first invented the universe' would have been slightly better.
I dunno. I was a kid when Rails took over the world, so I couldn't even begin to tell you why it succeeded in the way that it did.
But I do feel like they probably know what they're doing enough to have a more modest version of success with this other project, i.e., meeting their own needs well without burning up too much money or time. They're still a really small, focused company, and they have a lot of relevant experience.
Well to be fair Kubernetes doesn't always pluralize the names of collections, since you can run "kubectl get deployment/myapp". You don't want to do the equivalent of "select * from user" do you? That doesn't make any sense!!! And don't translate that to "get all the records from the user table"! That's "get all the records from the users table". (Rails defaults to plural, Django to singular for table names. Not sure about the equivalent for Kubernetes but in the CLI surprisingly you can use either)
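For what it's worth, the Django side of that claim looks like this (a minimal sketch; the app and model names are made up). A model named User in an app called "accounts" gets the singular table name "accounts_user" by default, and you can opt into a Rails-style plural name explicitly:

    from django.db import models

    class User(models.Model):
        email = models.EmailField(unique=True)

        class Meta:
            # Default table name would be "accounts_user" (singular).
            # Uncomment to use a Rails-style plural name instead:
            # db_table = "users"
            pass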
To be fair, the article says that they built the bulk of the tool and did the first migration in a 6-week cycle. mrsk looks fairly straightforward, and feels like Capistrano but for containers. The first commit of mrsk is only on January 7th of this year.
> In less than a six-week cycle, we built those operational foundations, shaped mrsk to its functional form and had Tadalist running in production on our own hardware.
They spent a month and a half building tooling _capable of handling their smallest application_, representing an extremely tiny fraction of their cloud usage.
k8s is an industry standard currently, but it is not great. The lack of free/open tooling to set up and manage the cluster properly seems to indicate that it is also a way of selling cloud: if you want to use k8s, you have to go with the large cloud providers, otherwise your life will be painful.
I for one am patiently waiting for more innovation in this area and seeing that there are companies that try to disrupt/improve it makes me hopeful and I appreciate it.
k3s is lightweight; even I have clusters running, and I can easily sync them if I wish. I agree, it seems odd they didn't go with some kube design on-prem.
I'm not sure how well 37Signals is doing these days - Hey didn't make as big an impact as they had hoped and Basecamp probably has a core of loyal users but I don't think it's getting a ton of new customers. They're small and could probably keep going until their founders decide to retire though.
It does seem like they just moved all of their infra components, and got rid of autoscaling.
Load balancing, logging, and other associated components are all still there. Almost nothing changed in the actual architecture, just how it was hosted.
I have a hard time seeing why this was beneficial.
That answers my question, they can afford it if they wanted to. Obviously they don't want to. I'm in their camp when it comes to the cloud vs own hardware.
Zero, which is why we're not using k8s on-prem. Our team is already handling the on-prem hardware/software environment, and this will consolidate our apps on a single platform methodology, allowing us to keep the same team size. Using mrsk allows us to reduce the complexity of our servers, moving that into the Dockerfile.
If we had gone down the k8s on-prem rabbit-hole, I suspect we would have required more folks to manage those components and complexity.
I don't understand how having k8s means you need significantly more people.
It's just concepts put into a strict system. Now you're just shimming the same concepts with less-supported hacks. Now you have to train your team on a less-used technology that isn't transferable to other roles. Sounds like technical debt to me.
We're arguing about generic approaches and the 37Signals folks are making specific decisions about their very specific situation (their app, their staff having time or not, their budget, etc).
To be fair, they don't seem to be saying their strategy is for everybody but the audience thinks so? I think we're talking past each other, tbh.
This company invented Ruby on Rails and was in business before ‘cloud’ was a thing. Many things can be said about 37signals and DHH in particular, but lacking proper experience is definitely not one of those.
The answer to your question is because people's experiences differ, wildly in some cases.
Your account was created 18 hours ago, so I can't really see what support there is for this specific throwaway account to be declared an expert in anything. Are you a self-proclaimed expert or a world-renowned expert? Since they are a world-renowned bunch… :)
I only create my accounts adhoc because I spend too much time on discussions otherwise.
But my argument was more in sense of contradicting the original argument. No one is an expert just because.
As for myself, I'm a cloud architect at a very big company and have introduced a k8s-based platform in two projects: one internal on GKE and one in an open-source project.
Both are used by 15-20 teams.
I also run k8s at home for fun and in a small startup.
I've been doing primarily k8s for probably 5 years and was a software engineer before that.
I have been saying this to my customers for a long time, most projects do not really benefit from K8s but on the contrary, it is a huge operational/cost compromise to use K8s for a monolith app that does simple CRUD operations where occasional downtime is actually acceptable.
In my last project, I removed the unnecessary complexity that K8s was bringing and went back to ansible scripts, which has worked nicely.
With another customer, we inherited a frontend application that was being deployed with K8s while vercel is a considerably simpler/faster alternative.
K8s certainly has its advantages but I'd bet that many projects using it do not gain much.
My impression is that it makes deploying your 100th server much easier, at the cost of making your first several much harder. If you're going to have 100+ servers, that's probably worth it. If you're not (and most companies aren't), then it's like getting your CDL so that you can go to the grocery store in a semi-tractor trailer, when you should have driven there in a compact car.
This seems like an application/stack that didn't have a valid need for k8s in the first place. Don't just use K8s because it's what people say you should do. Evaluate the pros and the VERY real cons and make an informed decision.
That's why we've had good results with ECS. Feels like 80% of the result for 20% of the effort, and I haven't found our use cases needing that missing 20%.
On the Google cloud side, using Google Cloud Build with Cloud Run with automatic CI/CD is very straightforward. I set up automated builds and deploys for staging in 2 hours. For production I set it up to track tagged branches matching a regex.
We use Fargate, and what we launch is tightly coupled to our application (background jobs spin down and spin up tasks via the SDK) so for now, we aren't doing anything with IaC, other than CI deployment.
When I had to set up ECS with Fargate using CloudFormation the documentation was certainly lacking (in late 2019 I think it was).
Now that it's working it's been pretty low maintenance.
It has definitely gotten better over time, but we tend to do a lot of stuff ad-hoc that finds its way into production lol, so we aren't yet relying on any infra as code.
“Need”
Eh, I do it because it’s awesome for a single box or thousands. Single sign on, mTLS everywhere, cert-manager, BGP or L2 VIPs to any pod, etc and I can expand horizontally as needed. It’s the best for an at home lab. I pity the people who only use Proxmox.
Throughout my company’s pursuit of moving everything under the sun into AWS I have done my best to keep everything able to be migrated, we have some systems which are just, simply going to have to be completely rebuilt if we ever needed to move them off of AWS, because there is not a single component of the system that doesn’t rely on some kind of vendor lock-in system AWS provides.
I aim to keep everything I’m working on using the simplest services possible, essentially treating AWS like it’s Digital Ocean or Linode with a stupidly complex control panel. This way if we need to migrate, as long as someone can hand me a Linux VM and maybe an S3 interface we can do it.
I really just have trouble believing that everyone using Kubernetes and a bunch of infrastructure as code is truly benefiting from it. Linux sysadmin isn’t hard. Get a big server with an AMD Epyc or two and a bunch of RAM, put it in a datacenter colo, and maybe do that twice for redundancy and I almost guarantee you it can take you at least close to 9 figures revenue.
If at that point it’s not enough, congratulations you have the money to figure it out. If it’s not enough to get you to that point, perhaps you need to re-think your engineering philosophy(for example, stop putting 100 data constraints per endpoint in your python API when you have zero Postgres utilization beyond basic tables and indexes).
If you still really genuinely can’t make that setup work, then congratulations you are in the 10%(maybe) of companies that actually need everything k8s or “cloud native” solutions offer.
I would like to note that given these opinions, I do realize there are problems that need the flexibility of a platform like AWS, one that comes to mind is video game servers needing to serve very close to a high number of geographic areas for latency concerns.
> I aim to keep everything I’m working on using the simplest services possible, essentially treating AWS like it’s Digital Ocean or Linode with a stupidly complex control panel.
What's the benefit of AWS then, if you're not using any of the managed services AWS offers, and are instead treating AWS as an (overly expensive) Digital Ocean or Linode?
Wow.
"K8s is simple", it has the same vibes as Linux user vs Dropbox:
'...you can already build such a system yourself quite trivially by getting an FTP account, mounting it locally with curlftpfs, and then using SVN or CVS on the mounted filesystem'
https://news.ycombinator.com/item?id=8863
It's not that Kubernetes is simple (it's not), but Kubernetes is relatively simple compared to the task it accomplishes.
If you have containers that need to be scheduled and networked and supplied with storage and exposed to the internet across a large set of machines past the scale where you can easily do so with tools like docker-compose, Kubernetes (might) be for you. There's a good chance it will be simpler to understand and reason about than the homegrown kludge you could make to do the same thing, especially once you understand the core design around reconciliation loops.
That said, you might not need all that, and then you probably shouldn't use Kubernetes.
Tell that to the myriad of folks making their money off of peddling it. You'd swear it were the only tool available based on the hype circles (and how many hiring manager strictly look for experience with it).
I gotta say, from a dev perspective it is a very convenient solution. But I wouldn't recommend it to anyone that runs anything less complex than "a few services and a database". The tens of minutes you save writing deploy scripts will be replaced by hours of figuring out how to do it the k8s way.
From an ops perspective: let's say I ran it from scratch (as in "writing systemd units to run k8s daemons and setting up a CA to feed them", because back then there was not much reliable automation around deploying it), and the complexity tax is insane. Yeah, you can install some automation to do that, but if it ever breaks (and I've seen some breaking), good fucking luck; a non-veteran will have a better chance reinstalling it from scratch.
Except it was created to model virtually every solution to every compute need. It’s not about the compute itself, it’s about the taxonomy, composability, and verifiability of specifications which makes Kubernetes excellent substrate for nearly any computing model from the most static to the most dynamic. You find kubernetes everywhere because of how flexible it is to meet different domains. It’s the next major revolution in systems computing since Unix.
I (roughly) believe this as well[0], but more flexibility generally means more complexity. Right now, if you don't need the flexibility that k8s offers, it's probably better to use a solution with less flexibility and therefore less complexity. Maybe in a decade if k8s has eaten the world there'll be simple k8s-based solutions to most problems, but right now that's not always the case
[0] I think that in the same way that operating systems abstract physical hardware, memory management, process management, etc, k8s abstracts storage, network, compute resources, etc
Always two extremes to any debate. I've personally enjoyed my journey with it. I've even been in an anti-k8s company running bare metal on the Hashi stack (won't be running back to that anytime soon). I think the two categories I've seen work best are either something like ECS or serverless, and Kubernetes.
De-clouding is going to be a huge trend as companies are pressured to save costs, and they realize on-prem is still a fraction of the cost of comparable cloud services.
This whole cloud shift has been one of the most mind-blowing shared delusions in the industry, and I'm glad I've mostly avoided working with it outright.
The thing that gets me about it is the very real physical cost of all this cloud waste.
The big cloud providers have clear cut thousands of acres in Ohio, Northern VA, and elsewhere to build their huge windowless concrete bunkers in support of this delusion of unlimited scale.
Hopefully as the monetary costs become clear their growth will be reversed and these bunkers can be torn down
Much more than efficient. You think AWS is getting the same CPU normal civilians get? No way dude. Those guys are big enough that they can get custom hardware just for their specific needs. Their cooling systems, power systems, everything is way more efficient. And they are big enough they can afford to measure every single metric that matters and optimize every one.
For what it's worth, large providers will always need datacenters. But perhaps datacenters run by public cloud providers today will be sold off to larger businesses running their own infrastructure someday at a discount. Most of the infrastructure itself all will age out in five or ten years, and would've been replaced either way.
Heck, datacenters in Virginia are likely to end up being sold directly to the federal government.
Our firm started the big cloud initiative last year. We have our own datacenters already, but all the cool startups used cloud. Our managers figure it'll make us cool too.
This sort of thing is absolutely insane. Like, sure, small office, no existing datacenter infrastructure, it might make sense to bootstrap your business on someone else's cloud. But if you literally have a cooled room and an existing network infrastructure, it's absolutely silly to spend money on using someone else's.
Something I feel like these conversations seem to miss is that it is not binary; you don't have to host hardware on-prem if you don't want to be in AWS. There are other clouds. There are Sungards of the world where you can pay for racks of managed hardware. There are a lot of options between buying and managing your own hardware and AWS.
Good for them. Now they have a one-off to manage themselves. It’s pretty easy to de-cloud using something like k3s. So much value added in Kubernetes to leverage. But they have Chef and they’re a Ruby shop, I guess they’ll be good.
TBH, Kubernetes has some really rough edges. Helm charts aren’t that great and Kustomize gets real messy real fast.
The scope of their self-developed tool doesn't seem very large; it looks like it could be a wrapper around SSH. I've done similar things using an SSH library with Python to deploy and run docker-compose YAMLs on multiple servers.
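A minimal sketch of that pattern, assuming Fabric as the SSH library; the hostnames, user, and paths are hypothetical:

    from fabric import Connection

    HOSTS = ["app1.example.com", "app2.example.com"]   # hypothetical hosts
    REMOTE_DIR = "/srv/myapp"                          # hypothetical path

    def deploy(host: str) -> None:
        c = Connection(host, user="deploy")
        c.run(f"mkdir -p {REMOTE_DIR}")
        # Push the compose file, then pull images and (re)start the stack.
        c.put("docker-compose.yml", f"{REMOTE_DIR}/docker-compose.yml")
        c.run(f"cd {REMOTE_DIR} && docker compose pull && docker compose up -d")

    if __name__ == "__main__":
        for host in HOSTS:
            deploy(host)

Which is more or less the core of what a Capistrano-style tool does, minus rollbacks, health checks, and secrets handling.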
There are many of these tools out there. When I was working for Technicolor Virdata some years ago, we were heavily invested in https://github.com/infochimps-labs/ironfan. It was extensible, we had support for SoftLayer and IBM SCE, and we had some patches to make the bootstrap and the kick command perform faster. But it was still slow and people didn't like Ruby (I don't mind it).
Even back then I wasn’t a fan of doing a proactive ssh connection to the node. I always leaned towards the machine pulling artefacts and deploying them. Like Flux CD does.
> It also misses the entire sphere around identity and access management for those resources that also needs to be maintained
Well, how is this all solved with their new tooling? Like they describe a whole huge complicated problem space and then write a tool for the simplest part of it: deploying an app. :shrug:
We use k8s to run the app both on AWS, on our own hardware in a few datacenters (in countries with strict personal data laws) and on clients' own servers as well (something like the banking sector or a jewelry company, i.e. companies which don't trust the cloud).
From what I heard, AWS is the most stable and easiest to work with of all; the servers which run on our own hardware have more outages and our SRE team often needs to make trips in person to the datacenters to replace hardware etc. Clients' hardware is the faultiest (unsurprisingly). Ideally we'd rather host everything on AWS :)
The thing I noticed is that they are not using any other AWS services. No S3, Elasticache, DynamoDB, etc. They are just running applications and databases.
This will not be the case with many people using cloud, and a migration to bare metal will be much harder. Each of those services needs an equivalent to be deployed and managed, and its features might not be up to what the AWS equivalent has.
Even the stuff that they are moving (databases, load balancers, etc) is significant operational overhead. In AWS database fail-over is an option you tick. Self hosting has whole books written about how to do database high availability.
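To the "option you tick" point, a hedged boto3 sketch; the identifier, instance size, and credentials below are placeholders, and in practice you'd pull the password from a secrets store:

    import boto3

    rds = boto3.client("rds")
    rds.create_db_instance(
        DBInstanceIdentifier="example-db",      # placeholder name
        Engine="mysql",
        DBInstanceClass="db.m6g.large",         # placeholder size
        AllocatedStorage=100,
        MasterUsername="admin",
        MasterUserPassword="change-me",         # placeholder; use a secrets manager
        MultiAZ=True,   # the "tick": synchronous standby + automatic failover
    )

Self-hosting the equivalent means owning replication setup, failover orchestration, backups, and monitoring end to end.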
The whole kubernetes section of this writeup is two sentences. They went with a vendor provided kube & it was expensive & didn't go great.
It just sounds like it was poorly executed, mostly? There are enough blogs & YouTube videos of folk setting up HA k8s on a couple of RPis, & even the 2GB model works fine if you accept not-quite-half the RAM as overhead on the apiserver/etcd nodes.
It's not like 37signals has hundreds of teams & thousands of services to juggle, so it's not like they need a beefy control plane. I don't know what went wrong & there's no real info to guess by, but 37s seems like a semi-ideal, easy lock for k8s on-prem.
It seems like a lot of effort to do less. Hopefully it helps others too, I guess. But it feels like a problem space with a lot of inherent complexity that's liable to expand over time, & I'd be very skeptical of folks who opt to greenfield it all.
Sure, there is some inherent complexity, but by writing their own tool, they get to choose exactly how to handle the complexity for their particular use case, instead of having it dictated by a general-purpose tool developed by a consortium of US corporations. I consider that a win.
If they have the manpower and expertise to do that, more power to them!
Wow, uh, this is just such a sad short statement. It's just so woefully out of touch, so baselessly derogatory.
Kube is mostly a pretty generic idea, that greatly empowers folks to write their own stuff. There are dozens of gitops systems. There are hundreds of event-based systems. They almost all have some Custom Resources registered in API Server, but that's because it's good & doesn't encumber anyone. Beyond that it feels like the sky is the limit.
There are some deeper kube things. There's a Scheduler Framework with a big focus on extensibility and modular plugins, creating a lot of flexibility to keep this general.
This zeal, this desire to feel oppressed, this righteousness of rebellion: I wish it could also reflect on & understand options & cooperation & possibility, and see how a lot of the terrifying forces out there don't want us all consigned to narrow fixed paths. More people than you acknowledge want to potentiate & enrich. The goal of these efforts is anything but to dictate to us how we do things, and it's so easy, so simple to see that, to explore how flexible & varied & different these world-class cluster operating systems we're working on together are, how they help us accomplish many different ends, and how they help us explore new potential ends.
On one hand, yes, in theory k8s is pretty extensible. In practice, though, you always end up being forced to do things you do not want or need to do, or being prevented from doing things you want to do, because of vendor specifics. Sometimes that is an acceptable tradeoff, sometimes not.
Plus, it is always good to take a step back and appreciate that monoculture is a bad thing in computing. We always need more different approaches, viewpoints, solutions to the same problems. Should everyone roll their own? Of course not - that's why I mentioned having sufficient manpower and expertise to do that.
We should be applauding having more choices and cheering, not scolding those who strive to provide them.
As for your last paragraph, I completely agree, we need to share the knowledge and cooperate. But expecting corporations to "potentiate & enrich" us is rather naive. They will play nice only as long as they need to, and the minute their financial incentives do not align with sharing, they will do their best to pull the rug from underneath everybody else. Even their sharing phase is only to build levers to use in the future. We've seen it over and over and over for the past several decades, with Oracle, SCO, Microsoft, Apple, Google, ... heck, I could pretty much list all big companies.
So as an industry we've been having some version of this debate (at FB we were having it at least as far back as 2014, my org was IIRC the first big one to test-drive our Borg-alike container solution).
These days I think maybe it's just that classic dilemma: over-design and over-build to be ready for contingencies, or build just what we know we need and maybe get caught with our slacks down. This goes by a zillion names: WET vs DRY, YAGNI, microservice vs monolith; there are countless variations on the same core idea.
If you start with PHP and MySQL and a chain-smoking sysadmin, and you get hit with hyper-growth then you adapt or die, and you have a mountain of data to figure it out. This is paradoxically an easier decision tree (IMHO) even if maybe some of the engineering is harder or at least higher-stress.
But by far the more common case is that we're building something that isn't huge yet, and while we hope it goes huge we don't actually know if it will: should we build more features and kinda wing it on the operability/economical/automated can of worms, or should we build for the big time from day one?
I think it's a legitimately hard set of questions and reasonable people can disagree. These days I think the only way to fully screw it up is to get ideological rather than pragmatic about it.
A lot of people are kind of missing the forest for the trees here. Ignore the fact that what they're doing is probably a terrible idea for most other people. If it works for them, that's fine. It might only work for them, and that's fine.
Don't paint your bike shed orange just because somebody famous painted theirs orange. They have their reasons. Paint yours whatever color works best for you, for your own reasons.
It's anecdotal but the sentiment I have is that the Kubernetes ecosystem drains an even bigger part of the collective effort required to provide business value. I believe many engineers have a disconnect on what it means to provide real business value.
Solutions like Kubernetes are designed to be able to accommodate an endless number of scenarios out of which you probably only need 1 to provide value for your business. The consequence is that there's a disproportionate ratio of Kubernetes possibilities hence complexities vs. the simplicity of your requirements. Once your workload runs on Kubernetes, you cannot afford to ignore the complexities of Kubernetes so you are automatically sucked into the rabbit hole.
They could, but instead they're doing something closer to static scheduling. They have a small set of applications and a lot of visibility into what their needs are going to be, so the complexity of a dynamic scheduler might not pay its own freight in their environment.
I like Nomad a lot and it's what I would use if I were migrating a "halfheartedly" K8s application to on-prem metal, but I couldn't blame someone who felt burned by K8s complexity for not investing in another dynamic scheduler.
K8s, Docker and AWS/GCP/Azure are to ops what React is to web development, ie. rarely the appropriate tool for the job. Trouble is you now have a generation of devs who have no experience with anything else.
At one of my former workplaces we ran Kubernetes on premises and it worked like a charm. I still think that Kubernetes can be a good fit for microservices even if you use your own hardware.
I think it is cool they are developing new tooling. I don’t understand all the negativity. Isn’t it good that people keep innovating in this space?
Also, how is this different from deploying before k8s and TF etc. were a thing? We would write our own scripts to manage and deploy our servers. This is the same, no? Just a bit more structured, and it has a name.
37signal folks love to put a spin on anything they do as if it's ground breaking or super innovative... but it rarely is. In particular, they love to take a contrarian position. Like their books, there really isn't anything interesting written here.
I'm not going to put this down, because it sounds like they're quite happy with the results. But they haven't written about a few things that I find to be important details:
First, one of the promises of a standardized platform (be it k8s or something else) is that you don't reinvent the wheel for each application. You have one way of doing logging, one way of doing builds/deployments, etc. Now, they have two ways of doing everything (one for their k8s stuff that remains in the cloud, one for what they have migrated). And the stuff in the cloud is the mature, been-using-it-for-years stuff, and the new stuff seemingly hasn't been battle-tested beyond a couple small services.
Now that's fine, and migrating a small service and hanging the Mission Accomplished banner is a win. But it's not a win that says "we're ready to move our big, money-making services off of k8s". My suspicion is that handling the most intensive services means replacing all of the moving parts of k8s with lots of k8s-shaped things, and things which are probably less-easily glued together than k8s things are.
Another thing that strikes me is that if you look at their cloud spend [0], three of their four top services are _managed_ services. You simply will not take RDS and swap it out 1:1 for Percona MySQL, it is not the same for clusters of substance. You will not simply throw Elasticsearch at some linux boxes and get the same result as managed OpenSearch. You will not simply install redis/memcached on some servers and get elasticache. The managed services have substantial margin, but unless you have Elasticsearch experts, memcached/redis experts, and DBAs on-hand to make the thing do the stuff, you're also going to likely end up spending more than you expect to run those things on hardware you control. I don't think about SSDs or NVMe or how I'll provision new servers for a sudden traffic spike when I set up an Aurora cluster, but you can't not think about it when you're running it yourself.
Said another way, I'm curious as to how they will reduce costs AND still have equally performant/maintainable/reliable services while replacing some unit of infrastructure N with N+M (where M is the currently-managed bits). And also while not being able to just magically make more computers (or computers of a different shape) appear in their datacenter at the click of a button.
I'm also curious how they'll handle scaling. Is scaling your k8s clusters up and down in the cloud really more expensive than keeping enough machines to handle unexpected load on standby? I guess their load must be pretty consistent.
> First, one of the promises of a standardized platform (be it k8s or something else) is that you don't reinvent the wheel for each application. You have one way of doing logging, one way of doing builds/deployments, etc.
You can also hire people with direct relevant experience with these tools. You have to ramp up new developers to use the bespoke in house tooling instead.
Yes and no. Different types of memory management essentially accomplish the same thing. The way you build for them and their performance characteristics vary. In that way, scaling is the same.
But scaling is different in that your physical ability to scale up with on-prem is bounded by physically procuring/installing/running servers, whereas in the cloud that's already been done by someone else weeks or months ago. When you shut off on-prem hardware, you don't get a refund on the capex cost (you're only saving on power/cooling, maybe some wear and tear).
It's not just that you need to plan differently, it's that you need to design your systems to be less elastic. You have fixed finite resources that you cannot exceed, which means even if you have money to throw at a problem, it doesn't matter: you cannot buy your way out of a scaling problem in the short-medium term. If you run out of disk space, you're out of disk space. If you run out of servers with enough RAM for caching, you're evicting data from your cache. The systems you build need to work predictably weeks or months out, and that is a fundamentally different way of building large systems.
This is it, and what so many anti-cloud people are missing. For startups, how can you possibly take a gamble on trying to predict what your traffic is going to be and paying upfront for dedicated servers? It puts you in a lose-lose situation: if your product is not the right fit, you've got a dedicated server you're not using; if your product is a success, well, now you need to go and order another server and hope you can get it spun up before everything falls over. I worked at a startup where we saw a 1000x increase in load in a day due to a customer's app going viral. On-prem would have killed us; the cloud saved us.
And you are bang on about managed services. RDS is expensive, no doubt, but having your 4-person dev team burn through your seed round messing around with database backups and failover is a far higher cost.
Of course some companies grow out of the cloud, they have full time ops engineers and can predict traffic ahead of time - for sure, go back to on prem. But for people to hold up articles like this and say "I always said cloud was pointless!" is just absurd.
OK, if you don't want to get good at planning as a company, that's fine. It's OK, just please don't pretend that it's impossible.
I worked at a startup that did the crazy scaling with physical servers just fine. No problem. The marketing department knew ahead of time when something was likely to go viral, IT/Dev knew how much capacity was needed per user and procurement knew lead time on hardware + could keep the vendors in the loop so that hardware would be ready on short notice.
With good internal communication it really is possible to be good at capacity management and get hardware on short notice if required.
Normally we would have servers racked and ready about 2 weeks after ordering, but it could be done in under half a day if required.
Edit: (we had our own datacentre and the suppliers were in a different state)
> The marketing department knew ahead of time when something was likely to go viral
That's fine when it's your product. The situation I'm talking about was a SaaS product providing backend services for customers' apps. Our customers didn't know if their app was going to go viral, so there is no way we could have known. I maintain on-prem would have been totally inappropriate in this situation.
Also, "the marketing department knew ahead of time when something was likely to go viral"... that is quite a statement. They must have been some marketing department.
Depending on your business use-case, sharing a pool of IPs can have a detrimental impact on access. For example, you may find the prior users were doing unauthorized security scans, spamming email, or hosting proxies.
i.e. if you get an IP block with a bad reputation, then you may find running a mail or VoIP server problematic.
If you are running purely user-centric web services, then it doesn't matter as long as you are serving under around 30TiB/month.
There is also the issue of future offline decryption of sensitive records without using quantum resistant storage cryptography.
Rule #4: The first mistake in losing a marathon is moving the finish line. =)
Sounds to me like 37signals uses the risk aversion paradigm typical for stagnating businesses — instead of building and refining their strengths they're fixated on mitigating their weaknesses.
I've been following their move to on premise with interest and this was a great read. I'm curious how they are wiring up GitHub actions with their on premise deployment. How are they doing this?
The best I can think of for my own project is to run one of the self hosted GitHub actions runners on the same machine which could then run an action to trigger running the latest docker image.
Without something like that you miss the nice instant push model cloud gives you and you have to use the pull model of polling some service regularly for newer versions.
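For what it's worth, the pull model can be pretty small: a cron job on the target box that re-pulls and recreates containers when something changed. A rough sketch, assuming a docker-compose based app (paths and names are made up, and this is not what 37signals describes doing):

```sh
#!/bin/sh
# naive pull-model deploy, run from cron every few minutes.
# avoids exposing SSH or running a self-hosted Actions runner,
# at the cost of a polling delay.
cd /srv/myapp || exit 1
docker compose pull --quiet            # fetch newer images for the tags in the compose file, if any
docker compose up -d --remove-orphans  # recreate only containers whose image/config changed
```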
What do you do then, if you don't mind me asking? I see this problem time and time again for self-hosting and using CI/CD, and every time it seems to come down to exposing SSH, polling for new versions, or running the GitHub Actions runner on the same machine as the app or service.
K8s has some cognitive overhead.
For a simple deploy, a Docker client-server setup with docker-compose is a winner; see misterio[1], which basically leverages docker-compose + SSH.
But when you need to guarantee that the system will auto-restart, run healthchecks, and so on, K8s is the de facto standard.
Helm's template language (based on Go templates) is not ideal, but it is difficult to replace K8s with simpler systems nowadays.
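To illustrate the simple-deploy end of that spectrum: plain Docker over SSH (roughly what a misterio-style setup drives) already gives you restart policies and healthchecks. A minimal sketch with made-up host, image, and endpoint names:

```sh
# run the app on a remote host over SSH; restart it on crash and track health.
# note: Docker only *marks* the container unhealthy, it won't reschedule it
# the way K8s would, which is exactly the gap mentioned above.
DOCKER_HOST=ssh://deploy@app-host docker run -d \
  --name web \
  --restart unless-stopped \
  --health-cmd 'curl -fsS http://localhost:3000/up || exit 1' \
  --health-interval 30s --health-retries 3 \
  myorg/myapp:latest
```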
The timing is perfect: several of my clients fell through at Series B and are now looking to cut cloud costs (all way over-provisioned for their traffic and customer numbers).
They're saying a VM takes seconds to boot up. Yeah, only because they run static dedicated servers; of course in the cloud, if you wait for the VM to come online, it's going to take longer. Now, how long does it take them to add a new dedicated server to that pool of servers? Days?
The other main issue I see is that they use Chef and mrsk to set up applications, so how is Filebeat set up? Is it Chef that sets it up, or mrsk?
I started my career at a company that was excellent at capacity management and prediction. Using physical hardware they never hit a capacity problem, ever, despite growing like crazy. This did require the Marketing department being in close communication with the IT department about upcoming campaigns.
Everywhere else, though, has been terrible at predicting future capacity needs. As far as I can tell that's because they just use tools that give a prediction based only on historical growth.
I guess my point is that it's entirely possible to be good at capacity management, and if you are then the lead time disadvantage of physical hardware can be completely negated.
It's easier to massively over provision or use the cloud than it is to get good at capacity planning. Same as how it's easier to use a GC than it is to do manual memory management.
They are all valid strategies, the key is picking the one that suits your situation.
If you need a small to medium amount of resources then the cloud is likely the cheapest option.
If you need a medium to high amount of resources then massively over provisioning can still be cheaper than using the cloud.
The cheapest option for anything medium size and above is physical servers with good capacity management.
Good capacity management requires good internal communication between business units. And making predictions based on expected/planned events not just historical data.
We have 7 racks, 3 people, and the actual hardware work is a minuscule part of that. A few hundred VMs, anything from "just software running on a server" to k8s stacks (the biggest one is 30 nodes), 2 Ceph clusters (ours and a client's), and a bunch of other shit.
The stuff you mentioned is, amortized, around 20% (automation ftw). The rest of it is stuff that we would do in the cloud anyway, and the cloud is in general harder to debug too (we have a few smaller projects managed in the cloud for customers).
We did the calculation to move to the cloud a few times now; it was never even close to profitable, and we wouldn't save on manpower anyway, as 24/7 on-call is still required.
So I call bullshit on that.
If you are a startup, by all means go cloud.
If you are small, same thing: going on-prem is not worth it.
If you have spiky load, cloud or hybrid will most likely be cheaper.
But if you have constant load (by that I mean the difference between peak and lowest traffic is "only" like 50-60%) and need a bunch of servers to run it (say 3+ racks), it might actually be cheaper on-site.
Or a bunch of dedicated servers. Then you don't need to bother managing hardware, and in case of a boom you can even scale relatively quickly.
Every one of your examples in the second list is relevant to both on-prem and cloud. The cloud also has on-call, just not for the hardware issues (you'll still likely get a page for reduced availability of your software).
The problem here is “cloud” can mean different things.
If you’re talking about virtual machines running in a classical networking configuration then you’re not really leveraging “the cloud”: all you’ve done is shifted the location of your CPUs.
However if you’re using things like serverless, managed databases, SaaS, then most of the problems in the second list are either solved or much easier to solve in the cloud.
The problem with “the cloud” is you either need highly variable on-demand compute requirements or a complete re-architecture of your applications for cloud computing to make sense. And this is something that so many organisations miss.
I’ve lost count of the number of people who have tried to replicate their on-prem experience to cloud deployments and then came to the same conclusions as yourself. But that’s a little like trying to row a boat on land and then saying roads are a rubbish way to filter traffic. You just have to approach roads and rivers (or cloud and on-prem) deployments with a different mindset because they solve different problems.
This is simply not true unless you build in the cloud the same way you build on prem and just have a bunch of VMs. PaaS services get you away from server / network / driver maintenance and handle disaster recovery and replication out of the box. If you're primarily using IaaS, you likely shouldn't be in the cloud unless you're really leveraging the bursting capabilities.
“Just not for the hardware issues” is a huge deal though. That’s an entire skillset you can eliminate from your requirements if you’re only in the cloud. Depending on the scale of your team this might be a massive amount of savings.
At my last job, I would have happily gone into the office at 3am to swap a hard drive if it meant I didn't have to pay my AWS bill anymore. Computers are cheap. Backups are annoying, but you have to do them in the cloud too. (Accidentally deleting your Cloud SQL instance deletes all the automatic backups with it, so you have to roll your own if you care at all. Things like that; cloud providers remove some annoyances, and then add their own. If you operate software in production, you have to tolerate annoyance!)
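(If you do roll your own there, the usual workaround is a periodic export to a bucket you control; roughly something like this on a cron, with hypothetical instance/bucket/database names:)

```sh
# nightly export so a deleted instance doesn't take its backups with it;
# the instance, bucket, and database names below are placeholders.
gcloud sql export sql my-instance \
  "gs://my-offsite-backups/app-$(date +%F).sql" \
  --database=app
```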
Self-managed Kubernetes is no picnic, but nothing operational is ever a picnic. If it's not debugging a weird networking issue with tcpdump while sitting on the datacenter floor, it's begging your account rep for an update on your ticket twice a day for 3 weeks. Pick your poison.
The flip side is there is an entirely new skillset required to successfully leverage the cloud.
I suspect those cloud skills are also higher demand and therefore more expensive than hiring for people to handle hardware issues.
Personally, I appreciate the contrarian view because I think many businesses have been naive in their decision to move some of their workloads into the cloud. I'd like to see a broader industry study that shows what benefits are actually realized in the cloud.
Right. The skillset to pull the right drive from the server and put the replacement one in.
That says you know nothing at all about actually running hardware, because the bigger problems are by far "the DC might be a 1-5 hour drive away" or "we have no spare parts at hand", not "fiddling with the server is super hard".
Kubernetes is an amazing tool. Cloud computing is a powerful way to leverage a small team and prototype stuff quickly.
Ocean going ships are impressive pieces of kit. CNC machine tools are a powerful way to leverage small teams and manufacture high quality stuff quickly.
Now, telling every repair business in town they need robotic lathes and a fleet of major cargo ships is nonsense.
Why this kind of discourse thrives in software is beyond me.
Because I have some light sensitivity issues, I use browser extensions including Dark Reader and Midnight Lizard to enforce my own 'dark mode' across the web.
You can also use extensions like that to set the contrast to a more comfortable level on websites that are already dark.
I highly recommend this if you have light sensitivity issues like me.
Also note that when the contrast on a page is higher, you can generally get away with lower brightness. This is pretty convenient on phones, and probably more necessary as well since on a phone you're more likely to have an OLED screen that really surfaces extreme contrast like white on black.
There are some great web extensions for a lot of things. I don't use any of them because most require permission to read data across all sites; that makes sense for them to work, but it's still a dealbreaker for me.
Fair enough. I only use long-lived, open-source browser extensions for that kind of global restyling. But of course there's still a risk that they could be compromised somehow.
37signals has many technical people and can afford to de-k8s. But K8s is designed for a totally different use case: large corps with mostly non-IT staff, where IT resources and standards need to be managed in a more central way. Most standard banks or large companies do not want to roll this stuff by hand; they care about STANDARDS!
That's the thing with technology though, it goes mainstream as adoption grows. RoR started small at 37signals and eventually became a standard. MRSK might yet be one, there's no telling right now.
We ran a fast-growing startup (sometimes 100% MoM jumps), 5M active users with 50k concurrent users (not visitors) with DB writes, on 6 machines + 2 DB servers and $100M ARR, 10 years back. If you're this size, MRSK makes total sense.
If you're much larger or growing >50% MoM continuously, K8s in the cloud makes more sense.
Neither cloud nor on-prem fits everyone's requirements; in the end you need to know your environment well. One thing I like about ECS and Fargate is that you can use projects like my_init and get a container to behave closer to a VM (running ssh and other daemons at the same time).
Online deployments are discussed with some frequency. Tooling is talked about. Always as a "cluster". Why do we need clusters anymore? Scaling containers, scaling functions, scaling, scaling, cluster, cluster. We suffer so much tunnel vision about horizontal scaling when it's just unnecessary for most applications these days. The cloud products are all about horizontal.
Do you really need more than the 400 threads, >12TB of RAM, and PBs of storage found in a reasonable high-end server?
Well... in my book, k8s has always been in the "dinosaur" category: somewhat useful, somewhat versatile, perhaps even good. A quick glance at the documentation eradicates any desire to learn the tech.
Looks like an apples vs. oranges comparison. They seem to have a low number of distinct services, so there isn't a real need for k3s/k8s (i.e. orchestration); on the other hand, they do need config management.
Have they thought about just running OpenStack on their own servers? Everything I saw leads me to SaltStack + OpenStack if they don't want to be in the cloud.
I have to imagine part of the reason they need to run so many servers is because they are running Ruby. The same application on say, Elixir, probably would require less hardware, reducing the cost of ECS or similar.
If I was Netflix I would de-cloud, but if I was a small team like 37signals then de-clouding is just insanity. I think DHH is either very stupid or extremely naive in his cost calculations or probably a mix of both. Hey and Basecamp customers will see many issues in the next few years and hackers will feast off their on-premise infrastructure.
They’ve had non-cloud infrastructure for a very long time. Their new orchestration methods notwithstanding, reliability and security are unlikely to suffer.
I find it very interesting that every conversation around k8s turns into a flame war between "just use k8s" and "no you don't need k8s at all". In reality, it is probably more of a spectrum than a boolean value. Also, it seems like people have different definitions of "using kubernetes":
* manage your own k8s cluster on your own hardware: probably pretty hardcore; I've never done this. I'd imagine it'd require me to know about the underlying hardware, diagnose issues, and make sure the computer itself is running before managing k8s itself. Only when the hardware is running properly can I focus on running k8s, which is also operationally expensive. Tbh I don't see a reason for a small/mid scale product to go this route unless they have a very specific reason.
* manage your own k8s cluster on cloud hardware: this seems a bit simpler, meaning that I don't actually need to know much about running/managing hardware; that's what the provider does for me. I have done this before with k3s for some small applications (the standard k3s install one-liner is sketched after this list), and I have 2 small-scale applications running like this for ~2 years now on Oracle's free ARM instances. I don't really do any active work/maintenance on them and they are running just fine. I'd probably have a lot of trouble if I wanted to upgrade the k3s version for large-scale applications, or use cases that have tight SLAs.
* use a managed k8s offering from a cloud provider: I've been doing this one the most, and I find it the easiest way to run my applications in a standardized way. I have experience running applications on this setup for mid-scale as well as multi-national, large-scale consumer-facing applications. Admittedly, even though the scale has been big, the applications themselves have mostly been CRUD APIs, RabbitMQ / Kafka consumers, and some scheduled jobs.
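(For reference, the k3s route in the second bullet really is about this small to get started; this is the upstream installer's documented one-liner, nothing specific to my setup:)

```sh
# installs k3s as a single-node cluster with bundled containerd;
# plenty for the kind of small self-managed setups described above.
curl -sfL https://get.k3s.io | sh -
```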
The trick seems to lie in the word "standardized" here: it is probably possible to run any application on any sort of hardware/orchestration combination, and MRSK could be a really nice solution for that as well. However, in my personal experience I have never managed to find an easier way of running multiple full applications, e.g. things that have multiple components such as web APIs, async workers, etc, in a standardized, replicable way.
I run the following components in one of my cloud-managed k8s clusters:
- Vault
- A few Laravel applications
- A few Golang APIs
- Grafana & Loki
- Metabase
Using k8s for situations like this, where the specific requirements on the underlying infra are not very complex, actually enables a lot of experimentation / progress simply thanks to the ecosystem. For all of these components there are either ready-made Helm charts where I can simply run a `helm install` and be 90% there, or it is trivial to build a simple K8s deployment configuration to run them. In my experience, I couldn't find anything that comes closer to this without having a large engineering team dedicated to solving a very specific problem. In fact, it has been pretty chill to rely on the managed k8s offerings and just focus on my applications.
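For instance, standing up Grafana that way is roughly the following (chart repo URL and release/namespace names are from memory, so double-check before copying):

```sh
# add the upstream chart repo and install with default values;
# tweaking values.yaml comes later, but this is the "90% there" starting point.
helm repo add grafana https://grafana.github.io/helm-charts
helm repo update
helm install grafana grafana/grafana --namespace monitoring --create-namespace
```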
It's a spectrum: there are a billion cases that don't need k8s, and probably a similar number that could actually benefit from it. There's no absolute truth to it other than the fact that k8s is actually useful for certain cases and is for sure not always "resume driven development". This doesn't mean we shouldn't be looking out for better approaches, and there's probably a lot of accidental complexity around it as well, but we could also acknowledge that it is actually a useful piece of software.
I don't know, I feel like I have to pick sides every time this sort of stuff is discussed, as if there is an objective truth, but I am fairly convinced these days that there is a middle ground that doesn't involve fanaticism in either direction.
I've never read such a ridiculous article. I really wanted to give them the benefit of the doubt but good lord. How is any of this simpler or better? It's like they prefer the pain of 2004 mixed with the worst parts of modern infrastructure. The dream of the 2000s really is alive in DHH's head, isn't it?
TFA links to their cloud spend for 2022[0], wherein lies the rub:
> In total, we spent $3,201,564 on all these cloud services in 2022. That comes out to $266,797 per month. Whew!
> For HEY, the yearly bill was $1,066,150 ($88,846/month) for production workloads only. That one service breaks down into big buckets as follows:
What the actual fuck? THREE MILLION DOLLARS? A million for their email service?? I have seen bills much larger, but for what 37signals does I am shocked. There is surely a ton of low hanging fruit to drop the bill despite the claim that it's as optimized as it can get. No way.
Even then, Hey is $99/year, and they claimed to have 25k users in the first month or so as of 2020, that's nearly $2.5MM. I presume they've grown since then. Another 2020 article[2] mentions 3/4 of their users have the iOS app, and the Android app currently shows "50k+ installs" so let's assume we're talking 200-400k users as a ceiling, ignoring attrition, which would pull $20-40MM. Even if it's half that, the cost doesn't seem unreasonable.
They're spending nearly $90k/mo on Hey. Of that the majority is RDS and OpenSearch. TFA makes it clear they know how to run MySQL, why on earth don't they stop running RDS? Both of these can easily be halved if they ran the services manually.
EKS is practically free so whatever. They state they have two deployments for ~$23k/mo total -- production is likely larger than staging but let's assume they're equal -- or ~$12k/mo each. A middle of the road EC2 instance like m4.2xlarge is less than $215/mo which gets more than enough cores and memory to run a rails app or two per node. That works out to around 55 nodes per environment. This benchmark[3] shows an m4.2xlarge can serve 172req/s via modern Ruby on Rails. At 500k users that works out to over 1600 request/user/day which seems excessive but likely within an order of magnitude of reality. These are the folks who wrote RoR so I would hope they can optimize this down further. <10000req/s for $12k/mo is pretty awful, and I'm being conservative.
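Rough numbers behind that paragraph, for anyone who wants to check the arithmetic (all inputs are my own estimates above, not 37signals' actual figures):

```sh
# back-of-the-envelope for the EKS paragraph above; every input is an estimate.
monthly_per_env=12000       # ~$23k/mo split across two environments
instance_cost=215           # rough m4.2xlarge monthly cost
rps_per_instance=172        # benchmark figure cited above
users=500000                # generous user-count ceiling

nodes=$(( monthly_per_env / instance_cost ))      # ~55 nodes per environment
total_rps=$(( nodes * rps_per_instance ))         # ~9,460 req/s
per_user_day=$(( total_rps * 86400 / users ))     # ~1,634 requests/user/day
echo "$nodes nodes, $total_rps req/s, $per_user_day req/user/day"
```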
Then let's talk about the $1MM/mo S3 bill. I'm not sure how to make 8PB cost that much but even the lightest touch at optimizing storage or compression or caching knocks the cost down.
This is all just nuts. There's no reason this all shouldn't be running on AWS or GKE with a much smaller bill. Their apps are predominantly CRUD, some email. Instead they replaced kubernetes with an in-house monstrosity.