It should be, but isn’t. Most of the abstractions try to add an immutability layer over something that isn’t immutable and is very definitely knee-deep in locks, state, and other hard-to-resolve problems, which compromise it entirely. Ergo you now have two layers of problems to deal with. If you buy too far in, you’re living a lie.
Oh yes we will indeed just redeploy that container. Oh well no we can’t because the PV is stuck.
Oh yes we’ll just redeploy all the SQS queues. Oh no wait they have data in them.
The article seems to be tying two ideas together - immutability and infrastructure as code. I’m unsure why the author decided to link the two ideas.
Immutability, IMHO, is a terrible idea for infrastructure as a general-purpose rule. Making it more complicated for developers to deploy systems, and requiring deployments to go through many steps (immutability typically requires a full blue/green deploy), means MTTR will be longer.
Also, trying to fix a problem in situ takes longer (because you don’t have the tools for it).
Lastly, immutability is a lie. If you’re using something like v8, over time, the JIT and GC may change behavior. The cloud vendor may deploy new software.
If your software can be deployed as a fully self contained, stateless application, perhaps you can ignore these problems, especially by constantly recycling infrastructure. Now, the real problems begin to occur when you stop recycling - that’s when really interesting longevity problems appear.
IMO immutability is a byproduct of adopting Terraform: it doesn't manage day-2 operations like patching, so the solution in some people's minds is "immutability" instead of bringing in another tool to manage the day-2 stuff.
I personally see immutability as a sign of organizational immaturity around system administration, if you're concerned about drift on production systems you don't have the right config management and RBAC in place. It's never more efficient to replace a fleet vs patching it in place, especially from a time standpoint.
Patching is relatively easy to solve with immutable infrastructure. Build a new fully patched AMI every night and when you release you pick (version of code, version of AMI) to deploy.
I don't think the benefits of this outweigh the complexity and time costs to create and operate it. Having a scheduled task patch the servers regularly with extremely limited "break the glass" RBAC on your servers is better for 99% of environments.
There's really no complexity. Our deploys are blue-green (i.e. two ASGs and a load balancer, no need to make it sound fancy). You need to do this anyway, because instances in the cloud are inherently unreliable.
Deploying new code: publish a new container, scale up green, switch, whack blue at your leisure.
Patching: publish a new AMI, scale up green, switch, whack blue at your leisure.
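The two-ASG pattern described above might be sketched in Terraform roughly like this. This is an illustrative, minimal sketch, not the commenter's actual config: the resource names, variables, and the choice to toggle capacity with a `green_active` flag are all assumptions.

```hcl
# Hypothetical sketch: the "green" half of a blue-green pair behind one
# shared ALB target group. "blue" is a mirror image of this. A deploy
# scales up the idle colour on the new AMI, then drains the old one.

resource "aws_launch_template" "green" {
  name_prefix   = "app-green-"
  image_id      = var.patched_ami_id   # the nightly, fully patched AMI
  instance_type = "t3.medium"
}

resource "aws_autoscaling_group" "green" {
  name                = "app-green"
  desired_capacity    = var.green_active ? var.fleet_size : 0
  min_size            = 0
  max_size            = var.fleet_size
  vpc_zone_identifier = var.subnet_ids

  launch_template {
    id      = aws_launch_template.green.id
    version = "$Latest"
  }

  # Both colours attach to the same target group; the load balancer only
  # routes to whichever colour currently has instances in service.
  target_group_arns = [var.target_group_arn]
}
```

Code deploys and OS patching then really are the same operation: change the image behind the launch template, scale green up, scale blue down.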
1. Every night we build a new freshly patched Ubuntu image.
2. Three times a week we run a deploy that rebuilds our entire infrastructure in place with the latest Ubuntu image as the base [1].
3. Once the deploy runs we never run apt install or apt update again.
So rather than
server.update()
we do
server = server.update()
[1] This sounds fancy but I promise it's stupid simple, all the complexity is in the auto scaling groups and the load balancers which we don't manage. And it's the exact same procedure to do code deploys which we do multiple times a day.
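One way to wire up step 2 is a Terraform data source that always resolves to the newest nightly image, so a routine apply rolls the fleet. A minimal sketch, assuming the nightly build publishes AMIs under a predictable name (the `base-ubuntu-*` naming is a made-up convention for illustration):

```hcl
# Hypothetical: a nightly image build publishes AMIs named
# "base-ubuntu-<timestamp>". This data source picks the newest one.
data "aws_ami" "nightly_base" {
  most_recent = true
  owners      = ["self"]

  filter {
    name   = "name"
    values = ["base-ubuntu-*"]
  }
}

# Referencing the data source from the launch template means every apply
# that follows a nightly build replaces instances with freshly patched
# ones: the "server = server.update()" model, never patching in place.
resource "aws_launch_template" "app" {
  name_prefix = "app-"
  image_id    = data.aws_ami.nightly_base.id
}
```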
I broadly agree. However, there are a few issues I have with this piece.
> When an error occurs, operations can just redeploy
I mean no. If there is an error you should figure out what the problem is first.
Also, if operations just blindly redeploys when things go wrong, you have some large culture issues that need to be solved first.
> Operations can quickly return to a previous state
I kinda agree. However, the hardest thing to fix is normally the state that was fucked up by whatever outage happened. Assuming you have the correct mechanism for inter-service comms, it's really simple to kill/restart services, so long as you have no data in flight.
What's not easy to fix is when the stored state is what's causing the crash. Those are normally the big problems: bad messages/config causing rolling outages.
> Your operations team can adopt a workflow
> [..]You were dropped with a bunch of co-workers and all of them did their own thing[..]
I mean, that is a massive problem right there. There is either the right way to deploy things, or there isn't. That's a culture problem and needs to be fixed, and fixing it is very, very hard.
On the whole I broadly agree that prod should be >95% immutable, and nowadays it's fairly simple to achieve.
One thing I would recommend is running "prod" in two regions with some sort of health-checked load balancer in front. This allows testing of IaC stuff in prod without bringing the entire service to a grinding halt when stuff fails. Yes, managing state can be much harder, and it's important to partition your state by region so you are not hampered by cross-region syncing.
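The health-checked front door across two regions can be sketched with Route 53 failover routing. This is a minimal, hypothetical example; the hostnames, health-check path, and zone variable are illustrative:

```hcl
# Health check against the primary region's endpoint.
resource "aws_route53_health_check" "primary" {
  fqdn              = "prod-us-east-1.example.com"
  type              = "HTTPS"
  resource_path     = "/healthz"
  failure_threshold = 3
  request_interval  = 30
}

resource "aws_route53_record" "primary" {
  zone_id         = var.zone_id
  name            = "app.example.com"
  type            = "CNAME"
  ttl             = 60
  set_identifier  = "primary"
  health_check_id = aws_route53_health_check.primary.id
  records         = ["prod-us-east-1.example.com"]

  failover_routing_policy {
    type = "PRIMARY"
  }
}

# A matching SECONDARY record points at the other region. Traffic only
# shifts when the primary health check fails, so one region's "prod" can
# be torn down and rebuilt (the DR drill) without taking the service out.
```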
But the advantages are that you are much better placed to survive an outage, or resource crunch. But most importantly it allows you to completely kill and rebuild a "prod" instance from almost scratch. This is key for DR preparedness.
> I mean no. If there is an error you should figure out what the problem is first.
Figuring out what the problem is may take time. If a problem is impacting a customer’s experience, I’d rather recover faster and then pull metadata later to try and understand the issue.
I think it ultimately comes down to the situation and how well you know your application. Sometimes restarting it will fuck up the data even more. It almost certainly indicates an engineering flaw, but that doesn't mean it won't happen.
> Figuring out what the problem is may take time. If a problem is impacting a customer’s experience, I’d rather recover faster and then pull metadata later to try and understand the issue.
How do you know it'll fix the problem, if you don't know the cause?
In my experience, about 50% of the time spent fixing an incident is cleaning up the mess you made performing a fix based on a hypothesis that turned out to be wrong. That means longer downtime.
A classic example of this is when a 60-disk RAID unit crashed while performing a rebuild under load. This happened to my coworker during the daily disk-replacement task. They put in the disks, shut the drawer, pressed the button that changes the activity lights to check that everything was working, and the thing froze.
They rightly shat their pants. They panicked and rebooted the array and the attached file server. A load of data was lost (the file servers had, at the time, the maximum amount of RAM you could shove in a standard 2U two-processor Intel server).
When it happened again, another coworker waited for the RAID enclosure to reboot. Lo and behold, the SAS links came back and we were able to do an online fsck inside 3 minutes. No data lost.
If you're facing an outage, breathing space to gather your thoughts is time well invested. Blind hacking, or indeed decision paralysis, is damaging.
> I mean no. If there is an error you should figure out what the problem is first.
The first thing you want to do (after confirming the issue and communicating) is "fly the plane" and try to contain/mitigate the issue before debugging for a root cause.
Immutable means "something that cannot be changed" and is used somewhat strangely here. Better to be more concrete:
Yes, we want to have declarative configurations and idempotent "terraform apply".
Yes we want to have containers and subsystems with no persistent state, and careful architecture for persistent storage volumes and database services.
And please remember to say "prevent_destroy", "ignore_changes" etc for that persistent data where appropriate. And remember that the automation that creates your system in one command will be able to destroy it in one command, if you say so.
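For the persistent pieces, those guard rails are Terraform's lifecycle meta-arguments. A minimal sketch (the EBS volume and its attributes are just an example resource, not from the original comment):

```hcl
resource "aws_ebs_volume" "customer_data" {
  availability_zone = "us-east-1a"
  size              = 500

  lifecycle {
    # Fail the plan rather than let any refactor delete this volume.
    prevent_destroy = true

    # Don't "correct" attributes that are managed out of band.
    ignore_changes = [tags]
  }
}
```

With `prevent_destroy` set, the one-command automation that built the system will refuse to be the one command that destroys its data.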
I mean, this is a common problem with architecture astronomy. It's inherently leaky. Our job is to simplify, not to abstract everything into "class Thing()".
This is one that gets me every time. “Oh, we’ll just roll out our shit in middle-of-nowhere-west-1c if it all goes to crap.” The first time someone does that DR drill, they find out there’s a core service missing that they have built their entire empire on.
Also, there’s contention whenever that happens. During the last AWS outage we experienced, entire availability zones had no capacity available. Where there was capacity, it took forever to spin anything up and migrate volume snapshots over, because everyone else was doing the same thing at the same time.
This blog post mixes up infrastructure as code (a good thing) and immutable infrastructure (a good thing, for some use cases).
Immutable infrastructure is great for stateless services: it ensures you know exactly what versions are in use and prevents people from making changes on the server.
For databases and other stateful services, such as message brokers, it's less ideal.
Yes, in an ideal world it should. Also, I agree with someone here who noted that immutability and IaC are related terms but not the same, and the article mostly covers IaC best practices (in a good way!).
Even if your infrastructure is immutable, most things will still have mutable data... you can't just redeploy through any issue, because the issue could be with customer data.
Generally, that mutable data is on externally mounted volumes or stores though and not in the container itself, right? Except for what is read into container memory and written out to temporary or work files.
True, but restarting a service, remounting all the data, waiting until everything has synced, etc., just to apply a security patch is a big downside of the immutable approach.
So fun story: I did something like this where the image was immutable. Due to a time-bomb type bug in the code, it created and killed hundreds (thousands?) of Azure VMs in the space of a couple of days. We never had to really pay for it above normal costs due to there always being the same number of running VMs.
Azure reached out though, and in so many words asked: wtaf are y’all doing and can you please stop?