It should be, but isn’t. Most of the abstractions try to add an immutability layer over something that isn’t immutable and is very definitely knee-deep in locks, state, and other hard-to-resolve problems, which compromise it entirely. Ergo you now have two layers of problems to deal with. If you buy too far in, you’re living a lie.
Oh yes we will indeed just redeploy that container. Oh well no we can’t because the PV is stuck.
Oh yes we’ll just redeploy all the SQS queues. Oh no wait they have data in them.
The article seems to be tying two ideas together - immutability and infrastructure as code. I’m unsure why the author decided to link the two ideas.
Immutability, IMHO, is a terrible idea for infrastructure as a general-purpose rule. Making it more complicated for developers to deploy systems, and requiring deployments to go through many steps (immutability typically requires a full blue/green deploy), means MTTR will be longer.
Also, trying to fix a problem in situ takes longer (because you don’t have the tools for it).
Lastly, immutability is a lie. If you’re using something like v8, over time, the JIT and GC may change behavior. The cloud vendor may deploy new software.
If your software can be deployed as a fully self contained, stateless application, perhaps you can ignore these problems, especially by constantly recycling infrastructure. Now, the real problems begin to occur when you stop recycling - that’s when really interesting longevity problems appear.
IMO immutability is a byproduct of adopting Terraform: it doesn't manage day-2 operations like patching, so the solution in some people's minds is "immutability" instead of bringing in another tool to manage the day-2 stuff.
I personally see immutability as a sign of organizational immaturity around system administration, if you're concerned about drift on production systems you don't have the right config management and RBAC in place. It's never more efficient to replace a fleet vs patching it in place, especially from a time standpoint.
Patching is relatively easy to solve with immutable infrastructure. Build a new fully patched AMI every night and when you release you pick (version of code, version of AMI) to deploy.
I don't think the benefits of this outweigh the complexity and time costs to create and operate it. Having a scheduled task patch the servers regularly with extremely limited "break the glass" RBAC on your servers is better for 99% of environments.
There's really no complexity. Our deploys are blue-green (i.e. two ASGs and a load balancer, no need to make it sound fancy). You need to do this anyway, because instances in the cloud are inherently unreliable.
Deploying new code: publish a new container, scale up green, switch, whack blue at your leisure.
Patching: publish a new AMI, scale up green, switch, whack blue at your leisure.
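The two-ASG pattern described above might be sketched in Terraform roughly like this. This is an illustrative, minimal sketch, not the commenter's actual config: the resource names, variables, and the choice to toggle capacity with a `green_active` flag are all assumptions.

```hcl
# Hypothetical sketch: the "green" half of a blue-green pair behind one
# shared ALB target group. "blue" is a mirror image of this. A deploy
# scales up the idle colour on the new AMI, then drains the old one.

resource "aws_launch_template" "green" {
  name_prefix   = "app-green-"
  image_id      = var.patched_ami_id   # the nightly, fully patched AMI
  instance_type = "t3.medium"
}

resource "aws_autoscaling_group" "green" {
  name                = "app-green"
  desired_capacity    = var.green_active ? var.fleet_size : 0
  min_size            = 0
  max_size            = var.fleet_size
  vpc_zone_identifier = var.subnet_ids

  launch_template {
    id      = aws_launch_template.green.id
    version = "$Latest"
  }

  # Both colours attach to the same target group; the load balancer only
  # routes to whichever colour currently has instances in service.
  target_group_arns = [var.target_group_arn]
}
```

Code deploys and OS patching then really are the same operation: change the image behind the launch template, scale green up, scale blue down.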
1. Every night we build a new freshly patched Ubuntu image.
2. Three times a week we run a deploy that rebuilds our entire infrastructure in place with the latest Ubuntu image as the base [1].
3. Once the deploy runs we never run apt install or apt update again.
So rather than
server.update()
we do
server = server.update()
[1] This sounds fancy but I promise it's stupid simple, all the complexity is in the auto scaling groups and the load balancers which we don't manage. And it's the exact same procedure to do code deploys which we do multiple times a day.
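One way to wire up step 2 is a Terraform data source that always resolves to the newest nightly image, so a routine apply rolls the fleet. A minimal sketch, assuming the nightly build publishes AMIs under a predictable name (the `base-ubuntu-*` naming is a made-up convention for illustration):

```hcl
# Hypothetical: a nightly image build publishes AMIs named
# "base-ubuntu-<timestamp>". This data source picks the newest one.
data "aws_ami" "nightly_base" {
  most_recent = true
  owners      = ["self"]

  filter {
    name   = "name"
    values = ["base-ubuntu-*"]
  }
}

# Referencing the data source from the launch template means every apply
# that follows a nightly build replaces instances with freshly patched
# ones: the "server = server.update()" model, never patching in place.
resource "aws_launch_template" "app" {
  name_prefix = "app-"
  image_id    = data.aws_ami.nightly_base.id
}
```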
I broadly agree. However, there are a few issues I have with this piece.
> When an error occurs, operations can just redeploy
I mean no. If there is an error you should figure out what the problem is first.
Also, if operations just blindly redeploys when things go wrong, you have some large culture issues that need to be solved first.
> Operations can quickly return to a previous state
I kinda agree. However, the hardest thing to fix is normally the state that was fucked up by whatever outage happened. Assuming you have the correct mechanism for inter-service comms, it's really simple to kill/restart services, so long as you have no data in flight.
What's not easy to fix is when the stored state is what's causing the crash. Those are normally the big problems: bad messages/config causing rolling outages.
> Your operations team can adopt a workflow
> [..]You were dropped with a bunch of co-workers and all of them did their own thing[..]
I mean, that is a massive problem right there. There is either the right way to deploy things, or there isn't. That's a culture problem and needs to be fixed, and fixing it is very, very hard.
On the whole I broadly agree that prod should be >95% immutable, and nowadays it's fairly simple to achieve.
One thing I would recommend is running "prod" in two regions with some sort of health-checked load balancer in front. This allows testing of IaC stuff in prod without bringing the entire service to a grinding halt when stuff fails. Yes, managing state can be much harder, and it's important to partition your state by region so you are not hampered by cross-region syncing.
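The health-checked front door across two regions can be sketched with Route 53 failover routing. This is a minimal, hypothetical example; the hostnames, health-check path, and zone variable are illustrative:

```hcl
# Health check against the primary region's endpoint.
resource "aws_route53_health_check" "primary" {
  fqdn              = "prod-us-east-1.example.com"
  type              = "HTTPS"
  resource_path     = "/healthz"
  failure_threshold = 3
  request_interval  = 30
}

resource "aws_route53_record" "primary" {
  zone_id         = var.zone_id
  name            = "app.example.com"
  type            = "CNAME"
  ttl             = 60
  set_identifier  = "primary"
  health_check_id = aws_route53_health_check.primary.id
  records         = ["prod-us-east-1.example.com"]

  failover_routing_policy {
    type = "PRIMARY"
  }
}

# A matching SECONDARY record points at the other region. Traffic only
# shifts when the primary health check fails, so one region's "prod" can
# be torn down and rebuilt (the DR drill) without taking the service out.
```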
But the advantages are that you are much better placed to survive an outage, or resource crunch. But most importantly it allows you to completely kill and rebuild a "prod" instance from almost scratch. This is key for DR preparedness.
> I mean no. If there is an error you should figure out what the problem is first.
Figuring out what the problem is may take time. If a problem is impacting a customer’s experience, I’d rather recover faster and then pull metadata later to try and understand the issue.
I think it ultimately comes down to the situation and how well you know your application. Sometimes restarting it will fuck up the data even more. It almost certainly indicates an engineering flaw, but that doesn't mean it won't happen.
> Figuring out what the problem is may take time. If a problem is impacting a customer’s experience, I’d rather recover faster and then pull metadata later to try and understand the issue.
How do you know it'll fix the problem, if you don't know the cause?
In my experience, about 50% of the time spent fixing an incident is cleaning up the mess you made performing a fix based on a hypothesis that turned out to be wrong. That means longer downtime.
A classic example of this is when a 60-disk RAID unit crashed while performing a rebuild under load. This happened to my coworker during the daily disk-replacement task. They put in the disks, shut the drawer, pressed the button that changes the activity lights to check that everything was working, and the thing froze.
They rightly shat their pants. They panicked and rebooted the array and the attached file server. A load of data was lost (the file servers had, at the time, the maximum amount of RAM you could shove in a standard 2U two-processor Intel server).
When it happened again, another coworker waited for the RAID enclosure to reboot. Lo and behold, the SAS links came back and we were able to do an online fsck inside 3 minutes. No data lost.
If you're facing an outage, breathing space to gather your thoughts is time well invested. Blind hacking, or indeed decision paralysis, is damaging.
> I mean no. If there is an error you should figure out what the problem is first.
The first thing you want to do (after confirming the issue and communicating) is "fly the plane" and try to contain/mitigate the issue before debugging for a root cause.
Immutable means "something that cannot be changed" and is used somewhat strangely here. Better to be more concrete:
Yes, we want to have declarative configurations and idempotent "terraform apply".
Yes we want to have containers and subsystems with no persistent state, and careful architecture for persistent storage volumes and database services.
And please remember to say "prevent_destroy", "ignore_changes" etc for that persistent data where appropriate. And remember that the automation that creates your system in one command will be able to destroy it in one command, if you say so.
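For the persistent pieces, those guard rails are Terraform's lifecycle meta-arguments. A minimal sketch (the EBS volume and its attributes are just an example resource, not from the original comment):

```hcl
resource "aws_ebs_volume" "customer_data" {
  availability_zone = "us-east-1a"
  size              = 500

  lifecycle {
    # Fail the plan rather than let any refactor delete this volume.
    prevent_destroy = true

    # Don't "correct" attributes that are managed out of band.
    ignore_changes = [tags]
  }
}
```

With `prevent_destroy` set, the one-command automation that built the system will refuse to be the one command that destroys its data.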
I mean, this is a common problem with architecture astronomy. It's inherently leaky. Our job is to simplify, not to abstract everything into "class Thing()".
This is one that gets me every time. “Oh, we’ll just roll out our shit in middle-of-nowhere-west-1c if it all goes to crap.” The first time someone does that DR drill, they find out there’s a core service missing that they have built their entire empire on.
Also, there’s contention whenever that happens. During the last AWS outage we experienced, entire availability zones had no capacity available. Where there was capacity, it took forever to spin anything up and migrate volume snapshots over, because everyone else was doing the same thing at the same time.
This blog post mixes up infrastructure as code (a good thing) and immutable infrastructure (a good thing, for some use cases).
Immutable infrastructure is great for stateless services: it ensures you know exactly what versions are in use and prevents people from making changes on the server.
For databases and other stateful services, such as message brokers, it's less ideal.
Yes, in an ideal world it should. Also, I agree with someone here who noted that immutability and IaC are related terms but not the same, and the article mostly covers IaC best practices (in a good way!).
Even if your infrastructure is immutable, most things will still have mutable data... you can't just redeploy through any issue, because the issue could be with customer data.
Generally, that mutable data is on externally mounted volumes or stores though and not in the container itself, right? Except for what is read into container memory and written out to temporary or work files.
True, but restarting a service, remounting all the data, waiting until everything has synced, etc., just to apply a security patch is a big downside of the immutable approach.
So fun story: I did something like this where the image was immutable. Due to a time-bomb type bug in the code, it created and killed hundreds (thousands?) of Azure VMs in the space of a couple of days. We never had to really pay for it above normal costs due to there always being the same number of running VMs.
Azure reached out though, and in so many words asked: wtaf are y’all doing and can you please stop?