This is all well and good if you’re working on applications, which to be fair is most software.
I have yet to see a systematized agile approach that works for building infrastructure, things like utilities and system software: software that, if it breaks, breaks things in big ways. You can’t iterate on that and get feedback like you would with a CRUD app. Furthermore, you can’t time box that sort of software development into sprints in any way that’s not artificial and arbitrary. If you need to rearchitect the software to fit a new constraint imposed on the system, you’re basically starting a new project. If anyone knows of an agile way to manage that sort of software development, I’d like to hear it.
Unit tests, functional tests, automated CI/CD, etc all apply when building utilities or system software.
Things like the DORA metrics are very often applied to “things where if they break, they break in a big way”: they help teams design toward having the confidence to deploy without causing outages, toward limited impact when something does fail, and toward faster recovery.
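For a sense of how lightweight that can be, here’s a rough sketch in Go of how a team might compute two of those metrics, deployment frequency and change failure rate, from its own deploy log. The Deployment record, the numbers, and the whole setup are made up for illustration; this isn’t any official DORA tooling.

    package main

    import (
        "fmt"
        "time"
    )

    // Deployment is a hypothetical record a team might already have in its
    // deploy log; Failed means the deploy caused an incident, a rollback, or a hotfix.
    type Deployment struct {
        At     time.Time
        Failed bool
    }

    // doraSnapshot returns deploys per week and change failure rate over a window.
    func doraSnapshot(deploys []Deployment, window time.Duration) (perWeek, failureRate float64) {
        if len(deploys) == 0 {
            return 0, 0
        }
        failed := 0
        for _, d := range deploys {
            if d.Failed {
                failed++
            }
        }
        weeks := window.Hours() / (24 * 7)
        return float64(len(deploys)) / weeks, float64(failed) / float64(len(deploys))
    }

    func main() {
        now := time.Now()
        deploys := []Deployment{
            {At: now.AddDate(0, 0, -20)},
            {At: now.AddDate(0, 0, -12), Failed: true},
            {At: now.AddDate(0, 0, -3)},
        }
        perWeek, rate := doraSnapshot(deploys, 28*24*time.Hour)
        fmt.Printf("deploys/week: %.2f, change failure rate: %.0f%%\n", perWeek, rate*100)
    }

Even a crude snapshot like this, tracked over time, tells you whether your release practices are actually building that confidence or just adding ceremony.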
I agree on the timeboxing/sprints. I also think those are unfit for regular software development though.
For everything else, there are many techniques you can use, depending on what you're doing. They're generally the same ones used in agile and XP software development. How to apply them may differ, but the ideas are the same.
Blue/green deployments, canary deployments, expand/contract, isolated prod-like environments, and branching by abstraction are all great for reducing risk and blast radius when iterating. Test-first approaches, where applicable, help catch issues earlier.
It takes more work and more time than big-bang releases, but the risk of outages can be reduced to statistical error.
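For the branching-by-abstraction piece specifically, here’s a minimal sketch in Go of old and new implementations coexisting behind one interface; the scheduler names and the flag are made up for illustration.

    package sched

    // Scheduler is the abstraction both implementations sit behind.
    type Scheduler interface {
        Place(workload string) (node string, err error)
    }

    // legacyScheduler is the existing, battle-tested path; it stays untouched
    // while the replacement is built.
    type legacyScheduler struct{}

    func (legacyScheduler) Place(workload string) (string, error) {
        return "node-1", nil // stand-in for the current placement logic
    }

    // binPackScheduler is the rewrite, landed incrementally behind the same
    // interface and exercised by the same tests as the legacy path.
    type binPackScheduler struct{}

    func (binPackScheduler) Place(workload string) (string, error) {
        return "node-2", nil // stand-in for the new placement logic
    }

    // New picks the implementation from a flag, so the change can be rolled
    // out to a canary slice (and rolled back) without touching call sites.
    func New(useBinPack bool) Scheduler {
        if useBinPack {
            return binPackScheduler{}
        }
        return legacyScheduler{}
    }

Once the new path has run at 100% for long enough, the flag and the legacy type get deleted; the point is that the rewrite never has to land as a single big-bang change.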
Forgive the Scrum-centric view of the world here, but these are some thoughts from having worked this way on a workload orchestration system (back before we called them containers).
"Furthermore, you can’t time box that sort of software development into sprints in any way that’s not artificial and arbitrary." You absolutely can. Sprints are by their very nature artificial and arbitrary. They're there to establish a reporting interval, with the ideal report being "we have something to deliver" and the pathological report being "we pulled the andon cord because the facts changed in a profound way, and we need to regroup." Whether or not a sprint results in the ideal has the probability of a coin toss, and if you're wildly on either side of that, consider changing the rules of your sprint framework: make the interval longer/shorter, do less/more, take on less/more risk, whatever.
There's zero reason to sit down and decree, "A sprint is two weeks. We must deliver something every sprint." That isn't at all how any of this works. In fact, that's how to make the framework fight you and burn you out.
For systems software, the deliverables can very easily be formal specifications, integration tests, SDK examples, and prose describing the results of design feasibility experiments, all in addition to the runnable systems software itself. Documentation also fits here. Lots of other things do too.
(As an aside: If you can, pull your SDK examples from your integration tests. This will help your documentation not be stale.)
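Here’s a minimal sketch of that idea using Go’s testable examples; the client package, its module path, and the Submit call are made up for illustration. The Example function runs under go test, so it fails as soon as its output drifts, and godoc publishes it verbatim as the SDK example.

    // client.go -- hypothetical SDK surface, trimmed to one call for the sketch.
    package client

    import "errors"

    type Client struct{ baseURL string }

    func New(baseURL string) *Client { return &Client{baseURL: baseURL} }

    // Submit would normally call the orchestrator API; stubbed out here.
    func (c *Client) Submit(image string) (string, error) {
        if image == "" {
            return "", errors.New("empty image")
        }
        return "wk-1", nil
    }

    // example_test.go -- doubles as an integration test and the published doc example.
    package client_test

    import (
        "fmt"

        "example.com/orchestrator/client" // hypothetical module path
    )

    func ExampleClient_Submit() {
        c := client.New("http://localhost:8080")
        id, err := c.Submit("nginx:latest")
        if err != nil {
            fmt.Println("submit failed:", err)
            return
        }
        fmt.Println("submitted workload:", id)
        // Output: submitted workload: wk-1
    }

Other ecosystems have rough equivalents (doctests, compiled doc snippets), but the principle is the same: the published example and the test are one artifact.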
It's also worth noting that stories fit in well, but they must be handled correctly. A story is a starting place for a conversation with the stakeholders, not itself a deliverable. It's a placeholder so that your product owner can do some digging, figure out what the stakeholders need, and turn those needs into a product backlog item with acceptance tests for the thing that ultimately gets delivered.
Oh, also, if a product backlog item makes it to the current sprint without acceptance tests, that's a beach day. (Or perhaps a day to get caught up on support tickets while the product owner straightens that out, but going to the beach and doing some stunt kite flying is way more fun and comes highly recommended.)
It could very well be that design feasibility experiments change the facts, to the point where a lot of rework is needed. That's fine. The entire point of the exercise is to try to clamp down on the risk inherent in rework (if not eliminate it entirely, but that's often impossible): by stopping the current sprint in its tracks, it communicates to everyone that extra care is needed in order to proceed. The analysis of what went wrong and how to resolve it can happen outside the sprint framework; it's probably inappropriate to get too meta in the product backlog.
To quote Fred Brooks: "Plan to throw one away." You will anyway. You'll likely do it a lot. That's honestly probably best for systems software. Just make sure this expectation is communicated clearly and actually planned!
Also, kill your bug tracker. This is especially true for systems software and other kinds of software that provide a platform for other folks to run code: bugs in your software will amplify the bugs in your users' applications. If you need to, let the bug kill the sprint; it's better to let it happen now than to have to do a lot more rework later.
As a corollary, let the bug cause the facts to change so profoundly as to require you to re-plan your product backlog. This is fine! This is the entire point of the exercise. You've identified risk to the product, and the framework is in place to give you the tools and rituals to handle it.
(While I'm here: you're hopefully not re-planning your product backlog at the end of every sprint. That's ideally at most a once-yearly activity, but if you're finding that your management is requiring you to re-order things every few weeks at their whim, guess where you just identified the risk to the product.)
It does help. It sounds like there are a lot of areas where agile can be modified to fit the work being done.
Unfortunately, I can implement absolutely zero of your ideas, because every single one of them requires leadership to change how they perceive the way work is done. It is absolutely unacceptable for a developer to inconvenience leadership. Any attempt to do that will be brushed off, coded as unprofessional, and lead to dismissal.
Yeah, agile is not really prescriptive about much; it's just a way of thinking about how to get a handle on the risk and complexity in delivering a product that your customer will find useful.
Unfortunately, Scrum, a major framework that's built around those ideas, is also kind of, uh, syndicalist. It requires a lot of worker control and self-organization to pull off well. XP ("extreme programming"), another framework that's more prescriptive about the engineering practices than the project management side of things, also suffers from this problem.