While everyone initially blamed Microsoft and then quickly pointed the finger at CrowdStrike, I'd like to call out Microsoft, especially their Azure division, for making the recovery process unnecessarily difficult.
1) A key recovery step requires a snapshot to be taken of the disk. The Portal GUI is basically locking up, so scripting is the only way to do this for thousands of VMs. This command is undocumented and takes random combinations of strings as inputs that should be enums. Tab-complete doesn't work! See: https://learn.microsoft.com/en-us/powershell/module/az.compu...
E.g.: What are the accepted values for the -CreateOption parameter? Who knows! Good luck using this in a hurry. No stress, just apply it to a production database server at 1 am.
2) There has been a long-standing bug where VMs can't have their OS disk swapped out unless the replacement disk matches its properties exactly. For comparison, VMware vSphere has no such restrictions.
3) It's basically impossible to get to the recovery consoles of VMs, especially VMs stuck in reboot loops. The serial console output is buggy, often filled with gibberish, and doesn't scroll back far enough to be useful. Boot diagnostics is an optional feature for "reasons", etc.
4) It's absurdly difficult to get a flat list of all "down" VMs across many subscriptions or resource groups. Again, compare with VMware vSphere where this is trivial. Instead of a simple portal dashboard / view, you have to write this monstrous Resource Graph query:
Resources
| where type =~ 'microsoft.compute/virtualmachines'
| project subscriptionId, resourceGroup, Id = tolower(id), PowerState = tostring(properties.extended.instanceView.powerState.code)
| join kind=leftouter (
    HealthResources
    | where type =~ 'microsoft.resourcehealth/availabilitystatuses'
    | where tostring(properties.targetResourceType) =~ 'microsoft.compute/virtualmachines'
    | project targetResourceId = tolower(tostring(properties.targetResourceId)), AvailabilityState = tostring(properties.availabilityState)
  ) on $left.Id == $right.targetResourceId
| project-away targetResourceId
| where PowerState != 'PowerState/deallocated'
| where AvailabilityState != 'Available'
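
For anyone trying to follow what that query actually does, here is a rough sketch of the same logic in plain Python over made-up rows. It makes no Azure calls; the sample IDs, the field names on the dicts, and the down_vms helper are all hypothetical and only illustrate the left outer join plus the two filters.

```python
# Hypothetical rows standing in for the Resources and HealthResources tables.
VMS = [
    {"id": "/subscriptions/s1/resourceGroups/rg1/VM-A", "powerState": "PowerState/running"},
    {"id": "/subscriptions/s1/resourceGroups/rg1/vm-b", "powerState": "PowerState/running"},
    {"id": "/subscriptions/s2/resourceGroups/rg2/vm-c", "powerState": "PowerState/deallocated"},
    {"id": "/subscriptions/s2/resourceGroups/rg2/vm-d", "powerState": "PowerState/running"},
]
HEALTH = [
    {"targetResourceId": "/subscriptions/s1/resourceGroups/rg1/vm-a", "availabilityState": "Available"},
    {"targetResourceId": "/subscriptions/s1/resourceGroups/rg1/vm-b", "availabilityState": "Unavailable"},
    {"targetResourceId": "/subscriptions/s2/resourceGroups/rg2/vm-c", "availabilityState": "Unavailable"},
    # vm-d has no health record at all.
]


def down_vms(vms, health):
    """Return VMs that are not deallocated and not reported as Available."""
    # Index health rows by lower-cased resource ID, like tolower() in the query.
    state_by_id = {h["targetResourceId"].lower(): h["availabilityState"] for h in health}
    result = []
    for vm in vms:
        vm_id = vm["id"].lower()
        # Left outer join: a VM with no health row gets an empty state,
        # which still passes the "!= 'Available'" filter.
        state = state_by_id.get(vm_id, "")
        if vm["powerState"] != "PowerState/deallocated" and state != "Available":
            result.append({"id": vm_id, "powerState": vm["powerState"], "availabilityState": state})
    return result


if __name__ == "__main__":
    for row in down_vms(VMS, HEALTH):
        print(row["id"], row["powerState"], row["availabilityState"] or "(no health record)")
```

Note the side effect of the left outer join: a VM with no health record at all still shows up in the list, which is probably what you want in an outage.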