The disk is exposed as a /dev/nvme* block device, and as such I/O goes through a separate driver. Earlier versions of the driver had a hard limit of 255 seconds before an I/O operation times out. [0,1,2]
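For what it's worth, you can usually check the value the running kernel is using via the nvme_core module parameter (the sysfs path can differ slightly between kernel versions; 30 seconds is the usual stock default):

    $ cat /sys/module/nvme_core/parameters/io_timeout
    30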
When the timeout triggers, it is treated as a hard failure and the filesystem gets remounted read-only. Meaning: if you have anything that writes intensively to an attached volume, C5/M5 instances are dangerous. We experimented with them for our early Prometheus nodes. Not a good idea. Having the alerts for an entire fleet start flapping because a monitoring node fails with a seemingly nonsensical "out of disk, write error" is not fun. [ß]
If you run stateless, in-memory-only applications on them (preferably even without local logging), then you should be fine.
ß: We handle nodes dying. The byzantine failure mode of nodes suddenly spewing wrong data is harder to deal with.
Some more information from the original announcement:
(Disclaimer: I work at Canonical/Ubuntu, if that matters)
No need for a kernel with the default set - just put it in your kernel cmdline options in grub. Make an AMI with the change after that, and you're good to go.
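Roughly like this (a sketch, assuming an Ubuntu-style grub setup; 4294967295 only works on kernels where the parameter is no longer an unsigned char - on older kernels 255 is the highest you can set):

    # /etc/default/grub - append to whatever options are already there
    GRUB_CMDLINE_LINUX_DEFAULT="quiet splash nvme_core.io_timeout=4294967295"

    # regenerate the grub config and reboot
    $ sudo update-grub

On non-Debian/Ubuntu distros the regeneration step is typically grub2-mkconfig -o /boot/grub2/grub.cfg instead.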
Edit: My apologies. I didn't read your comment carefully enough, and didn't realize you had specifically called out needing to make these changes, and that older kernel versions had the 255 maximum. Leaving the above for posterity, and so people can notice I'm a bad reader :)
Did you see these issues even with it set to 255? That seems like it should have been a long enough timeout if everything is working normally. Perhaps small GP2 volumes that are out of credits with a large queue depth might see this? I can't say I've run into it yet.
Oh yes we did. To make sure we didn't just suffer from a single fluke, we read through the docs and changed the value to 255 after the first time. Then waited. Didn't have to wait long; the thing broke again in less than three weeks.
The workload was a pretty aggressively tuned Prometheus. At that point compaction kicked in after two weeks, so it would have been doing _very_ heavy I/O for a few days.
Some extra details about our monitoring setup here: https://smarketshq.com/wait-what-is-my-fleet-doing-2e7b1b06f... (We have 10s scraping intervals for everything and 1s for our most critical, highly latency-sensitive services.)
I can imagine some crazy random access, small-record, vectored IO from large thread pools. But that's not exactly common because most software that is IO-heavy tries really hard to avoid these things.
It was actually filed about EBS block devices using NVMe (on these new instances, there is a hardware card that presents network EBS volumes as a PCIe NVMe device). In certain failure cases, since this is a network block store, they can fail for a period of time exceeding this timeout.
The idea of this change is to ensure that once they come back, the machine is left in a usable state.
Of course, I would not expect to see this on local NVMe disks, which is what they announced - that you can now get such instances with local disks as well as EBS.
In any event, the instance storage is unlikely to run afoul of any such timeouts since it’s more or less directly attached (albeit virtualized) and there’s no SAN involved.
Presumably the same applies to i3.metal as well, though I have not yet personally checked.
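For anyone who wants to check which devices on a given instance are EBS-backed versus instance storage, the model string the device reports tells them apart. Illustrative (and trimmed) output, assuming nvme-cli is installed; the serial numbers here are made up:

    $ sudo nvme list
    Node          SN                    Model
    /dev/nvme0n1  vol01234567890abcdef  Amazon Elastic Block Store
    /dev/nvme1n1  AWS10382E5D7441494EC  Amazon EC2 NVMe Instance Storage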
There's a lot to love about NVMe, but timeouts are not actually part of the NVMe specification itself; they're a Linux driver construct. Unfortunately, early versions of the driver used an unsigned char for the timeout value and also had a pretty short default timeout for network-based storage.
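A quick way to tell whether a given kernel is affected is to try writing a value above 255 to the parameter; where it's still stored as an unsigned char, the write is rejected (exact error text varies by kernel and tooling):

    $ echo 300 | sudo tee /sys/module/nvme_core/parameters/io_timeout
    tee: /sys/module/nvme_core/parameters/io_timeout: Numerical result out of range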
As mentioned elsewhere in the thread, recent AMIs are configured to avoid this problem out of the box.