
Depending on which kernel version you are on, C5 (and M5) instances can be a real source of pain.

The disk is exposed as a /dev/nvme* block device, so I/O goes through a separate driver. Earlier versions of that driver had a hard limit of 255 seconds before an I/O operation times out. [0,1,2]

When the timeout triggers, it is treated as a hard failure and the filesystem gets remounted read-only. Meaning: if you have anything that writes intensively to an attached volume, C5/M5 instances are dangerous. We experimented with them for our early prometheus nodes. Not a good idea. Having the alerts for an entire fleet start flapping due to a seemingly nonsensical "out of disk, write error" monitoring node failure is not fun.[ß]

If you run stateless, in-memory only applications on them (preferably even without local logging), then you should be fine.

0: https://bugs.launchpad.net/ubuntu/bionic/+source/linux/+bug/...

1: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1729119

2: https://www.reddit.com/r/aws/comments/7s5gui/c5_instances_nv...

ß: We handle nodes dying. The byzantine failure mode of nodes suddenly spewing wrong data is harder to deal with.

Ubuntu has a specific kernel for AWS, and partners with AWS to optimise the kernel for AWS environments. Part of that is fixing issues exactly like this. That issue was fixed as per the bug that you linked.

Some more information from the original announcement: https://blog.ubuntu.com/2017/04/05/ubuntu-on-aws-gets-seriou...

(Disclaimer: I work at Canonical/Ubuntu, if that matters)

The EC2 User Guide includes documentation on how to avoid these issues:


No need for a kernel with the default already set - just put the timeout in your kernel cmdline options in GRUB. Make an AMI with that change, and you're good to go.
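For reference, the knob in question is the NVMe driver's io_timeout module parameter (nvme.io_timeout on older kernels, nvme_core.io_timeout on newer ones); a sketch of the GRUB change, assuming a recent kernel and the maximum 32-bit value that AWS's docs suggest:

```
# /etc/default/grub -- then run update-grub and bake the AMI from the result
GRUB_CMDLINE_LINUX="nvme_core.io_timeout=4294967295"

# After reboot, verify the value took effect:
#   cat /sys/module/nvme_core/parameters/io_timeout
```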

Edit: My apologies. I didn't read your comment carefully enough, and didn't realize you had specifically called out needing to make these changes, and that older kernel versions had the 255 maximum. Leaving the above for posterity, and so people can notice I'm a bad reader :)

Did you see these issues even with it set to 255? That seems like it should have been enough timeout if everything is working normally. Perhaps small GP2 volumes that are out of credits with a large queue depth might see this? I can't say I've run into it yet.

> Did you see these issues even with it set to 255?

Oh yes we did. To make sure we didn't just suffer from a single fluke, we read through the docs and changed the value to 255 after the first time. Then waited. Didn't have to wait long; the thing broke again in less than three weeks.

The workload was a pretty aggressively tuned Prometheus. At that point it went into compaction every two weeks, so it would have been doing _very_ heavy I/O for a few days.

Some extra details about our monitoring setup here: https://smarketshq.com/wait-what-is-my-fleet-doing-2e7b1b06f... (We have 10s scraping intervals for everything and 1s for our most critical, highly latency-sensitive services.)

I am curious, what kind of write characteristics can manage to saturate a 255s timeout on a storage device that does 10k+ IOPS and gigabytes per second of throughput? Normally writes slowing down leads to backpressure, because the syscalls issuing them take longer to return.

I can imagine some crazy random access, small-record, vectored IO from large thread pools. But that's not exactly common because most software that is IO-heavy tries really hard to avoid these things.

According to the bug here: https://bugs.launchpad.net/ubuntu/bionic/+source/linux/+bug/...

It was actually filed about EBS block disks using NVMe (on these new instances, a hardware card presents network-attached EBS volumes as a PCI-E NVMe device). Since this is a network block store, in certain failure cases a volume can stall for a period exceeding this timeout.

The idea of this change is to ensure once they come back the machine is left in a usable state.

Of course, I would not expect to see this on local NVMe disks, which is what they announced - that you can now get such instances with local disk as well as EBS.

Ah yeah, the article was about local NVMe, so the concern is probably not relevant here.

Unrelated: love your pwgen username

EBS probably has really terrible tail latencies.

Interesting - I’ve been using NVMe devices on Linux for a couple years now and never run into this problem. And an I/O timeout of 255 seconds seems really high to begin with. Is there frequently that much latency in the EBS storage backplane? (We also run c5.9xl instances and have not yet experienced the phenomenon you discuss.)

In any event, the instance storage is unlikely to run afoul of any such timeouts since it’s more or less directly attached (albeit virtualized) and there’s no SAN involved.

The nvme disks are local, not remote EBS. Latency will be the PCI bus.

EBS is exposed as nvme devices on c5 and m5 as well, which is what I assume otterley is talking about.

Presumably the same applies to i3.metal as well, though I have not yet personally checked.

Yes, the root volume on i3.metal is exposed as NVMe and is EBS-backed.

Sorry for that. The timeout behavior on earlier kernels is a bit of a pain.

There's a lot to love about NVMe, and timeouts are not actually part of the NVMe specification itself but rather a Linux driver construct. Unfortunately, early versions of the driver used an unsigned char for the timeout value and also had a pretty short timeout for network-based storage.

As mentioned elsewhere in the thread, recent AMIs are configured to avoid this problem out of the box.
