

Warn HN: Resize of Digital Ocean instance corrupted filesystem beyond repair - oskarpearson

Careful with resizing on Digital Ocean...

A recent resize of a Digital Ocean instance corrupted the filesystem beyond repair. Support have offered me an $80 credit and offered to restore from a backup.

You should probably snapshot your instances before resizing, or migrate to a new instance manually.

(Thankfully I've built a replacement box from Ansible in the meantime.)

The upgrade process appears to have involved copying to different hardware across the network, since the resize operation took over 15 minutes. That copy seems to have been incomplete, which led to a completely unrecoverable filesystem.

On coming back up, the system displayed "DOROOT does not exist after resize". Running a filesystem check scrolled tens of thousands of e2fsck "fixes" for a period of 12 minutes.

As expected, the end result is that all "files" on the filesystem were in /mnt/lost+found with random names, and the data in them is no doubt corrupt too.

Digital Ocean support does not appear to be able to re-copy or review the previous block device to determine the source of the problem. They also don't appear to have logs of the resize operation.

Sure - it's always possible the filesystem was irretrievably corrupted before reboot - but I think it's pretty unlikely. Given that I've not been doing things like 'dd if=/dev/urandom of=/dev/vda1' on there, that would probably indicate a hardware fault on their side anyway.

It's worth noting that I rebooted the box successfully a few minutes before the resize, so a (journaled) e2fsck ran at that point. The filesystem was at least usable a few minutes before the resize.

(Ticket #633210 in case anyone from Digital Ocean wants to investigate.)
======
cat9
Nice of them to give you a credit, and a good decision from a customer service
standpoint, but I doubt they're at fault in any real way. It could be any
number of things, many of which are completely out of their control to do more
than mitigate and minimize, and thus part of working with real computing
systems at scale.

Cultivate healthy paranoia that systems will fail - because eventually, they
will, particularly if you run 100 of them or run them for several years or any
other "you have to survive 1000 coin tosses to miss the error" combinatoric
series. And always make a backup before system-changing operations like
resizing a partition or reprovisioning a VM.
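
To put rough numbers on the coin-toss point, here is a minimal sketch. It
assumes failures are independent and equally likely per event, which real
systems only approximate, and the 0.1% per-event figure is a made-up
illustration, not a measured rate for any provider.

```python
def p_at_least_one_failure(p_single: float, n_events: int) -> float:
    """Chance that at least one of n independent events fails,
    given each one fails with probability p_single."""
    return 1.0 - (1.0 - p_single) ** n_events


# Hypothetical per-event failure chance of 0.1%, for illustration only.
P_SINGLE = 0.001

for n in (1, 100, 1000):
    chance = p_at_least_one_failure(P_SINGLE, n)
    print(f"{n:>5} events: {chance:.1%} chance of at least one failure")
```

Under those assumptions a one-in-a-thousand event becomes more likely than not
to happen somewhere once you have 1000 chances at it, which is why the backup
matters even when each individual resize is very likely to succeed.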

