Debian-installer, mdadm configuration and the Bad Blocks Controversy (strugglers.net)
94 points by zdw on Sept 15, 2020 | 49 comments

> If you’re on something with multiple virtual consoles (like if you’re sitting in front of a conventional PC) then you could switch to one of those after you’ve entered the MD configuration part and modify /tmp/mdadm.conf then. I don’t have that option because I’m on a serial console.

debian-installer starts a screen session these days when installing over a serial console.

You can reach the previous and next consoles with Ctrl-a n and p.

Hi, article author here. That's interesting! I'd possibly never noticed this because my serial console session has always been itself inside screen (now tmux with screen bindings), so doing that would only ever have sent me to my own next window!

I'll have to try it again next time I am in there, with a double escape.

> Currently the only way to remove a BBL from an array component is to stop the array and then assemble it

This worked for me without stopping the array:

  mdadm /dev/md127 --fail /dev/sda --remove /dev/sda --re-add /dev/sda --update=no-bbl
Not ideal, because it degrades the array for a moment, but hey.

Hi, article author here. I just tried that out on an Ubuntu 18.04 machine but it didn't work:

  $ sudo mdadm --fail /dev/md0 /dev/sdb1 --remove /dev/sdb1 --re-add /dev/sdb1 --update=no-bbl                                        
  mdadm: set /dev/sdb1 faulty in /dev/md0   
  mdadm: hot removed /dev/sdb1 from /dev/md0
  mdadm: --re-add for /dev/sdb1 to /dev/md0 is not possible
  $ sudo mdadm --add /dev/md0 /dev/sdb1 --update=no-bbl
  mdadm: --update in Manage mode only allowed with --re-add.
  $ sudo mdadm --add /dev/md0 /dev/sdb1
  $ sudo mdadm --examine-badblocks /dev/sdb1
  Bad-blocks list is empty in /dev/sdb1
Any ideas why? md0 is a simple RAID-1 metadata version 1.2 array.

Hmm, maybe some race condition that sometimes breaks --re-add immediately after --remove? It might be worth separating it into two commands, i.e.:

  mdadm /dev/md0 --fail /dev/sdb1 --remove /dev/sdb1
  mdadm /dev/md0 --re-add /dev/sdb1 --update=no-bbl
I have a simple RAID-1 with metadata version 1.2, too.

My testing was done on Debian unstable (Linux v5.8, mdadm v4.1).

I added


for you.

Apologies I haven't gotten to your PRs yet, but there is a ticket now in our internal development queue to review and merge those.

Interesting. What's the sfdisk for? Is that why my attempts to use --re-add aren't working?

I also tried "mdadm --zero-superblock /dev/sdb1" to make mdadm forget that was ever an array member, but that didn't get me any further.

I use Debian so this won't help me directly, but once I work out why I can't re-add it will be possible to use something similar in a postinst hook to rebuild all the arrays.

The sfdisk is because as part of our kickstart file we only create the first partition. After removing the disk from the raid we repartition it.

I don't see why this couldn't fix up a RAID device generated by the debian installer. The device name could be parameterized if it's not always the same.

We run this script before there's any real data on the device so loss of redundancy for a brief moment is not a huge deal - md0 is a very small device so it doesn't take long to resync.

But why do you repartition it?

Doesn't putting the exact same partition table on a device that already has a partition table result in no actual changes?

We repartition it because we're adding partitions.

We have one kickstart file that we use regardless of medium type. For SSDs, we overprovision and leave unused space at the end. Some brands of SSDs were failing before we did that. We don't need to overprovision hard drives.

You could remove the repartitioning and it would do the right thing for your use case.
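As a sketch of that kind of repartitioning step (the layout, sizes, and device name below are made-up placeholders, not the poster's actual kickstart; the `type=raid` alias needs a recent util-linux sfdisk):

```shell
# Hypothetical sketch: after pulling /dev/sdb out of the array, rewrite its
# partition table, deliberately leaving space unallocated at the end of the
# SSD for overprovisioning.
sfdisk /dev/sdb <<'EOF'
label: gpt
size=4GiB, type=raid
size=400GiB, type=raid
EOF
# anything after the last partition stays unused (overprovisioned)
```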

Ah okay. It seems that my re-adds were failing on arrays that don't have a bitmap. I was testing it on small arrays that don't get a bitmap by default.
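A sketch of adding the bitmap and retrying, assuming the same array and member names as above:

```shell
# Give the array an internal write-intent bitmap so --re-add has
# something to resync from (array/member names are assumptions):
mdadm --grow /dev/md0 --bitmap=internal

# then the earlier sequence should work:
mdadm /dev/md0 --fail /dev/sdb1 --remove /dev/sdb1
mdadm /dev/md0 --re-add /dev/sdb1 --update=no-bbl
```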

Bad blocks on hard disks are also often remapped when written to. However, in my (limited) experience, once you get bad blocks the disk often deteriorates further. So this bad-blocks feature is probably not as useful, because the hard disk firmware is the only place that really knows about bad blocks, and it "lies" to you.

However, for deciding whether an old disk is still usable, I've found it helps to overwrite the whole disk multiple times with the badblocks utility (https://linux.die.net/man/8/badblocks). If the disk passes multiple runs without generating new bad blocks, it's often still usable.
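A sketch of that destructive write test (the device name is a placeholder, and this erases everything on the disk):

```shell
# -w: destructive write-mode test (writes and verifies four patterns),
# -s: show progress, -v: verbose.
# WARNING: wipes the disk. Only run on a disk with no data you care about.
for run in 1 2 3; do
    badblocks -wsv /dev/sdX
done
```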

linux.die.net man pages tend to be badly outdated. And you can't easily tell this is the case, because they strip version information.

For example, this one doesn't document the -B option, which was added in 2011(!):


Better man page link:


If you use Debian, I would suggest https://manpages.debian.org/stable/e2fsprogs/badblocks.8.en.... (where you can replace “stable” with whatever designator you like.)

> However, in my (limited) experience, once you get bad blocks the disk often deteriorates further.

This used to be my experience as well. However I've had a couple of WD Reds that developed a few (~20) bad sectors, one two years ago the other three. Both have been stable since, and I run periodic scrubs on them.

When the first one happened I had planned to swap it out, but was a bit low on cash right then so waited a bit. Then I saw it didn't deteriorate further, so I got curious and decided to wait. Still waiting.

Feels like handling OOM; an abstraction breakdown you think you can cleverly work around but is actually not easily handled at all.

Generally, a small handful of bad blocks (and they come in groups due to the physical format and sector size emulation) aren't an indicator of imminent trouble. If they can be written to properly, then that could just be random data corruption caused by e.g. an unlucky hard shutdown, vibration, or other random thing. If they can't be written to, it could just be very minor drive damage or a weak area of the platter (and in this case, the drive will remap them and increase the bad block count in the SMART data).

Now if you start getting into the hundreds, then yeah, the drive is dying.
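One way to keep an eye on those counts is via SMART; a sketch using smartmontools (the device name is an assumption, and `smart_counts` is just a hypothetical helper name):

```shell
# Print the raw values of the two attributes that track sectors already
# remapped and sectors pending remap.
smart_counts() {
    awk '/Reallocated_Sector_Ct|Current_Pending_Sector/ { print $2, $NF }'
}

smartctl -A /dev/sda | smart_counts
```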

Lots of drives don't seem to measure their own power supply... That means if you run them on a laptop with a slightly out of spec power supply, you can get the drive marking hundreds of bad sectors, yet when you put them in another machine they're fine again.

If you're designing something as complex as a drive, you would have thought they'd have a bit of electronics to test the power supply and refuse to work (with some relevant error code) if it's out of spec.

This is insane. I've been using md-raid for over a decade now and I had no idea this crazy feature existed... Now I wonder if it has been involved in some of the trouble I've had in the past.
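A quick way to check whether the feature has quietly recorded anything on your own arrays is to dump each member's bad-blocks log; a sketch, assuming an array named md0 (the member parsing is a quick hack that only handles simple sdXN-style names):

```shell
# Pull member names like "sda1" out of an mdstat line such as:
#   md0 : active raid1 sdb1[1] sda1[0]
members_of() {
    grep "^$1 :" | grep -o '[a-z]*[0-9]*\[[0-9]*\]' | sed 's/\[.*//'
}

# Print the bad-blocks log stored in each member's superblock.
for dev in $(members_of md0 < /proc/mdstat); do
    mdadm --examine-badblocks "/dev/$dev"
done
```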

> Even if the particular part of the device is unreadable, the operating system is supposed to try to write the correct data over the top.

According to md(4) man page, that's what it does. Only if the write fails, it is added to the bad block list.

FWIW, I've also encountered an issue related to BBL - in my case multiple RAID6 components got bogus bad blocks (no idea from where, maybe a lower-level controller issue or driver bug?) so there were logical sectors that had insufficient non-"bad" backing sectors and the RAID array reshape process code was seemingly not prepared for that, entering an infinite loop with 100% CPU use and locking all use of the array.

I know this is going to sound trite, but this is precisely the reason I don't use mdadm

I really don't like LVM either, but in some cases it's the only real way to join larger hardware arrays together (for example, some arrays will not let you stripe multiple RAID 7s, instead insisting that you have a single 50-drive RAID 7 array. No dice.)

This is where ZFS really does shine. The control, visibility and documentation are just wonderful. It's almost as good as reading a redbook.

For the hobbyist I don't see any reason why you'd really want to have an mdadm array, barring the slight edge in performance when you are running at >75% capacity.

You should see what some of the other alternatives do. I just decommissioned a server and pulled the drives out of it. One of them had SMART data indicating that it reached the "drive failing, replace now" stage six months prior. The hardware RAID controller in the server had been content to give no indication of this at all.

Had another one where the drives were in two external containers and one of them lost power for a moment. The hardware RAID controller thereby marked that half of the drives as failed. This was too many failures to run the array, but it also had no mechanism to assemble an array using the perfectly operational drives it had marked as failed, so that was a restore from backup.

Meanwhile mdadm now supports --write-journal which moots the only real advantage hardware RAID ever held.
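For reference, a sketch of creating an array with a write journal (device names are placeholders; the journal device should be a fast, durable SSD):

```shell
# --write-journal closes the RAID-5/6 write hole by logging writes to a
# dedicated journal device before committing them to the data disks.
mdadm --create /dev/md0 --level=5 --raid-devices=3 \
      /dev/sdb1 /dev/sdc1 /dev/sdd1 \
      --write-journal /dev/sde1
```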

> For the hobbyist I don't see any reason why you'd really want to have an mdadm array

The "don't even dare to use ZFS without dedicating 8-16 GB of memory to it" warnings all over the ZFS documentation online have served as a very, very good reason for me not to use ZFS for anything at all.

I'll stick with tooling that lets my homelab actually have enough RAM to run the applications I want to run.

> don't even dare to use ZFS without dedicating 8-16 GB of ECC [!] memory to it

Fixed that for you.

Now, I've since built a storage server for my homelab using ZFS, but mdadm is far more accessible. It also feels simpler: I know what's on the drives and how it is stored. If something strange happens with ZFS, I don't have any hope except backups. ZFS also lacks support in most distributions and has a somewhat obscure encryption capability.

I don't mean to bash ZFS here - it's great. But the GP feels like it's saying "I don't see why would anyone not want to run a dedicated kubernetes cluster with HA for running a program on Linux".

> GP feels like it's saying "I don't see why would anyone not want to run a dedicated kubernetes cluster with HA for running a program on Linux".

That's not quite right.

It's more that the documentation, command-line interface and general user setup are far friendlier than mdadm's. I've been a sysadmin for many years, and have looked after innumerable file servers, from clustered XFS (!) through many years of Lustre, plain old XFS/ext2/3/4 on hardware RAID, to NetApps & Isilons, and all sorts of flash-in-the-pan appliances.

Barring GPFS, the only file system that has documentation and tooling that is simple to understand and doesn't immediately bite you in the arse is ZFS.

Kubernetes is neither simple, easy, nor quick to set up or run.

That warning is the same for any file server. If you want performance, then you need RAM; otherwise you don't have enough space to keep hot blocks cached.

ZFS works as well as any FS in places with 8 gigs of RAM. Hell, I even run a KVM node with just 16 gigs of RAM. It's working just fine.

> if you want performance, then you need ram

This is not what the documentation or the thousands of posts from ZFS experts out there imply. The direct implication is that ZFS REQUIRES that much memory for day to day mundane operation.

So the reason you don't use mdadm is that it keeps track of bad blocks by default (although that can be disabled), and that when the Debian installer sets up arrays it keeps its config file in /tmp and blows away any changes you make to it from a shell on return to the installer, making it impossible to set up an md array that doesn't track bad blocks when installing over a serial console (although if you aren't on a serial console, you can open up another terminal to do it)?

That's precisely the reason you don't use mdadm?

I think this has nothing to do with MD vs ZFS.

The key issue is that I think it's crazy to tinker around with 'bad blocks' in the first place.

A hard drive with bad blocks that it cannot conceal from the end-user is broken and not to be trusted. Replace it.

Because as a hobbyist, the fact that I can expand my RAID6 array one drive at a time is, for me, mdadm's killer app.
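That one-drive-at-a-time expansion can be sketched as follows (array and device names are assumptions):

```shell
# Add the new disk as a spare, then reshape the array to use it as a
# data device (here growing from 4 to 5 devices).
mdadm --add /dev/md0 /dev/sde1
mdadm --grow /dev/md0 --raid-devices=5

# watch the reshape progress:
cat /proc/mdstat
```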

Ugh, please don't "--assume-clean" on new RAID arrays. Let the OS build the array from the ground up, then scrub it when it's done.

from the documentation: "assume-clean: Tell mdadm that the array pre-existed and is known to be clean. It can be useful when trying to recover from a major failure as you can be sure that no data will be affected unless you actually write to the array. It can also be used when creating a RAID1 or RAID10 if you want to avoid the initial resync, however this practice — while normally safe — is not recommended. Use this only if you really know what you are doing."

It's on new devices in an installer context for a RAID-1 as the text you quote describes, so yes I do know what I'm doing. But fair enough, it might give other people bad ideas, so I'll remove it from the examples.

> the mdadm binary in d-i is compiled to have its config at /tmp/mdadm.conf. I don’t know why, but probably there is a good reason.

It’s presumably because the rest of the installer filesystem hierarchy is mounted read-only, I would think.

When you're booted in the installer environment, /etc is writable. Some of the sample code in the article shows writing a file in /etc, and the file write works fine, mdadm just doesn't read the file.

I believe the reason is that mdadm in the installer exists to create arrays for the installed system, so the installer's RAID configuration needs to supply a configuration file, and for some reason that's handled by having the installer's RAID configuration write it to /tmp/mdadm.conf and having mdadm read it from there. (Rather than, say, writing it to a temporary file and directing mdadm to read that temporary file.)

Well, /etc/mdadm.conf is supposed to describe the running system.

Now, the installer doesn't use md, but from a strict philosophical perspective this makes sense.

> > the mdadm binary in d-i is compiled to have its config at /tmp/mdadm.conf. I don’t know why, but probably there is a good reason

As others have said, /etc is writable during the installer.

More to the point: whatever the other reason is, it's likely a bad one. An app not using its own config file in the standard location, and losing data by clobbering its own config, is, from experience (Red Hat used to do this a lot and our customers HATED it), a mistaken decision.

The installer filesystem should be an initramfs, so it should all be in RAM and writable (at least in the netboot case, but I don't see why CDs would do it differently; I expect the extra storage is just used for packages), at least for d-i stuff.

Even for proper (not install only) live CDs where the rootfs is really read only, I don't think I've ever seen a distro not mount an overlayfs on top so that changes can be made in RAM. Too much stuff breaks on your average distro if you really try to have a read only rootfs.
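The usual overlay arrangement can be sketched as follows (the mount points are illustrative, not what any particular distro uses):

```shell
# tmpfs provides the writable layer; overlayfs merges it over the
# read-only rootfs so changes land in RAM.
mount -t tmpfs tmpfs /rw
mkdir -p /rw/upper /rw/work
mount -t overlay overlay \
      -o lowerdir=/ro,upperdir=/rw/upper,workdir=/rw/work \
      /merged
```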

I didn't know about this "feature", but to be honest I have learned years ago not to use the partitioning features of the debian installer. It's just another crude tool to fight with, while the real tools are just a console away.

Here's a list of the things the debian-installer can't do, or at least couldn't do when I needed it:

- accurate partitioning, such as choosing between GPT or MBR partitioning

- filesystem features (btrfs subvolumes, ext4 flags)

- creation of crypt volumes with keyfiles

- having user control over the names of crypt devices

- using an existing crypt device WITHOUT formatting it

- mdadm write-intent bitmap enabling/disabling

By choosing "expert" mode you can choose between any of the partition table types (many more besides MBR and GPT). In standard mode it will pick MBR if none of your devices are bigger than the MBR partition limit (2TB?), otherwise it will pick GPT, without offering you a choice.

You can choose expert mode at the start when you boot the installer or you can switch to it from the menu once it's started.

I think you may be right about the other things, but this is not uncommon amongst OS installers. It is very hard to offer the full range of configurations available from all of the tools used to install a Linux system, without just putting the user at a shell prompt and letting them get on with it. Luckily that option is still there!

Instead of avoiding a BBL at creation time, it seems like it might be less effort to just let it go ahead and create one then, and then remove it before first putting the array into production.

The removal procedure described in the article seems pretty straightforward. The only problem with it seems to be that you have to take your server out of production while you do it, which only applies after the server has entered production.

To remove the BBL from the devices that make up the array that the root filesystem is on you would need to boot into a rescue environment and assemble it there.

While the debian-installer does provide the rescue environment that you need to do this, it is by no means trivial.
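For completeness, the stop-and-reassemble sequence looks roughly like this from a rescue shell (device names are assumptions):

```shell
# From a rescue environment where md0 is not the running root filesystem:
# stop the array, then reassemble it while stripping the bad-blocks log
# from each member's superblock.
mdadm --stop /dev/md0
mdadm --assemble /dev/md0 --update=no-bbl /dev/sda1 /dev/sdb1
```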

Speaking as the article author I find it way way simpler to create the arrays from the shell at the start, rather than have to remember to reboot into d-i again, choose the rescue mode, tell it not to automatically assemble, drop to shell and assemble there. It's also faster — most of my servers take more than 60 seconds before even the serial-over-lan starts to display output!

That won't work in automated provisioning of "bare metal". It needs to happen automatically during mass-provisioning of systems.

There is another discussion which may be worth having: why was this feature enabled by default without considering how the software is used? What process was missing? Is this still a problem, and if so is it pervasive to Linux kernel development or is it specific to md?

If your hard drive has bad blocks it really needs to be replaced, it cannot be trusted. You should never have to deal with this.

I don't understand the mentality of dealing with bad blocks in the first place.

Storing data is easy. Storing data reliably is a whole other thing.

To add to this, even though bad blocks are allowed by the contract between a device and the OS, because they are so rare, all code paths that deal with them are riddled with bugs.

For example, if you power off a disk improperly, all un-acked writes are allowed to contain a mix of the new data, the old data, or an unreadable sector.

If, on Linux, you now try to read one of those unreadable sectors, you will get an I/O error, as designed. If you then try to write that sector, you won't be able to. Why? The kernel page cache works in 4 KiB pages, whereas the disk has 512-byte sectors, so to write a single disk sector the kernel first needs to read the neighbouring sectors to populate the cache page.

That pretty much means every time you power off a disk improperly, you'll get a bunch of unreadable data and filesystem errors, even though the filesystem is journaling so should be able to cope with a sudden poweroff just fine.
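A commonly cited workaround is to rewrite the affected sector with O_DIRECT so the page cache (and the forced read of the neighbouring sectors) is bypassed entirely; a sketch, where the device and sector number are placeholders and the sector's old contents are destroyed:

```shell
SECTOR=12345   # placeholder: the LBA reported in the kernel's I/O error

# bs=512 writes exactly one native sector; oflag=direct bypasses the
# 4 KiB page cache; conv=notrunc leaves the rest of the device alone.
dd if=/dev/zero of=/dev/sdX bs=512 count=1 seek="$SECTOR" \
   conv=notrunc oflag=direct
```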

Every hard disk has bad blocks and a reserve of good blocks to allocate from when one of the bad blocks is written to. The drive's firmware also has an internal list of known bad blocks discovered during the low-level format performed at the factory, which is why consumers should never perform a low-level format again, unless they are aware of this and know how to write the updated bad-blocks list back into the drive.

Either way, bad blocks on a hard disk are an unavoidable reality. The same is true for SSDs. It is normal and expected to run into bad blocks during ordinary use of the drive, whereupon correctly programmed firmware will remap the pointer to a good block from the reserve, and either tell the OS driver to attempt the write again (if the drive's write cache has been disabled, as it should be), or silently remap to the good block and write it with the data from the drive's write cache. If power is lost during this, no amount of redundancy will save you. The only solution is ZFS or Oracle ASM.

It has always been this way, and will always be like this as long as we don't have media which cannot have bad "blocks", for lack of a better term.

As the article tries to state, the problem isn't that bad blocks can happen. The problems with md's bad blocks log are:

- Most of the time the entries don't correspond to actual bad blocks.

- It's buggy because it copies BBL between devices leading to another instance of "says the blocks are bad but they actually aren't"

- Once you've worked out that the entries are bogus it is still very hard to remove the entries or the BBL as a whole

- It's overly quiet in what it does. I monitor my syslogs but many people don't. There are many documented instances of people carrying entries in a BBL for years without knowing.

- Once MD thinks there is a bad block it renders that block on that device useless for recovery purposes. So in a two device array you just lost redundancy.

So this article has very little to do with the concept of bad blocks; it's about how md's BBL feature can be dangerous and why I don't want it enabled until it's fixed.
