debian-installer starts a screen session these days when installing over a serial console.
You can reach the previous and next consoles with Ctrl-a n and p.
I'll have to try it again next time I am in there, with a double escape.
This worked for me without stopping the array:
mdadm /dev/md127 --fail /dev/sda --remove /dev/sda --re-add /dev/sda --update=no-bbl
$ sudo mdadm --fail /dev/md0 /dev/sdb1 --remove /dev/sdb1 --re-add /dev/sdb1 --update=no-bbl
mdadm: set /dev/sdb1 faulty in /dev/md0
mdadm: hot removed /dev/sdb1 from /dev/md0
mdadm: --re-add for /dev/sdb1 to /dev/md0 is not possible
$ sudo mdadm --add /dev/md0 /dev/sdb1 --update=no-bbl
mdadm: --update in Manage mode only allowed with --re-add.
$ sudo mdadm --add /dev/md0 /dev/sdb1
$ sudo mdadm --examine-badblocks /dev/sdb1
Bad-blocks list is empty in /dev/sdb1
mdadm /dev/md0 --fail /dev/sdb1 --remove /dev/sdb1
mdadm /dev/md0 --re-add /dev/sdb1 --update=no-bbl
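A hedged sketch of that two-step removal as a loop over an array's members (the array and device names are examples, and the `run` wrapper echoes the mdadm commands instead of executing them unless DRY_RUN=0):

```shell
#!/bin/sh
# Sketch: strip the bad-blocks log from each member of an array in
# turn, so redundancy is only reduced by one device at a time.
# ARRAY and MEMBERS are example names - adjust for your system.
ARRAY=${ARRAY:-/dev/md0}
MEMBERS=${MEMBERS:-/dev/sdb1}

run() {
    # Echo rather than execute by default, so the sketch is safe to
    # read and test; set DRY_RUN=0 to run the mdadm commands for real.
    if [ "${DRY_RUN:-1}" = 1 ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

for dev in $MEMBERS; do
    run mdadm "$ARRAY" --fail "$dev" --remove "$dev"
    run mdadm "$ARRAY" --re-add "$dev" --update=no-bbl
    # Wait for the resync to finish (watch /proc/mdstat) before
    # moving on to the next member.
done
```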
My testing was done on Debian unstable (Linux v5.8, mdadm v4.1).
Apologies I haven't gotten to your PRs yet, but there is now a ticket in our internal development queue to review and merge them.
I also tried "mdadm --zero-superblock /dev/sdb1" to make mdadm forget that it was ever an array member, but that didn't get me any further.
I use Debian so this won't help me directly, but once I work out why I can't re-add, it will be possible to use something similar in a postinst hook to rebuild all the arrays.
I don't see why this couldn't fix up a RAID device generated by the debian installer. The device name could be parameterized if it's not always the same.
We run this script before there's any real data on the device so loss of redundancy for a brief moment is not a huge deal - md0 is a very small device so it doesn't take long to resync.
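A postinst hook in that spirit would first need to discover the active arrays; here's a minimal sketch that parses /proc/mdstat (the awk pattern is my assumption about the usual mdstat layout, not something from the thread):

```shell
#!/bin/sh
# List active md arrays by parsing /proc/mdstat. Array lines look like
# "md0 : active raid1 sdb1[1] sda1[0]", so we take the first field of
# any line whose second field is ":" and whose name starts with "md".
list_arrays() {
    awk '$2 == ":" && $1 ~ /^md/ { print "/dev/" $1 }' "${1:-/proc/mdstat}"
}
```

A hook could then loop over the output of `list_arrays` and apply whatever per-array rebuild steps are needed.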
Doesn't putting the exact same partition table on a device that already has a partition table result in no actual changes?
We have one kickstart file that we use regardless of medium type. For SSDs, we overprovision and leave unused space at the end. Some brands of SSDs were failing before we did that. We don't need to overprovision hard drives.
You could remove the repartitioning and it would do the right thing for your use case.
However, for deciding if an old disk is still usable, I've found it helpful to overwrite the whole disk multiple times with the badblocks utility (https://linux.die.net/man/8/badblocks). If the disk passes multiple runs without generating new bad blocks, it's often still usable.
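For reference, a sketch of what those multiple destructive passes might look like (the device name is a placeholder, and the commands are echoed rather than executed, since `-w` overwrites the entire disk):

```shell
#!/bin/sh
# Three destructive write passes with badblocks: -w write-mode test,
# -s show progress, -v verbose, -o log newly found bad blocks.
# Echoed for safety - remove the "echo" to run on a disposable disk.
DISK=${DISK:-/dev/sdX}
for pass in 1 2 3; do
    echo badblocks -wsv -o "badblocks-pass$pass.txt" "$DISK"
done
# Compare the per-pass logs: a disk that keeps producing new bad
# blocks across passes is not worth trusting.
```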
For example, this one doesn't document the -B option, which was added in 2011(!):
Better man page link:
This used to be my experience as well. However I've had a couple of WD Reds that developed a few (~20) bad sectors, one two years ago, the other three years ago. Both have been stable since, and I run periodic scrubs on them.
When the first one happened I had planned to swap it out, but was a bit low on cash right then so waited a bit. Then I saw it didn't deteriorate further, so I got curious and decided to wait. Still waiting.
Now if you start getting into the hundreds, then yeah, the drive is dying.
If you're designing something as complex as a drive, you'd think it would have a bit of electronics to test the power supply and refuse to work (with some relevant error code) if it's out of spec.
According to md(4) man page, that's what it does. Only if the write fails, it is added to the bad block list.
FWIW, I've also encountered an issue related to BBL - in my case multiple RAID6 components got bogus bad blocks (no idea from where, maybe a lower-level controller issue or driver bug?) so there were logical sectors that had insufficient non-"bad" backing sectors and the RAID array reshape process code was seemingly not prepared for that, entering an infinite loop with 100% CPU use and locking all use of the array.
I really don't like lvm either, but in some cases it's the only real way to join larger hardware arrays together (for example, some arrays will not let you stripe multiple raid7s, instead insisting that you have a 50-drive raid 7 array. No dice.)
This is where ZFS really shines. The control, visibility and documentation are just wonderful. It's almost as good as reading a redbook.
For the hobbyist I don't see any reason why you'd really want to have an mdadm array, barring the slight edge in performance when you are running at >75% capacity.
Had another one where the drives were in two external containers and one of them lost power for a moment. The hardware RAID controller thereby marked that half of the drives as failed. This was too many failures to run the array, but it also had no mechanism to assemble an array using the perfectly operational drives it had marked as failed, so that was a restore from backup.
Meanwhile mdadm now supports --write-journal which moots the only real advantage hardware RAID ever held.
The "don't even dare to use ZFS without dedicating 8-16 GB of memory to it" warnings all over the ZFS documentation online has served as a very, very good reason for me to not use ZFS for anything at all.
I'll stick with tooling that lets my homelab actually have enough RAM to run the applications I want to run.
Fixed that for you.
Now, I've since built a storage server for my homelab using ZFS, but mdadm is far more accessible. It also feels easier - I know what's on the drives and how it is stored. If something strange happens with ZFS, I don't have any hope except backups. Also, ZFS lacks support in most distributions, and its encryption capability is somewhat obscure.
I don't mean to bash ZFS here - it's great. But the GP feels like it's saying "I don't see why would anyone not want to run a dedicated kubernetes cluster with HA for running a program on Linux".
That's not quite right.
It's more that the documentation, command line interface and general user setup are far more friendly than mdadm's. I've been a sysadmin for many years, and looked after innumerable file servers, from clustered XFS (!) through many years of Lustre, plain old XFS/ext2/3/4 on HW RAID to NetApps & Isilons, and all sorts of flash-in-the-pan appliances.
Barring GPFS, the only file system that has documentation and tooling that is simple to understand and doesn't immediately bite you in the arse is ZFS.
Kubernetes is neither simple, easy, nor quick to set up or run.
ZFS works as well as any FS in places with 8 gigs of RAM. Hell, I even run a KVM node with just 16 gigs of RAM. It's working just fine.
This is not what the documentation or the thousands of posts from ZFS experts out there imply. The direct implication is that ZFS REQUIRES that much memory for day to day mundane operation.
Google "zfs memory requirements" and pick one. Go argue with them, not with me. The top result is a Stack Overflow answer, perhaps you can start there.
That's not documentation, and the posts agree with me, counter to your original claim.
> Go argue with them, not with me. The top result is a Stack Overflow answer, perhaps you can start there.
The top result literally agrees with me. It links to a page talking about dedup.
So, Google actually proved you wrong. You're spreading FUD.
That's precisely the reason you don't use mdadm?
The key issue is that I think it's crazy to tinker around with 'bad blocks' in the first place.
A hard drive with bad blocks that it cannot conceal from the end-user is broken and not to be trusted. Replace it.
from the documentation: "assume-clean: Tell mdadm that the array pre-existed and is known to be clean. It can be useful when trying to recover from a major failure as you can be sure that no data will be affected unless you actually write to the array. It can also be used when creating a RAID1 or RAID10 if you want to avoid the initial resync, however this practice — while normally safe — is not recommended. Use this only if you really know what you are doing."
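A hedged example of the recovery-style invocation that documentation is describing (device names and parameters are placeholders; the command is echoed rather than executed, because --create over real members with the wrong level, layout, or device order destroys data):

```shell
#!/bin/sh
# Example only: recreate a RAID1 over existing members without the
# initial resync, per the --assume-clean documentation quoted above.
# Getting the parameters wrong here destroys data, hence the echo;
# remove it only when you really know what you are doing.
cmd="mdadm --create /dev/md0 --level=1 --raid-devices=2 --assume-clean /dev/sda1 /dev/sdb1"
echo "$cmd"
```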
It's most likely because everything else in the installer's file system hierarchy is mounted read-only, I would think.
I believe the reason is that mdadm in the installer exists to create arrays for the installed system, so the installer's RAID configuration needs to supply a configuration file, and for some reason that's handled by having the installer's RAID configuration write it to /tmp/mdadm.conf and having mdadm read it from there. (Rather than, say, writing it to a temporary file and directing mdadm to read that temporary file.)
Now, the installer doesn't use md, but from a strict philosophical perspective this makes sense.
As others have said, /etc is writable during the installer.
More to the point: whatever the other reason is, it's likely a bad one. An app not using its own config file in the standard location, and losing data by clobbering its own config, is, from experience (Red Hat used to do this a lot and our customers HATED it), a mistaken decision.
Even for proper (not install only) live CDs where the rootfs is really read only, I don't think I've ever seen a distro not mount an overlayfs on top so that changes can be made in RAM. Too much stuff breaks on your average distro if you really try to have a read only rootfs.
Here's a list of the things the debian-installer can't do, or at least couldn't do when I needed it:
- accurate partitioning, such as choosing between GPT or MBR partitioning
- filesystem features (btrfs subvolumes, ext4 flags)
- creation of crypt volumes with keyfiles
- having user control over the names of crypt devices
- using an existing crypt device WITHOUT formatting it
- mdadm write-intent bitmap enabling/disabling
You can choose expert mode at the start when you boot the installer or you can switch to it from the menu once it's started.
I think you may be right about the other things, but this is not uncommon amongst OS installers. It is very hard to offer the full range of configurations available from all of the tools used to install a Linux system, without just putting the user at a shell prompt and letting them get on with it. Luckily that option is still there!
The removal procedure described in the article seems pretty straightforward. The only problem with it seems to be that you have to take your server out of production while you do it, which only applies after the server has entered production.
While the debian-installer does provide the rescue environment that you need to do this, it is by no means trivial.
Speaking as the article author I find it way way simpler to create the arrays from the shell at the start, rather than have to remember to reboot into d-i again, choose the rescue mode, tell it not to automatically assemble, drop to shell and assemble there. It's also faster — most of my servers take more than 60 seconds before even the serial-over-lan starts to display output!
I don't understand the mentality of dealing with bad blocks in the first place.
Storing data is easy. Storing data reliably is a whole other thing.
For example, if you power off a disk improperly, all un-acked writes are allowed to contain a mix of the new data, the old data, or an unreadable sector.
If, on Linux, you now try to read one of those unreadable sectors, you will get an IO error - as designed. If you now try to write that sector, you won't be able to. Why? Well the kernel cache is 4k sectors, whereas the disk is 512 byte sectors, so to write a disk sector it needs to read the other nearby sectors to populate the cache.
That pretty much means every time you power off a disk improperly, you'll get a bunch of unreadable data and filesystem errors, even though the filesystem is journaling so should be able to cope with a sudden poweroff just fine.
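One common workaround for that situation (my sketch, not from the parent) is to rewrite the whole aligned 4 KiB block containing the unreadable sector with O_DIRECT, bypassing the page cache so no read-modify-write is needed. DISK and SECTOR are placeholders, and the dd is echoed rather than executed because it zeroes 4 KiB of data:

```shell
#!/bin/sh
# Force the drive to remap an unreadable 512-byte sector by rewriting
# the aligned 4 KiB block around it, bypassing the 4k page cache.
DISK=${DISK:-/dev/sdX}
SECTOR=${SECTOR:-123456}   # LBA of the unreadable 512-byte sector
BLOCK=$((SECTOR / 8))      # 8 x 512-byte sectors per 4 KiB block
# Echoed for safety: this destroys the 4 KiB of data at that location.
echo dd if=/dev/zero of="$DISK" bs=4096 seek="$BLOCK" count=1 oflag=direct
```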
Either way, bad blocks on a hard disk are an unavoidable reality. The same is true for SSDs. It is normal, and should be expected, to run into bad blocks on the media during normal use of a drive. A correctly programmed firmware will remap the pointer to a good block from the reserve and either tell the OS driver to attempt the write again (if the drive's write cache has been disabled, as it should be), or silently re-map to the good block and write it with the data from the drive's write cache. If power is lost during this, no amount of redundancy will save you. The only solution is ZFS or Oracle ASM.
It has always been this way, and will always be like this as long as we don't have media which cannot have bad "blocks", for lack of a better term.
- Most of the time the entries don't correspond to actual bad blocks.
- It's buggy because it copies BBL between devices leading to another instance of "says the blocks are bad but they actually aren't"
- Once you've worked out that the entries are bogus it is still very hard to remove the entries or the BBL as a whole
- It's overly quiet in what it does. I monitor my syslogs but many people don't. There are many documented instances of people carrying entries in a BBL for years without knowing.
- Once MD thinks there is a bad block it renders that block on that device useless for recovery purposes. So in a two device array you just lost redundancy.
So this article has very little to do with the concept of bad blocks; it's about how md's BBL feature can be dangerous and why I don't want it enabled until it's fixed.
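Before disabling anything, it's easy to check whether your members are already carrying entries; here's a small helper around the --examine-badblocks output shown earlier in the thread (the "empty" wording is exactly what mdadm printed above):

```shell
#!/bin/sh
# Returns success if mdadm's --examine-badblocks output reports an
# empty list for a member, failure if there are entries to look into.
bbl_is_empty() {
    echo "$1" | grep -q 'Bad-blocks list is empty'
}
```

Run `mdadm --examine-badblocks` on each member device and feed the output through the helper; any non-empty list is worth investigating before trusting that device for recovery.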