
Debian-installer, mdadm configuration and the Bad Blocks Controversy - zdw
http://strugglers.net/~andy/blog/2020/09/13/debian-installer-mdadm-configuration-and-the-bad-blocks-controversy/
======
cbmuser
> If you’re on something with multiple virtual consoles (like if you’re
> sitting in front of a conventional PC) then you could switch to one of those
> after you’ve entered the MD configuration part and modify /tmp/mdadm.conf
> then. I don’t have that option because I’m on a serial console.

debian-installer starts a screen session these days when installing over a
serial console.

You can reach the next and previous consoles with Ctrl-a n and Ctrl-a p.

~~~
grifferz
Hi, article author here. That's interesting! I'd possibly never noticed this
because my serial console session has always itself been inside screen (now
tmux with screen bindings), so doing that would only ever have sent me to my
own next window!

I'll have to try it again next time I am in there, with a double escape.
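
If I remember screen's defaults correctly, Ctrl-a a sends a literal Ctrl-a
through to the inner session (and tmux with screen bindings should behave the
same, though that's an assumption), so the double escape ought to look
something like:

    Ctrl-a n      next window in my own outer session
    Ctrl-a a n    literal Ctrl-a to the inner d-i screen, then its next console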

------
jwilk
> Currently the only way to remove a BBL from an array component is to stop
> the array and then assemble it

This worked for me without stopping the array:

    mdadm /dev/md127 --fail /dev/sda --remove /dev/sda --re-add /dev/sda --update=no-bbl

Not ideal, because it degrades the array for a moment, but hey.

~~~
grifferz
Hi, article author here. I just tried that out on an Ubuntu 18.04 machine but
it didn't work:

    $ sudo mdadm --fail /dev/md0 /dev/sdb1 --remove /dev/sdb1 --re-add /dev/sdb1 --update=no-bbl
    mdadm: set /dev/sdb1 faulty in /dev/md0
    mdadm: hot removed /dev/sdb1 from /dev/md0
    mdadm: --re-add for /dev/sdb1 to /dev/md0 is not possible
    $ sudo mdadm --add /dev/md0 /dev/sdb1 --update=no-bbl
    mdadm: --update in Manage mode only allowed with --re-add.
    $ sudo mdadm --add /dev/md0 /dev/sdb1
    $ sudo mdadm --examine-badblocks /dev/sdb1
    Bad-blocks list is empty in /dev/sdb1

Any ideas why? md0 is a simple RAID-1 metadata version 1.2 array.

~~~
sn
I added

[https://raw.githubusercontent.com/prgmrcom/ansible-role-mdadm-bad-blocks/master/files/fix-md0](https://raw.githubusercontent.com/prgmrcom/ansible-role-mdadm-bad-blocks/master/files/fix-md0)

for you.

Apologies I haven't gotten to your PRs yet, but there is a ticket now in our
internal development queue to review and merge those.

~~~
grifferz
Interesting. What's the sfdisk for? Is that why my attempts to use --re-add
aren't working?

I also tried "mdadm --zero-superblock /dev/sdb1" to make mdadm forget that it
was ever an array member, but that didn't get me any further.

I use Debian so this won't help me directly, but once I work out why I can't
re-add it will be possible to use something similar in a postinst hook to
rebuild all the arrays.

~~~
sn
The sfdisk is there because, as part of our kickstart file, we only create the
first partition. After removing the disk from the RAID we repartition it.

I don't see why this couldn't fix up a RAID device generated by the Debian
installer. The device name could be parameterized if it's not always the same.

We run this script before there's any real data on the device so loss of
redundancy for a brief moment is not a huge deal - md0 is a very small device
so it doesn't take long to resync.

~~~
grifferz
But why do you repartition it?

Doesn't putting the exact same partition table on a device that already has a
partition table result in no actual changes?

~~~
sn
We repartition it because we're adding partitions.

We have one kickstart file that we use regardless of medium type. For SSDs, we
overprovision and leave unused space at the end. Some brands of SSDs were
failing before we did that. We don't need to overprovision hard drives.

You could remove the repartitioning and it would do the right thing for your
use case.

~~~
grifferz
Ah okay. It seems that my re-adds were failing on arrays that don't have a
write-intent bitmap. I was testing on small arrays, which don't get a bitmap
by default.
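
If that's right, then on a bitmap-less array something like this should make
the one-liner work - a sketch I haven't verified, with device names from my
test machine:

    # confirm the array currently has no write-intent bitmap
    mdadm --detail /dev/md0 | grep -i bitmap

    # add an internal write-intent bitmap, which --re-add relies on
    mdadm --grow /dev/md0 --bitmap=internal

    # now the fail/remove/re-add dance should succeed
    mdadm /dev/md0 --fail /dev/sdb1 --remove /dev/sdb1 \
        --re-add /dev/sdb1 --update=no-bbl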

------
nisa
Bad blocks on hard disks are also often remapped by the drive when written to.
However, in my (limited) experience, once you get bad blocks the disk often
deteriorates quickly. So this bad blocks feature is probably not as useful as
it sounds, because the hard disk firmware is the only place that really knows
about bad blocks, and it "lies" to you.

However, for deciding whether an old disk is still usable, I've found that
overwriting the whole disk multiple times with the badblocks utility
([https://linux.die.net/man/8/badblocks](https://linux.die.net/man/8/badblocks))
works well - if the disk passes multiple runs without generating new bad
blocks, it's often still usable.
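
Concretely, the destructive write-mode test I mean is roughly this - note it
overwrites everything, so only run it on a disk with no data you care about:

    # four-pass write test (0xaa, 0x55, 0xff, 0x00) with progress (-s)
    # and verbose reporting (-v) of any bad blocks found
    badblocks -wsv /dev/sdX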

~~~
jwilk
linux.die.net man pages tend to be badly outdated. And you can't easily tell
when this is the case, because they strip version information.

For example, this one doesn't document the -B option, which was added in
2011(!):

[https://git.kernel.org/pub/scm/fs/ext2/e2fsprogs.git/commit/...](https://git.kernel.org/pub/scm/fs/ext2/e2fsprogs.git/commit/?id=e53e8fb00970fc74)

Better man page link:

[https://man7.org/linux/man-pages/man8/badblocks.8.html](https://man7.org/linux/man-pages/man8/badblocks.8.html)

~~~
teddyh
If you use Debian, I would suggest
[https://manpages.debian.org/stable/e2fsprogs/badblocks.8.en....](https://manpages.debian.org/stable/e2fsprogs/badblocks.8.en.html)
(where you can replace “stable” with whatever designator you like.)

------
marcan_42
This is insane. I've been using md-raid for over a decade now and I had no
idea this crazy feature existed... Now I wonder if it has been involved in
some of the trouble I've had in the past.

------
AnssiH
> Even if the particular part of the device is unreadable, the operating
> system is supposed to try to write the correct data over the top.

According to the md(4) man page, that's what it does. Only if the _write_
fails is the sector added to the bad block list.

FWIW, I've also encountered an issue related to the BBL: in my case multiple
RAID6 components got bogus bad blocks (no idea from where; maybe a lower-level
controller issue or driver bug?), so there were logical sectors with
insufficient non-"bad" backing sectors. The RAID array reshape code was
seemingly not prepared for that, and entered an infinite loop with 100% CPU
use that locked up all use of the array.
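
If you want to check your own arrays for bogus entries, the per-component
lists can be inspected like this (names are illustrative, and the sysfs path
is from memory, so double-check it):

    # read the bad block list from a component's metadata
    mdadm --examine-badblocks /dev/sda1

    # or query the kernel's view of a running array via sysfs
    cat /sys/block/md0/md/dev-sda1/bad_blocks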

------
KaiserPro
I know this is going to sound trite, but this is precisely the reason I don't
use mdadm.

I really don't like LVM either, but in some cases it's the only real way to
join larger hardware arrays together (for example, some arrays will not let
you stripe multiple RAID 7 groups, instead insisting that you have a 50-drive
RAID 7 array. No dice.)

This is where ZFS really shines. The control, visibility and documentation are
just wonderful. It's almost as good as reading a Redbook.

For the hobbyist I don't see any reason why you'd really want to have an mdadm
array, barring the slight edge in performance when you are running at >75%
capacity.

~~~
McGlockenshire
> For the hobbyist I don't see any reason why you'd really want to have an
> mdadm array

The "don't even dare to use ZFS without dedicating 8-16 GB of memory to it"
warnings all over the ZFS documentation online have served as a very, very
good reason for me not to use ZFS for anything at all.

I'll stick with tooling that lets my homelab actually have enough RAM to run
the applications I want to run.

~~~
Sebb767
> don't even dare to use ZFS without dedicating 8-16 GB of ECC [!] memory to
> it

Fixed that for you.

Now, I've since built a storage server for my homelab using ZFS, but mdadm is
far more accessible. It also feels simpler - I know what's on the drives and
how it is stored. If something strange happens with ZFS, I don't have any hope
except backups. Also, ZFS lacks support in most distributions, and its
encryption capabilities are somewhat obscure.

I don't mean to bash ZFS here - it's great. But the GP feels like it's saying
"I don't see why anyone would not want to run a dedicated Kubernetes cluster
with HA just to run a program on Linux".

~~~
KaiserPro
> GP feels like it's saying "I don't see why anyone would not want to run a
> dedicated Kubernetes cluster with HA just to run a program on Linux".

That's not quite right.

It's more that the documentation, command line interface and general user
setup are far friendlier than mdadm's. I've been a sysadmin for many years and
looked after innumerable file servers, from clustered XFS (!) through many
years of Lustre, plain old XFS/ext2/3/4 on hardware RAID, to NetApps &
Isilons, and all sorts of flash-in-the-pan appliances.

Barring GPFS, the only file system with documentation and tooling that are
simple to understand and don't immediately bite you in the arse is ZFS.

Kubernetes is neither simple, easy, nor quick to set up or run.

------
StillBored
Ugh, please don't use "--assume-clean" on new RAID arrays. Let the OS build
the array from the ground up. Then, when it's done, scrub it.

from the documentation: "assume-clean: Tell mdadm that the array pre-existed
and is known to be clean. It can be useful when trying to recover from a major
failure as you can be sure that no data will be affected unless you actually
write to the array. It can also be used when creating a RAID1 or RAID10 if you
want to avoid the initial resync, however this practice — while normally safe
— is not recommended. Use this only if you really know what you are doing."
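
In other words, for a new RAID-1 the safe sequence is roughly this (device
names are placeholders):

    # create the array and let md do the initial resync itself
    mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sda1 /dev/sdb1

    # watch the resync complete
    cat /proc/mdstat

    # then scrub to verify that both halves agree
    echo check > /sys/block/md0/md/sync_action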

~~~
grifferz
It's on new devices in an installer context for a RAID-1 as the text you quote
describes, so yes I do know what I'm doing. But fair enough, it might give
other people bad ideas, so I'll remove it from the examples.

------
codetrotter
> the mdadm binary in d-i is compiled to have its config at /tmp/mdadm.conf. I
> don’t know why, but probably there is a good reason.

It’s surely because everything else in the installer's file system hierarchy
is mounted read-only, I would think.

~~~
JoshTriplett
When you're booted into the installer environment, /etc is writable. Some of
the sample code in the article shows writing a file in /etc, and the file
write works fine; mdadm just doesn't read the file.

I believe the reason is that mdadm in the installer exists to create arrays
for the installed system, so the installer's RAID configuration needs to
supply a configuration file, and for some reason that's handled by having the
installer's RAID configuration write it to /tmp/mdadm.conf and having mdadm
read it from there. (Rather than, say, writing it to a temporary file and
directing mdadm to read that temporary file.)

~~~
Filligree
Well, /etc/mdadm.conf is supposed to describe the running system.

Now, the installer doesn't use md, but from a strict philosophical perspective
this makes sense.

------
tremon
I didn't know about this "feature", but to be honest I learned years ago not
to use the partitioning features of the debian-installer. It's just another
crude tool to fight with, while the real tools are just a console away.

Here's a list of the things the debian-installer can't do, or at least
couldn't do when I needed it:

- accurate partitioning, such as choosing between GPT or MBR partitioning

- filesystem features (btrfs subvolumes, ext4 flags)

- creation of crypt volumes with keyfiles

- having user control over the names of crypt devices

- using an existing crypt device WITHOUT formatting it

- mdadm write-intent bitmap enabling/disabling

~~~
grifferz
By choosing "expert" mode you can choose between any of the partition table
types (many more besides MBR and GPT). In standard mode it will pick MBR if
none of your devices is bigger than the MBR partition limit (2TB?), otherwise
it will pick GPT, without offering you a choice.

You can choose expert mode at the start when you boot the installer or you can
switch to it from the menu once it's started.

I think you may be right about the other things, but this is not uncommon
amongst OS installers. It is very hard to offer the full range of
configurations available from all of the tools used to install a Linux system,
without just putting the user at a shell prompt and letting them get on with
it. Luckily that option is still there!

------
tzs
Instead of avoiding a BBL at creation time, it seems like it might be less
effort to just let one be created, and then remove it before first putting the
array into production.

The removal procedure described in the article seems pretty straightforward.
The only problem with it seems to be that you have to take your server out of
production while you do it, which only matters after the server has entered
production.
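
If I've understood the article's procedure correctly, that cleanup would be
something like the following (a sketch with placeholder names; the array can't
be in use, hence the out-of-production requirement):

    # stop the array, then reassemble it while dropping the BBLs
    mdadm --stop /dev/md0
    mdadm --assemble /dev/md0 --update=no-bbl /dev/sda1 /dev/sdb1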

~~~
grifferz
To remove the BBL from the devices that make up the array that the root
filesystem is on you would need to boot into a rescue environment and assemble
it there.

While the debian-installer does provide the rescue environment that you need
to do this, it is by no means trivial.

Speaking as the article author I find it way way simpler to create the arrays
from the shell at the start, rather than have to remember to reboot into d-i
again, choose the rescue mode, tell it not to automatically assemble, drop to
shell and assemble there. It's also faster — most of my servers take more than
60 seconds before even the serial-over-lan starts to display output!

------
sn
There is another discussion which may be worth having: why was this feature
enabled by default without considering how the software is used? What process
was missing? Is this still a problem, and if so is it pervasive to Linux
kernel development or is it specific to md?

------
louwrentius
If your hard drive has bad blocks it really needs to be replaced, it cannot be
trusted. You should never have to deal with this.

I don't understand the mentality of dealing with bad blocks in the first
place.

Storing data is easy. Storing data reliably is a whole other thing.

~~~
londons_explore
To add to this: even though bad blocks are allowed by the contract between a
device and the OS, they are so rare that all code paths that deal with them
are riddled with bugs.

For example, if you power off a disk improperly, all un-acked writes are
allowed to contain a mix of the new data, the old data, or an unreadable
sector.

If, on Linux, you now try to read one of those unreadable sectors, you will
get an I/O error - as designed. If you then try to write that sector, you
won't be able to. Why? Because the kernel's page cache works in 4 KiB units,
whereas the disk has 512-byte sectors, so to write a single disk sector it
first needs to _read_ the neighbouring sectors to populate the cache page.

That pretty much means that every time you power off a disk improperly, you'll
get a bunch of unreadable data and filesystem errors, even though the
filesystem is journaled and so should be able to cope with a sudden power-off
just fine.
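
The usual manual fix is to bypass the page cache and rewrite the bad sector
directly, which lets the drive remap it - a sketch, assuming a 512-byte-sector
disk and a made-up sector number:

    # O_DIRECT write of a single sector, avoiding the page-cache
    # read-modify-write that would otherwise hit the unreadable sector
    dd if=/dev/zero of=/dev/sdX bs=512 count=1 seek=12345 oflag=direct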

