Linux boot partitions and how to set them up (0pointer.net)
259 points by throw7 on Nov 3, 2022 | 179 comments



I'm a huge fan of UKI - it's part of the underlying 'magic' of ZFSBootMenu (https://github.com/zbm-dev/zfsbootmenu/). We ship a single EFI file that is a full Linux kernel, a semi-custom initramfs and an embedded command line. With that, we can fully support root-on-ZFS because we don't have to re-implement a complex filesystem in a bootloader ... like GRUB. Because we're not trying to re-implement ZFS (or any other modern/complex filesystem), we can ALWAYS be current. If a new version of ZFS is released, all we have to do is build a new EFI executable with that baked in to the embedded initramfs.

There are serious concerns with SecureBoot and being able to unlock your own bootloader - but UEFI itself is a nice universal base to target in 2022.


UEFI as such is horrible. (Just have a look at the spec; 2.5 k pages of stuff that looks like copy&paste of Windows APIs)

But its boot mechanism, and the added crypto stuff (secure / attested boot), is really nice.

Using UKIs is like boot should have worked since the beginning: You just copy a (signed) boot image onto the (firmware managed) boot partition. Done.


Most of the things in the UEFI spec are wishful thinking and completely unused by real-world hardware.


You can do the same just without UKI by using the same parts (kernel, initramfs + cmdline) as separate files. There's the kernel's EFI stub and other minimal bootloaders (like systemd-boot) if you need a more complete menu.


Creating and shipping a UKI means that people can just write it to a USB drive that has a vfat partition marked as an ESP. As long as the filename is BOOTX64.EFI, modern x86_64 hardware will automatically find and boot that if no other working EFI entries can be found. Since the kernel commandline is embedded in the UKI, there's no manual/error prone setup to be able to boot the file with any required/extra arguments.

https://github.com/zbm-dev/zfsbootmenu/wiki/Portable-ZFSBoot...
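In practice that boils down to roughly the following (a rough sketch only; the device name, ESP size and the name of the shipped EFI file are placeholders):

  # /dev/sdX is the USB stick; this wipes it
  parted -s /dev/sdX mklabel gpt
  parted -s /dev/sdX mkpart ESP fat32 1MiB 512MiB
  parted -s /dev/sdX set 1 esp on
  mkfs.vfat -F 32 /dev/sdX1
  mount /dev/sdX1 /mnt
  mkdir -p /mnt/EFI/BOOT
  cp zfsbootmenu.EFI /mnt/EFI/BOOT/BOOTX64.EFI   # the fallback path the firmware looks for
  umount /mnt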


ZFSBootMenu can also render itself as an ordinary initramfs image and maintain a separate copy of the kernel it's built against. These can be launched with some intermediate bootloader (e.g., rEFInd, syslinux or systemd-boot) or the kernel's built-in EFI stub can be run from the firmware. However, some Dell firmware seems incapable of properly passing command-line arguments to UEFI executables; a UKI that encodes its command line is needed on affected systems.


AFAIK Debian still doesn't have any integration available for handling systemd-boot & kernel packages together: there's nothing to maintain the loader/entries files that systemd-boot expects! It's really a shame, because systemd-boot is 10x simpler and 100x more pleasant to work with than grub & its multiple overlapping-but-different obtuse config-handling shell scripts. Bootctl is excellent & understandable, and the entries are human readable/authorable.

I've been on systemd-boot for a long while now. For a while I was just hand-maintaining vmlinuzes & loader entries, copying & editing stuff on /boot/efi. Easy but inelegant, & I'd forget the very simple steps.

I've been copy-pasting (well, it's in my ansible now) /etc/kernel/postinst.d/ hooks from stackoverflow, which write these files, & that greatly simplified life. It's a jank hurdle & I wish my OS would actually support this wonderful, easy-to-use tool. Systemd-boot is so much less obtuse, such a breath of fresh air, after years of grub (and many of uboot as well, but that's a different sector).

I made the jump ~two years ago to a single partition, just the ESP partition. There's a warning about not being able to set permissions properly, but it's worked fine & been so much more pleasant to operate. Very strongly recommend. It just worked for me on Debian, no real fiddling.

https://github.com/filakhtov/kernel-postinst-d/blob/master/9...
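Such a hook boils down to roughly this (a simplified sketch, not the linked script itself; the ESP mount point, entry name and root= option are placeholders you'd adapt):

  #!/bin/sh
  # /etc/kernel/postinst.d/zz-systemd-boot -- simplified sketch of such a hook
  set -e
  KVER="$1"                       # kernel version handed over by the kernel package hook
  ESP=/boot/efi                   # wherever your ESP is mounted (placeholder)
  cp "/boot/vmlinuz-${KVER}"    "${ESP}/vmlinuz-${KVER}"
  cp "/boot/initrd.img-${KVER}" "${ESP}/initrd.img-${KVER}"
  # write a Boot Loader Specification entry that systemd-boot will pick up
  {
    echo "title   Debian"
    echo "version ${KVER}"
    echo "linux   /vmlinuz-${KVER}"
    echo "initrd  /initrd.img-${KVER}"
    echo "options root=LABEL=root rw quiet"   # placeholder root= option
  } > "${ESP}/loader/entries/debian-${KVER}.conf"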


OTOH it's nice to have access to the boot loader via serial console. The systemd developers' judgement is that it's up to the firmware to provide serial console access if needed. Well I'll just get my chequebook out then...

Also it would be nice to be able to interact with the boot loader on a modern laptop display without having to get out a magnifying glass. Another problem that is deigned to be the fault of the firmware.

One of the great things about open source operating systems is that people step up to provide these sorts of improvements and I think it's a shame that systemd-boot will cause regression here.


Speaking from my heart. There is a reason we have convoluted, big software like GRUB: to have a single familiar software solution supporting a wide range of broken hardware/firmware. Fixing all the firmware is theoretically correct and would be great, but hopelessly unrealistic in the present. We need a change of culture among HW and FW vendors before we can say "that will be fixed by the firmware guys".


>Also it would be nice to be able to interact with the boot loader on a modern laptop display without having to get out a magnifying glass. Another problem that is deigned to be the fault of the firmware.

> I think it's a shame that systemd-boot will cause regression here.

If you use grub2, yes you need the magnifying glass.

Meanwhile, I just press F12 and use the BIOS (which selects a readable font) and pick the UEFI entry I want.


Can't edit kernel parameters from the firmware's own boot selector though can you. Or see it on an external monitor (at least on the laptop I'm using right now, which normally lives hidden out of sight attached to a dock).

My recollection was that GRUB when in graphical mode looks readable. Maybe I'm misremembering. The systemd-boot menu was definitely far too small, and there was a bug that was basically closed with "it's the firmware's job to give us a usable text output protocol".

Effectively, as shitty as the legacy PC-BIOS boot protocol was, it did at least allow open source software to jump in early and work around the gaping deficiencies in the vendor's firmware. UEFI has effectively replaced all that with a proprietary boot process. The vendor is now in complete control; want a usable text console, or to access the firmware via an external monitor? Vendor doesn't give a shit. Want to access the boot loader via the serial console? Oh you must be an enterprise, let me direct you to our sales staff who won't give you the time of day unless you've got £thousands to drop into their pocket.

Don't get me wrong, I'd love to replace GRUB with systemd-boot, but these are very real impediments to me ever doing so.


> Can't edit kernel parameters from the firmware's own boot selector though can you.

Yes you could, if the firmware did let you change the Unicode string (some firmwares let you do crazy things)

> UEFI has effectively replaced all that with a proprietary boot process. The vendor is now in complete control; want a usable text console, or to access the firmware via an external monitor? Vendor doesn't give a shit. Want to access the boot loader via the serial console? Oh you must be an enterprise, let me direct you to our sales staff who won't give you the time of day unless you've got £thousands to drop into their pocket.

You are overdramatizing the current situation. Just boot to a UEFI shell and you can do all that. There are many.

If all you need is to edit the cmdline parameters, add an initrd containing a busybox + kexec to show the currently entered parameters and either keep booting as is, or edit them and kexec the kernel with the new parameters.

It's something a bit like what zfsbootmenu does to let you select the root filesystem you want, and it's much more powerful than a UEFI shell since you get the wide set of drivers Linux supports, and the even wider set of tools you can run on the command line.
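Roughly, the kexec part of that could look like this (a sketch, run from inside such a busybox initrd; the kernel and initramfs paths are placeholders, and busybox's read is assumed to support -p):

  # show the current kernel command line, let the user edit it, then re-exec
  CURRENT="$(cat /proc/cmdline)"
  echo "current cmdline: ${CURRENT}"
  read -r -p "new cmdline (empty keeps the current one): " NEW
  kexec -l /boot/vmlinuz-linux --initrd=/boot/initramfs-linux.img \
        --command-line="${NEW:-$CURRENT}"
  kexec -e    # jump into the freshly loaded kernel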

> Don't get me wrong, I'd love to replace GRUB with systemd-boot, but these are very real impediments to me ever doing so.

The #1 impediment seems to be your lack of desire to reclaim freedom, something easy to do if you spend a few hours learning about UEFI payloads, initrd, kexec etc.

Get in touch by email and I'll help you.


How would someone email you? No email in your profile.


username at outlook dot com


Great to see Lennart back working on what systemd does best: streamlining and cleaning existing grubby (pun intended) parts of Linux.

/boot (and ESP) management always feels hacky at best.


Can you explain what about the boot partition status quo seems hacky, and how this approach "cleans" it?


If you create an efi payload, you remove the need for the /boot partition and for grub2, killing 2 birds with 1 stone.

The payload can be managed by your UEFI BIOS and efibootmgr, allowing more advanced things like signing your payloads with your own keys for a true secure boot, instead of the hackish "MOK" that's just using unprotected UEFI variables.


Issues with the current setup are pretty well laid out in the article.


The article does not characterize the status quo as hacky, and furthermore does not acknowledge the existence of mainstream setups that do not share the pitfalls of the "typical setup". I.e., the article kind of creates a strawman.


In the article the Linux mount points can be as described regardless of where the Linux boot files & kernels are physically stored in the various partitions & folders. We see this variation in the wild, observing the different physical locations among the different distributions.

Or you could put key files in places not yet described. That's a worthwhile advantage for Linux: your choices are less limited.

With legacy BIOS doing the job, it's always been smooth & straightforward to use the DOS and/or NT boot loaders to choose between multiple Windows and Linux installations on different partitions. Even though only one out of a maximum of 4 primary MBR partitions per HDD is capable of being marked bootable at a time, you can still boot from there to as many logical MBR partitions as you can GPT volumes under UEFI. Either way HDD space is the only limitation; UEFI offers no advantage here, and when Microsoft originally claimed it did they were obviously lying.

As it stands now, the Microsoft UEFI boot loader will still not boot Linux, so it is not yet as advanced as the Microsoft BIOS boot loader was 20 years ago.

Therefore to multiboot Windows and Linux by UEFI, you now need to use a Linux boot loader for Windows, when you never needed to do that before, and that basically means Grub2.

Too bad Syslinux 6.04 was badly fragmented in its misguided rush to attempt UEFI without deep co-operation with the actual Syslinux Project itself. So the various 6.04's are somewhat of a shitshow even though there are some nice features that hopefully will condense into stability someday when UEFI becomes truly well-handled. One can only hope.

Syslinux 6.03 has been highly stable and very feature complete since 2014 and no urgent work should be expected, but that is only for FAT32 BIOS booting not UEFI. Along with Isolinux of course for CD/DVD booting, and Pxelinux for netbooting.

Anyway Grub2 has some forks that will read Syslinux config files if desired. This can be good if you want to stick with the tradition of manually creating and editing your own text config file from better documented references the Syslinux and Grub1 way. Compared to the Grub2 approach which highly discourages touching the config file and autogenerates it occasionally instead in a somewhat mysterious way which is much more poorly documented.

But the stable Grub2 is just fine with its own kind of config file, and now works great on UEFI for multibooting Windows.

Anyway, when Windows 7 first appeared the HDD was still structured with a single NTFS partition spanning the whole HDD, with a boot loader file in the root of the volume along with a BOOT folder containing the BCD, just like Vista. XP has a couple boot files and BOOT.INI in the root with no need for a separate boot folder.

So the boot files and folders were in the same drive volume that the Windows folder was installed to up until that point but this was never a required layout. This was recognized by multibooters since additional Windows installations made to different partitions required no additional boot loaders of their own. Further Windows installations to the same PC simply add an additional boot entry to the default bootmenu in the Windows config file (BOOT.INI or BCD), triggering the menu to display upon startup when there is more than one entry.

A number of months after the first release of Windows 7, the boot files & folder disappeared from the Windows volume and began to be located in a small (often 100MB) hidden FAT32 first partition preceding the following second NTFS partition spanning the remainder of the HDD. Although OEM layouts often had third or fourth partitions somewhat hidden following the main Windows volume for built-in recovery and/or restoration, when you installed Windows from scratch you would not have these low-GB final partitions unless you took the effort yourself.

Or in my case leave some large final partitions for various NTFS, EXTx and FAT32 formatting later.

All you ever do then is add entries to an already functional bootmenu in a hidden volume.

Plain to see. /s

Not enough people saw it, but this was when there was finally way more space on the HDD than Windows needed, leaving plenty for Linux. Or other Windows versions, VMs will only get you so far. Judicious partitioning was as good or better than having a second HDD dedicated to Linux.

A PC started out with a Windows bootloader/bootmenu, and with MBR/BIOS you can just add entries to boot Linuxes which are installed onto their EXTx partitions if you wanted. Or you could switch to Grub on the hidden volume and accomplish the same thing, or Syslinux since it was FAT32.

But a regular Windows PC would boot to a Linux partition(s) from its original Windows bootmenu, and you had complete manual control of the config file from Windows (or Linux).

Once the UEFI system descended from this structure to depend on a similar but more obscure FAT32 ESP volume containing boot files, preferably the first partition on the HDD, it cemented the sensible practice of continuing to add new boot entries to the default structure, rather than restructuring, except in very particular situations.


Well, no, you don't chain-load Windows from grub either, because then the PCRs are wrong and BitLocker rightly complains.

Which is why you... Just use the system boot manager to pick which loader to load. Easy, problem solved.

And you can edit the system boot manager entries from Linux and Windows, set bootnext from either one. It's really strictly superior to what you're describing.


How to make one boot partition to rule them all (debian, secure boot disabled for UEFI due to weird bug with how files are laid out with removable flag):

  parted -s "${diskpath}" mklabel gpt
  parted -s "${diskpath}" mkpart primary 1MiB 2MiB
  parted -s "${diskpath}" set 1 bios_grub on
  parted -s "${diskpath}" mkpart primary 2MiB 202MiB
  parted -s "${diskpath}" set 2 esp on
  sleep 1
  mkfs.fat -F 32 -n "boot" "${diskpath}2"
  mount "/dev/disk/by-label/boot" "/boot"
  grub-install --target=i386-pc "${diskpath}"
  grub-install --target=x86_64-efi --efi-directory=/boot --removable --no-uefi-secure-boot
  grub-mkconfig > "/boot/grub/grub.cfg"
This makes two partitions, one for GRUB to inject legacy BIOS boot code into and one for the ESP. The ESP gets mounted to /boot, and grub gets set up to support both. Only bug is Debian complains about symlinking .bak files for initrd, no biggie.

(This is part of a larger Debian imaging script I made)


> grub-mkconfig > "/boot/grub/grub.cfg"

Shell redirection is fine, but you could just use `grub-mkconfig -o` to set the output file.


Why? One more flag to remember whereas redirection works for any program.


To my understanding the shell will truncate the redirect target even if the command fails.

Using -o makes failure nondestructive.


Mmm I would argue that using -o may make failures non-destructive at the discretion of the author of the program. I don’t know what the program does. Does it safely and securely create a temporary file, then write the contents to it, then rename the file? This would need to happen on the same file system as the target file for the rename to be atomic and there are few guarantees that my current user has the permissions to write to any file other than this one. It also would certainly fail if the file system is full.

Or the program could start out by deleting my file and starting to line buffer write to it as it executes. Then the failure would be the same as shell redirection. Or it could buffer differently making program failure even more fun.

My point is that I don’t know what the program will do, or even what this particular version will do, without looking at its code.

Shell redirection has predictable semantics. And I can output to a different file, run diff on the two, then rename myself, if this is truly critical. And muscle memory will be faster than looking up the specific option or argument for this specific program.


It doesn't matter which end of the egg.


It does when one is a standard convention compatible with all programs whereas -o does different things for different programs. For example, grep -o foo will print only the parts of the input stream that match "foo" but grep > foo will write to the file foo. Some commands don't even have a -o flag like "cat" but output redirection is still always an option.


Careful, not always -- if they ever work with a pager they need to be able to detect redirection


The symlinking of .bak files for initrd ended up completely bricking my Ubuntu installation a few months ago, so... take heed? I am unimpressed by the quality and robustness of that whole stuff, and ended up writing essentially this article into my notes for the next Arch installation I make, where I think there's a fair chance that EFI will live in /efi but be symlinked in /boot.

There's a whole other problem, though, of my wanting to use secure boot. Ubuntu was the easy out for that. At the moment, it's just disabled on my machines, which is far from optimal.


Secure boot is super easy when using a UKI.

I have a setup that's almost like what Poettering proposes. Only that I don't use any external bootloader as it seems completely unnecessary.

All you need for secure boot in such a setup is signing the UKI with your keys.

That's the simplest boot setup ever!

It's only one file, so no moving parts. It just works. No LILO and config, no grub and config, no systemd-boot, no nothing. Just the signed UKI on the EFI partition, and an efibootmgr entry pointing to that single file. That's all that's needed to boot a modern system.
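For illustration, such a setup might look roughly like this, assuming the ESP is mounted at /efi and using systemd's ukify (recent systemd versions) plus sbsign; the kernel/initrd/key paths, the root= UUID and the disk are placeholders:

  # build the UKI from kernel + initrd + embedded command line
  ukify build \
    --linux=/boot/vmlinuz-linux \
    --initrd=/boot/initramfs-linux.img \
    --cmdline="root=UUID=xxxx-placeholder rw quiet" \
    --output=/tmp/linux-uki.efi
  # sign it with your own secure boot (db) key and drop it on the ESP
  sbsign --key /etc/secureboot/db.key --cert /etc/secureboot/db.crt \
    --output /efi/EFI/Linux/linux-uki.efi /tmp/linux-uki.efi
  # point the firmware boot manager at that single file
  efibootmgr --create --disk /dev/nvme0n1 --part 1 \
    --label "Linux UKI" --loader '\EFI\Linux\linux-uki.efi'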


> This makes two partitions, one for GRUB to inject legacy BIOS boot code into and one for the ESP.

Why do you still need the legacy BIOS boot logic?


To provide more to the "why": although end-user devices have moved away from BIOS-only boot, there are a large number of systems that have no alternative to BIOS, and the hardware cannot be upgraded, only the firmware and software.

Most of the systems I have run into were in SCADA[0], PMS[1], and BMS[2].

[0]:https://en.wikipedia.org/wiki/SCADA

[1]:https://en.wikipedia.org/wiki/Power_management_system

[2]:https://en.wikipedia.org/wiki/Building_management_system


For BIOS boot, the BIOS looks at the first couple of blocks on a hard drive for the boot code; that tiny stub then needs somewhere to hold the rest of GRUB's core image. That is what the first GPT partition created above (the one flagged bios_grub) is for, and a later grub-install injects the BIOS bootloader core there. Thus, to support BIOS and UEFI, you need the BIOS boot partition at the beginning of the drive.


Yes, but why would you want the BIOS boot, assuming you have a motherboard made in the last 10 years and it has a UEFI implementation which isn't completely broken?

I might understand doing that just in case when preparing a bootable flash drive. Why complicate things for permanent installations where you know your current hardware and your next system after 5 years is unlikely to be much worse than the current one?


You wouldn't, this is more for cloud/virtual images where some providers support UEFI but most still only support BIOS.


  parted -s "${diskpath}" mklabel gpt
  parted -s "${diskpath}" mkpart primary 1MiB 2MiB
Isn't "primary" wrong here? It assumes your `mklabel` command used label-type of msdos or dvh. You are using gpt though. In case of gpt "primary" is a label, not a partition type.

You can use a descriptive string here instead of "primary" (e.g. "boot partition", or "esp partition").
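For example, the same commands with descriptive names instead (keeping the original offsets):

  parted -s "${diskpath}" mkpart "bios grub" 1MiB 2MiB
  parted -s "${diskpath}" mkpart "esp partition" 2MiB 202MiB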

https://manpages.debian.org/unstable/parted/parted.8.en.html...


My humble boot installer, no explicit bootloader, straight to the kernel:

    #!/bin/bash
    set -ueo pipefail

    # Remount EFI partition read/write and restore to readonly when done

    trap 'mount /sys/firmware/efi/efivars/ -o ro,remount &>/dev/null || true' EXIT
    mount /sys/firmware/efi/efivars/ -o rw,remount &>/dev/null || true

    # Remove all existing Arch Linux entries

    efibootmgr | grep 'Arch Linux' | grep -Po 'Boot\K\d+' | while read -r bn; do
      efibootmgr --delete-bootnum -b "$bn" &> /dev/null
    done || true

    # Install boot entry

    efibootmgr --verbose \
      --create --disk /dev/disk/by-id/nvme-abcdef --part 1 --label "Arch Linux" \
      --loader /vmlinuz-${_linux} \
      --unicode "initrd=\\intel-ucode.img initrd=\\initramfs-linux.img OTHER-KERNEL-BOOT-PARAMS"
EDIT: added initrd boot params


If you want to add an initrd, create an EFI payload: https://wiki.archlinux.org/title/Unified_kernel_image#Manual...

  stub_line=$(objdump -h "/usr/lib/systemd/boot/efi/linuxx64.efi.stub" | tail -2 | head -1)
  stub_size=0x$(echo "$stub_line" | awk '{print $3}')
  stub_offs=0x$(echo "$stub_line" | awk '{print $4}')
  osrel_offs=$((stub_size + stub_offs))
  cmdline_offs=$((osrel_offs + $(stat -c%s "/usr/lib/os-release")))
  splash_offs=$((cmdline_offs + $(stat -c%s "/etc/kernel/cmdline")))
  linux_offs=$((splash_offs + $(stat -c%s "/usr/share/systemd/bootctl/splash-arch.bmp")))
  initrd_offs=$((linux_offs + $(stat -c%s "vmlinuz-file")))
  objcopy \
    --add-section .osrel="/usr/lib/os-release" --change-section-vma .osrel=$(printf 0x%x $osrel_offs) \
    --add-section .cmdline="/etc/kernel/cmdline" \
    --change-section-vma .cmdline=$(printf 0x%x $cmdline_offs) \
    --add-section .splash="/usr/share/systemd/bootctl/splash-arch.bmp" \
    --change-section-vma .splash=$(printf 0x%x $splash_offs) \
    --add-section .linux="vmlinuz-file" \
    --change-section-vma .linux=$(printf 0x%x $linux_offs) \
    --add-section .initrd="initrd-file" \
    --change-section-vma .initrd=$(printf 0x%x $initrd_offs) \
    "/usr/lib/systemd/boot/efi/linuxx64.efi.stub" "linux.efi"
The resulting "linux.efi" can be added directly with efibootmgr, and contains the kernel boot parameters (cmdline).


uh - you just specify the location of your initramfs in the kernel boot params and that's it, no need for all the above


You hadn't specified it at first, so I thought it might be helpful to provide a more complete example with different parts (like the initrd) and offsets, with a gummiboot stub


Having kernel and initrd separate makes things more complicated and brittle.

Also a secure boot setup is much more difficult this way.

I for my part love the UKI. Never had a simpler boot setup!


> a secure boot setup is much more difficult this way.

Is it? Don't you just sign the bootable kernel image that already has the initrd and command-line built in?

Oh, I guess if you're using Microsoft as a CA I can see why that would be tricky.


I think this is a misunderstanding. I've said secure boot is much more difficult in case kernel and initrd are separate.

In case of a UKI it's very simple of course. Just sign the boot image.

That's why I love the UKI. :-)


So is this the first work Microsoft set Poettering to do? Kinda supports my personal conspiracy theory that Microsoft's aiming to make secure boot only possible with systemd-boot.

Or Microsoft is trying to make dual booting easier without using the simplest solution of making a larger ESP the default.


So what's your "conspiracy" here? What is secure boot-only? Secure boot with systemd-boot has been possible for years. Just that the typical setup has not been very secure because the initrd and the kernel command line where already unsigned.

I see no problems with secure boot from a freedom perspective as long as the owner of the computer can install their own trusted public keys. There might be industry players that would like to remove that possibility. But why Poettering's work would make that easier I fail to see.

Do you think once a real secure Linux boot is possible they will remove machine owner rights and you have to buy a signed Linux from Microsoft? Or from a Microsoft/IBM/? consortium super monopoly?


> Secure boot with systemd-boot has been possible for years.

It was not, ironically because of Microsoft. Shim is by policy effectively only allowed to boot grub2. So systemd-boot can't be used OOTB on secure boot enabled systems, you'd have to enroll your own key.


> It was not,

It has been possible for years. I have used it for 4 years myself and I was not the first adopter.

> Shim is by policy

I am not talking about shim, you don't have to use it.

Here's how to do it:

1. You erase the whole key database from UEFI. (That possibility was some agreement between antitrust authorities and Intel and/or Microsoft that such a possibility must be provided, because Wintel was in a dominant market position. On Arm devices it is often not possible; Windows on ARM is not in a dominating position and antitrust was not relevant.)

2. You generate your own key pairs.

3. The public ones you install into the secure boot database of UEFI.

4. You sign your UEFI application with your private key.

I have done that with `systemd-boot` and with the Linux kernel (containing the UEFI stub). Works in both cases. I used the instructions from https://wiki.gentoo.org/wiki/User:Sakaki/Sakaki%27s_EFI_Inst... 4 years ago and still do it the same way today.
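In case it helps, the key generation and signing steps (2 and 4) boil down to something like the following, using openssl, efitools and sbsigntools; this is only a sketch of the db key part (the full procedure, including PK and KEK, is in the guide linked above):

  # 2. generate your own signature-database (db) key pair
  openssl req -new -x509 -newkey rsa:2048 -nodes -days 3650 \
    -subj "/CN=my secure boot db key/" -keyout db.key -out db.crt
  # 3. convert the certificate into an EFI signature list and sign it with your KEK
  #    so the firmware will accept it into the db variable
  cert-to-efi-sig-list -g "$(uuidgen)" db.crt db.esl
  sign-efi-sig-list -k KEK.key -c KEK.crt db db.esl db.auth
  # 4. sign the UEFI application (systemd-boot, or a kernel with the EFI stub)
  sbsign --key db.key --cert db.crt \
    --output systemd-bootx64.efi.signed systemd-bootx64.efi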


On some Lenovo machines (x86) deleting the key database can brick the machine. So while you can technically do it, in practice you can't.


Do you have a reference?

Deleting the key database cannot be done programmatically (unless the UEFI has a bug). In all machines I have looked at, it's an option in the BIOS. So that should be a clear case of warranty repair. Which of course does not help you if you do it after the warranty has ended.



Thanks for the link.

I guess it remains unclear whether it was a firmware bug that has since been corrected or whether it depends on how exactly the user installs their own keys.

The reply that the UEFI itself would be signed, and that if you delete the matching keys from the relevant DB the UEFI would no longer start, does not sound right to me.

Good to see that the option exists for AMD, too. I guess AMD had no dominating market share when secure boot was introduced. So they would probably not be legally obliged to provide it? Hopefully the market power of those requiring independence from Microsoft is big enough to keep it that way.


This sounds worrisome. Could you explain more? How can this brick a machine?


Some pieces of firmware on the machines are signed by the same keys in the secure boot database. Deleting the keys ends up blacklisting the firmware and so now the machine can't start up correctly because it no longer trusts hardware it needs to work.


I think they will better integrate systemd-boot with secure boot and eventually revoke access to the current shim signer, in essence forcing major distros to adopt systemd-boot.

I can't exactly figure out how they'd use that level of control over the Linux boot, but I'm not a huge fan of them having it.


I find it best to keep in mind that an IBM-compatible PC is designed foremost to run Microsoft OS's which are much less flexible. Major distributions already conform well to the default way that recent Windows versions structure a brand new blank HDD from scratch, as well as the various derived ways PC vendors prepare their OEM hardware which is supplied with Windows pre-installed.

The better that Linux conforms to this range of layout structures, the more overall effectiveness you will have when multibooting, especially the massive multibooting possibilities unique to the UEFI boot method.

You're not going to figure this out on only one manufacturer's hardware, and not over a single-year period.

A 100MB boot partition is not enough, 500MB either. My first FAT32 partition is always the full 32GB that regular FAT32 is made for, but it's the same HDD structure. I've got plenty of space.

In UEFI I like to put my Linux kernels all on the FAT32 volume where I can handle them along with my Grub files entirely from Windows if I want. I carefully engage the Grub menu autoconfig so my custom bootmenu containing my up-to-date kernel choices is not lost or mangled.

Fundamentally in different Windows versions the particular Windows kernels are stored on the different partitions, and the bootloader just picks a partition. With Linux the different distributions are installed to their own separate EXT partitions, but are not strictly tied to the exact kernel they were issued with. When you update the kernel you can just copy it to the FAT32 volume, then put it on the bootmenu to replace the previous one, or as an alternative. Whether or not you want to update the entire remainder of any distro.

Each Grub bootmenu entry specifies which kernel will be loaded first from which folder on which partition, followed by which full distribution will be loaded next from the same or a different partition.

But this is the forbidden ESP partition; you may not want to go there, but EFI already is.

To tame the beast you're going to have to change its identity (temporarily) so you can edit it willy-nilly.

Change or edit its partition GUID designation from ESP to Basic and a proper mainboard UEFI will still reboot by finding the EFI folder on the FAT32 volume as before, but in Windows you will now be able to assign a drive letter to the ordinary FAT32 partition and see the EFI folder there.

If it doesn't boot EFI without the strict ESP GUID in place, you would have to edit it "offline" from externally booted media.

As a very basic bare-metal layout all you need on a FAT32 volume is an EFI folder containing a BOOT folder (EFI\BOOT) containing a bootx64.efi file and it's off to the races.

But bootx64.efi's are squirrely beasts.

It's almost like no two are alike, god I hope these are real code and not generated.

Each OS install can have its own predictable way of overwriting the previous bootx64.efi with its own, sometimes more predictable than others.

A Windows bootx64.efi will then look for an \EFI\Microsoft folder, drill down to EFI\Microsoft\Boot\BCD and run the Windows bootloader

A Ubuntu bootx64.efi will look for \EFI\ubuntu and run the Grub found there

Each OS installed adds its own specific \EFI\XXX folder to the ESP and they are all independent.

But by convention the latest installed OS takes over as the bootloader going forward, simply by overwriting bootx64.efi with its own, and you may need to rely on an autodetect process to update your new bootmenu in its corresponding \EFI\XXX folder, to list all OS's that were previously installed to their partitions.

Otherwise you may need to fill out the config file bootmenu entries custom or manually, or switch back to one of the previous bootloaders which update Grub to include additional distros more ideally.

You will see that different distros sometimes have Grub fully in the ESP, which can include the kernels or not, and others where a good number of the Grub files are on their EXT volume, but mounted interestingly after you have booted. When kernels are only present by default on the EXT volume, I leave them there but place a copy into the working ESP folder(s) additionally.

>Boot Partition Discovery

>The traditional boot partition was not recognizable by looking just at the partition table. On MBR systems it was directly referenced from the boot sector of the disk

Actually not so. The partition table is only the final few bytes of sector 0, and whichever of the 4 possible primary partitions is currently marked for bootability has always been clear to see. It will be the one with the 80h, all others are supposed to have 00 as a boot flag.

Therefore you can (I advise) have duplicate or alternative boot files on each primary legacy partition, each of which can boot its own OS, or boot to any of the other primary or logical volumes from the flagged partition's own unique (or identical) bootmenu. You choose which primary partition's boot files will be "Active" by placement of the boot flag, and it reboots accordingly. All other boot files remain dormant on their non-flagged partitions.

Except for the partition table at the end, the bulk of sector 0 is the Master Boot Record itself for that HDD, which is the code that runs to find a primary partition marked Active. Control then transfers to the active partition's first sector which is supposed to be a Volume Boot Record, commonly known as a boot sector.

A DOS bootsector then seeks IO.SYS on that volume in a filesystem it understands, an NT5 bootsector would seek NTLDR, and an NT6 bootsector seeks BOOTMGR from there before parsing any multiboot menus contained in config.sys, boot.ini, or BCD, respectively.

In Linux a Syslinux bootsector seeks ldlinux.sys, Grub seeks grub.cfg.

>ESP can be auto-discovered from the partition table, traditional boot partition cannot.

Nope, not true, this is the same misconception as above.


I don't understand why you replied to my post with this, it seems to have nothing to do with my reply.


I'm not sold on using automount to reduce the time spent with the filesystems mounted. Unless I've missed something, having a filesystem mounted doesn't make it any more susceptible to damage; being "mounted" just means that the kernel populates its data structures in memory and adds it to the VFS, it doesn't incur any ongoing r/w access. What risks corruption is writing data... which this doesn't stop, because the moment anything tries to access it the OS will helpfully mount the filesystem again.


In fact, I was bitten once by corruption because the _unmount_ operation was interrupted mid-write. Not that surprising considering it's a much less tested scenario on fs code.


Okay, that hadn't occurred to me. I did wonder if mounting was a problem, if VFAT has "last mounted time" or other metadata that gets written per-mount.


Even VFAT has to clear the dirty flag when you unmount it; other filesystems will similarly write the superblock or do something with the journal.


If things were set up to only mount the filesystems while they're being modified to update the kernel, I can see the value in that. I'm guessing that's not being proposed here, though, because it's too much friction to change the current boot system scripts, and automount doesn't incur the same friction?


Even then, what would it change? The only time those filesystems are being written to is for a new kernel, a new bootloader, or a bootloader config change. In every one of those cases, the filesystem still has to be mounted, so I'm not seeing what the benefit is of keeping things unmounted right up until you're going to write something. (Basically, I can't seem to see any version of this that reduces the actual total writes to the filesystem.)


This is excellent!

Over the years, I've been pleased to see that more and more distributions are writing their disk images and the like to the ESP. (Previously, dd'd USB images for distro installing _required_ the creation of a /boot partition)

The logical next step would be to standardize everything through systemd, and ensure all boot images are autodiscoverable and automatically bootable.

It's been somewhat frustrating for distributions to install GRUB, hijacking the previous prioritized boot PE, and have entries for other installed Linux distributions missing.


> Over the years, I've been pleased to see that more and more distributions are writing their disk images and the like to the ESP. (Previously, dd'd USB images for distro installing _required_ the creation of a /boot partition)

Same. The boot partition is a relic of the past, and you can easily do without it when you remove other relics like grub2: simply make EFI payloads the Arch way: https://wiki.archlinux.org/title/Unified_kernel_image#Manual...

> The logical next step would be to standardize everything through systemd, and ensure all boot images are autodiscoverable and automatically bootable.

As much as I like systemd, I think it's not necessary here: I have an ESP that's about 16G: it also contains a few .ISOs that can be loaded in case system maintenance or a full reinstall is needed. GrubFM and others like Ventoy let you boot directly into, say, Windows11-22H2.ISO, Ubuntu22-10.iso etc.

When I buy a new drive or computer, I just copy this partition: it's much faster than playing with thumbdrives or PXE Boot.


One downside to storing these images on the (FAT) EFI System Partition is it is not possible to make it part of a software RAID device, so if the device the ESP is on dies suddenly, the boot fails AND it can take quite some time and effort to recover even if you're technically familiar with the boot process since there are no standardised procedures for duplicating the ESP or its content.

A second issue is that since FAT doesn't have permissions any user ID on the system can potentially mount, read, and possibly write to the ESP.

A separate (possibly encrypted) /boot/ file-system can be hosted on a RAID device or even loaded over the network.

With a 'regular' boot process (UEFI > ESP boot-loader core > LUKS unlock > /boot/ boot-loader 2nd stage > /boot/ kernel+initrd > root-fs) the only thing lost is the boot-loader core, which can quickly be switched for a USB boot, set options appropriately to find and use /boot/, and for the boot process to use alternate RAID device(s) and continue with a degraded boot.

On UEFI most distros now do not use os-prober to populate GRUB's own menu with other bootable OSes (GRUB_DISABLE_OS_PROBER=yes) since it expects those OSes to have added themselves to the UEFI boot menu.

As for replacing the default boot entry, this is caused by efibootmgr not GRUB. GRUB calls efibootmgr with "--create|-c" which adds a new BOOTNUM and puts it at the head of the list.

GRUB could be taught a new option that then does several efibootmgr operations: parsing text output of "efibootmgr" to get current boot order, --delete-bootorder then --bootorder A,B,C,OS to ensure OS is added last.

That is obviously fragile, so the best approach would be to teach efibootmgr a new option --create-append|-C that adds the new entry to the end of BootOrder, and then simply teach GRUB a new variable GRUB_UEFI_BOOTORDER_FIRST=yes|no to use efibootmgr --create-append.
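Until something like that exists, the fragile variant can be scripted around efibootmgr's text output; a rough sketch, assuming the freshly created entry is Boot0003:

  NEW=0003                                          # entry just created by efibootmgr -c
  ORDER=$(efibootmgr | sed -n 's/^BootOrder: //p')  # e.g. "0003,0001,0002"
  REST=$(echo "$ORDER" | sed -e "s/^${NEW},//" -e "s/,${NEW}//")
  efibootmgr --bootorder "${REST},${NEW}"           # move the new entry to the end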


As someone who has run /boot on raid1 for years, I think his idea in the article of keeping them synchronised in userspace at update time (ie rsync) is better. There is no reason to need realtime raid1 kernelspace sync when /boot changes only monthly.


One good reason is that mdraid takes care of syncing automatically, no need to set up rsync on every update of /boot.


> One downside to storing these images on the (FAT) EFI System Partition is it is not possible to make it part of a software RAID device, so if the device the ESP is on dies suddenly, the boot fails

It is possible to make the ESP part of an MDRAID level 1 array with metadata v1.0. True, the boot starts from a single disk and only later does Linux activate the array. So if your boot fails before that happens, reboot and try the other disk.

Or alternatively, don't set up MDRAID array for ESP, just have the ESP mirrored on all disks.
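A sketch of the first option (device names are examples); metadata 1.0 puts the RAID superblock at the end of each member, so the firmware still sees a plain FAT filesystem at the start:

  mdadm --create /dev/md/esp --level=1 --raid-devices=2 --metadata=1.0 \
    /dev/nvme0n1p1 /dev/nvme1n1p1
  mkfs.fat -F 32 /dev/md/esp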


"don't set up MDRAID array for ESP, just have the ESP mirrored on all disks. "

Isn't that the entire point of utilising MD RAID-1 in the first place?

I've done as you suggest in years gone by with either metadata v0.9 or v1.0 (not v1.1 or v1.2) where the metadata is not at the start of the block device and therefore does not confuse a non-MDRAID-aware firmware or other OS.


> I've done as you suggest in years gone by with either metadata v0.9 or v1.0

Then why did you write

> it is not possible to make it part of a software RAID device

You are saying it is both not possible and that you've done it.


It is NOT possible to have the UEFI access the RAID device so that if one of the mirrors is broken it can continue to boot in degraded mode.

It requires manual intervention of a knowledgable person to boot a UEFI system into the default OS if the ESP is broken or missing.


OK now I get what you mean. In that case, one can reboot and select the other non-broken disk and the system will boot, with some hiccups due to degraded array, but it will boot.


> through systemd, and ensure all boot images are autodiscoverable and automatically bootable.

See systemd-boot and BootLoaderSpec, both mentioned in OP.

https://www.freedesktop.org/wiki/Software/systemd/systemd-bo... https://systemd.io/BOOT_LOADER_SPECIFICATION/


yes - not all Linux distros do this


"The Boot partition will also have to carry an emtpy "efi" directory that can be used as the inner mount point, and serves no other purpose."

You could substitute Boot for Root in this sentence and flip it around on Poettering.


The root partition already contains a set of empty directories, and Lennart has been working on reducing those where possible (see usr-merge).


It just feels like such a small thing that is not even worth mentioning or taking into account.


I prefer to think that Poettering doesn't know mkdir exists. Eagerly awaiting systemd-make-directory-automatic@.path.


Meanwhile the Apple firmware cannot read a vFAT ESP. Apple wants the ESP to be HFS.


Out of curiosity, how recent is this? I haven't owned my own macbook since the pre-touchbar era (I think the last model I had was the early 2015 Pro), and I had heard that Linux had gotten harder to boot since then due to newer firmware (I remember hearing for a time the newer models were yet to get wifi support on Linux, although I don't know how long that lasted), but at least up until then I was able to use a fairly typical Linux setup dual booted alongside MacOS.

I remember using a blogpost I found as a reference, but I think the order of the steps I used was: turning off FileVault, shrinking the main MacOS partition by the amount I wanted to use for Linux, doing the Linux install the way I normally do (one FAT ESP partition with refind installed, then the rest as a LUKS volume with LVM root and swap), turning off SIP by booting the macbook into recovery, booting into MacOS and running the `bless` command to set the refind partition to boot by default, rebooting back into recovery to turn SIP back on, and then finally booting back into MacOS and turning FileVault back on.

Essentially, by temporarily turning off SIP and FileVault, I was able to get Linux booted by default with my usual LUKS/LVM setup but also have the option to select MacOS from my refind menu and have that booted with the usual FileVault/SIP protections. Based on what I'd read about the efforts to support Linux on the new ARM macbooks and Apple seemingly not going out of their way to block this, I would have thought that this method would still work, although maybe there's something I'm missing.


There is no UEFI on M1 Macs. However, the boot protocol there is completely different and complicated.


Oh wow, that is a big change! Even though I know Microsoft influenced and pushed the UEFI standard, my first macbook was my first introduction to EFI booting (since I had used legacy BIOS on my PCs up until college), so I've always associated it with Apple hardware.


M1 Macs have more in common with iPhones than with Intel Macs.

See https://github.com/AsahiLinux/docs/wiki/SW%3ABoot for example.


Some snarky people would even say that modern Macs are in fact iPhones, just with a different form factor.

The unification of the operating systems for both devices continues with every macOS release. (And that's not only about the unification of the naming scheme…)


It absolutely can. Fedora uses HFS+ for the ESP on Macs because it integrates more cleanly into the Apple boot menu you get when you hold down command on boot, but the firmware handles FAT just fine.


Interesting, when did that start? Our ancient iMac (2009?) with Ubuntu Mate on has a vfat ESP.


Icky as HFS may be, anything is better than VFAT.


I don’t really see the point in using the ESP for anything serious. Many of the arguments are also super weak, like the one about /boot/efi/ being nested (in how many cases is this actually important to anyone and anything?). The ESP size issue successfully prevents the real world adoption of this, since Linux kernels are 50-100 MB each, which means you could maybe fit one on your average ESP, and good luck convincing real users to reinstall Windows just to make some Linux guys happy.

Instead, I would prefer a different approach. An approach that can be seen in Windows and macOS, that is: no user serviceable parts inside. It would work like this:

* Keep two partitions.

* ESP contains a simple program (let’s call it stub), whose only job is to call the real boot loader on /boot.

* The stub is simple (minimum user interaction) and doesn’t need updating very often. It has drivers/support software for storage media and some reasonable file systems (VFAT, Ext4). It may also support some simple form of disk encryption if desired.

* The real bootloader (as well as any kernels) live on the separate /boot partition. The bootloader can do all the fancy things it wants, it can display a fancy wallpaper, support mouse input, and so on.

* The ESP is not auto-mounted, might not even be listed in /etc/fstab.

* Whenever the stub is updated (which would happen rarely, since it’s meant to be simple and minimum), some post-install scripts would mount the ESP in any location they please (be it /run/{uuid.uuid4()}), copy over the new stub, and immediately unmount it (a rough sketch of this step follows below).

Simpler, safer, will make it harder for rogue software or `rm -rf /*` to mess up the booting of the system, and will not require any changes to existing partition tables.
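The post-install step from that last bullet might look roughly like this (a minimal sketch; the stub path and the partition label are hypothetical):

  ESP_DEV=/dev/disk/by-partlabel/esp            # hypothetical way to find the ESP
  MNT=$(mktemp -d)                              # throwaway mount point
  mount "$ESP_DEV" "$MNT"
  cp /usr/lib/stub/stubx64.efi "$MNT/EFI/BOOT/BOOTX64.EFI"   # hypothetical stub location
  umount "$MNT" && rmdir "$MNT"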


Your "different approach" sounds almost exactly like the approach proposed in the fine article for the "ESP too small" case (a case which you assume as a given).


The "stub" would be in this case a (striped down) Linux kernel and an initrd… (Because that's needed for proper storage support)

Than this "init Linux" would boot the bootloader, which in turn would load another kernel and initrd?

This looks way to complicated.

How does secure boot work in this setup?

It's much simpler to just "glue together" a kernel and initrd, put it on the EFI partition, and boot that UKI directly by the built in EFI boot-manager.

The EFI partition needs in this setup also be only mounted when updating / adding a UKI. Otherwise it never needs to show up in the running system as it's only used by the EFI boot-manager. Preparing a UKI can be done on the root FS (in its local /boot dir).


Why is initrd needed for "real storage support"?

I usually have my storage drivers compiled into kernel, not as modules (because why have them as modules if you need them always).


Answering the question 'Why is initrd needed for "real storage support"?', because there will always be somebody who:

* needs to have sshd at the initrd stage to enter encryption keys (remote boot)

* needs to have a recovery tool on hand (going even further: imagine having a working graphical browser when storage fails)

* has some crazy storage setup - root on a ceph/nfs volume with non-trivial configuration, etc.

* in case of an electricity fault at home - it's the only system which has an exposed connection to the internet

My suggestion for booting would be the following:

1) have a UEFI bootloader which boots some minimal kernel (let's say some sort of LTS for stability's sake) from the /boot partition

2) have the ability to unlock and read from the root filesystem (same as above, rootfs on nfs/ceph/encrypted or whatever)

3) then scan for kernels at /lib/modules/{version}/bzImage (I know there are modules here, but having one additional file would not hurt too much)

4a) kexec the kernel with the necessary parameters to start the system [in this step we do not enter passwords/encryption keys twice!]

4b) or boot without kexec into recovery mode with some GUI (I want the ability to have a browser in recovery mode)


You can compile cryptsetup into the kernel? I didn't know that. Interesting. Have you a reference for that?

Or are you trying to say that you're booting without an initrd (which is said to have been broken in general for decades)?

Also the secure boot question remains.

I still don't get why complicate things in such way.

Imho nothing is simpler than copying a (signed) boot image onto the (firmware-managed) boot partition, and being done.


Is cryptsetup standard nowadays, or something that one would call "real storage support"?

Yes, I'm booting without an initrd, because in my case I don't need it - no cryptsetup, and I don't need optional drivers because my setup doesn't change the hardware. (E.g. a general distro kernel has to have them because they add modules for every piece of hardware there is, to support each user.)

How is boot without initrd broken? I use it since my first kernel compilation and find it simpler than dealing with initrd.


Almost everything makes sense, imho.

I have actually had almost such a setup for some time.

The only thing I don't understand: Why add something like `systemd-boot` to the setup? It's completely unnecessary! All you need is a UKI on the EFI partition. UEFI has a perfectly sufficient bootloader already. I never found out what additional advantages `systemd-boot` would offer.

Maybe someone could clarify?


>All you need is an UKI on the EFI partition. UEFI has a perfectly sufficient bootloader already. I never found out what additional advantages `systemd-boot` would offer.

It gives you a UI to choose what to boot and edit the kernel bootline. You don't need it if your UEFI firmware makes it easy for you to do that, or if you have edk2-shell available, but those are not true of all systems.


Sure you get a boot menu. But the UEFI bootloader also shows a boot menu if requested. I think this feature is universal.

Editing the kernel command line on the other hand is something you never need except for debugging or recovery. For that you would have anyway an extra UKI installed, with a recovery system in the initrd.

But in normal operation you never ever even see the boot menu.

I still fail to see what vital advantages `systemd-boot` would offer. (And this has nothing to do with systemd as such. I use most of its modules. I just didn't find any compelling reason to use the boot module also. Some very simple hook script that triggers efibootmgr when needed is imho perfectly sufficient.)


>But the UEFI bootloader also shows a boot menu if requested. I think this feature is universal.

There is no "the UEFI bootloader" and there is no "universal" boot menu. The firmware only has to boot EFI applications according to the EFI vars.


I've never seen a PC without a UEFI boot menu. That's why I've assumed that this is universal.

Who builds PCs without that feature?

Edit: I've had a look at the spec. Indeed a boot menu isn't mandatory (except when it is, for network boot options, see `PXE_BOOT_MENU`).

> UEFI specifies only the NVRAM variables used in selecting boot options. UEFI leaves the implementation of the menu system as value added implementation space.

But it seems to be common to implement this, as the spec also says:

> If the boot via Boot#### returns with a status of EFI_SUCCESS, platform firmware supports boot manager menu, and if firmware is configured to boot in an interactive mode, the boot manager will stop processing the BootOrder variable and present a boot manager menu to the user.

There are at least a dozen more references to a boot menu in the spec.


I set a 10 second timeout for systemd-boot so I don't need to button mash to hit the narrow window of the UEFI bootloader.

Another useful feature would be editing kernel parameters for troubleshooting. Not sure if systemd-boot can do that but GRUB can.



> For example, it’s probably worth mentioning that some distributions decided to put kernels onto the root file system of the OS itself. For this setup to work the boot loader itself [sic!] must implement a non-trivial part of the storage stack.

IIRC, older bootloaders like LILO used a simpler approach: after each kernel update, a userspace program asked the kernel for the list of sectors which contained the kernel file, and wrote the list to a map file; the bootloader then read that map file (its sector hardcoded into the bootloader by the same userspace program), and loaded the kernel by reading the sectors directly. Neither the bootloader nor its userspace installer needed to know anything about filesystems or other parts of the storage stack, and it worked perfectly with RAID 1.


That only works with simple filesystems, though; it'll fall apart if your root filesystem uses, say, compression, encryption, or possibly any RAID except mirroring (depending on the details and what the bootloader can handle).


> LILO

Ah, that stirs memories. That 1,024-cylinder boundary the kernel image had to reside below, hence the need for a separate /boot partition. And every time there was a new kernel available, I had to re-run some command to recreate that map file.

Things have become a lot more convenient since then.


Pardon the snark, but I'm old.

And more complex at the same time. Now I need to do a silly dance to register a GUID with my BIOS so it can find an executable and run it for me (you did remember to add efivars to your kernel, right?), hold a vFAT partition around specifically for this case (what package has mkvfat again?), and worry about how the system maintainers will decide to "improve" what they think is wrong with this hierarchy.

Previously.. the BIOS would find a particular sector on what I told it was the "boot drive" and run it without too much concern over what, if anything, happened next.

And for this change, do I get a BIOS that's more helpful when there's a frustrating boot misconfiguration? A log to show me what precisely went wrong? The ability to save debug information in any relevant format? I could, but zero vendors in the consumer space are going to do that.

Now.. essentially, I just don't have to specify a "boot drive" anymore. Other than that, if you're bringing a system up from scratch, EFI has just as much "janky magic" in it that the old boot sector method did.


You are correct, the good old BIOS was not pretty, but it worked and was simple enough one could understand it and its common failure modes. All the complexity we keep piling up is going to come back and bite us eventually.

I was just speaking of LILO vs grub, though. (I heard that grub is not pretty internally, but as a mere user, it has not given me trouble.)


Even Windows has a separate, NTFS boot partition these days. Fail to see the point of this, and since the main take basically is "put your /boot inside the FAT ESP, or if not possible, make /boot a FAT partition", it's also bound to create a lot of disagreement.


Windows does not have a separate NTFS boot partition. MSR is not it (check the size and content). Windows Boot Manager and BCD are stored on the EFI partition. Windows Boot Manager itself does understand NTFS: it loads winload.efi, ntoskrnl.exe and core drivers from the system root itself. This way, Windows is not going to have the common linux problem ("update failed, /boot too small").

Two partitions are needed only for Bitlocker; one has to be unencrypted.

Similarly with Apple: they use APFS subvolume for boot files. They do not bother with multiple partitions and static allocations, guessing, what size is going to be OK. They can use as much or as little space as they need.

--

With Linux, I've been using btrfs subvolume for /boot. It works with "normal" distributions, grub complains (it cannot write there; I find that OK). The dynamic nature of the space used is great. It doesn't work with ostree-based distributions (Fedora Silverblue & its ilk); ostree cannot generate proper BLS and grub.cfg for subvolumes.


I'm talking about the WinRE partition, which is required to boot a BitLocker-encrypted boot partition (and BitLocker is enabled by default). Enabling BitLocker without one results in an error message, and Windows happily recreates/resizes the WinRE partition on every (OS) upgrade by simply reducing the size of the main partition. For a long time now, the ESP has not been big enough for all the stuff that Windows wants to do at preboot.

For the record, and showing again the unfairness of the entire MS monopoly situation, most commercial UEFI implementations out there happen to understand NTFS. This allows e.g. a Windows pendrive to boot no matter how the user formats it.


Does it? On my Windows 11 install, the EFI partition is still FAT32. There are no other partitions than the C and the recovery partition.

Am I missing something?


The commenter is saying that there is an NTFS boot partition that is chained after the ESP. So UEFI mounts and execs whatever is in the (vFAT) ESP, and then that ESP bootloader loads data from the (NTFS) boot partition.


The MSR (that other partition) has no role in Windows boot. Windows will work without it being present at all.


Windows has been setting up Reserved partitions with boot code for some time now - even (or especially) on systems without EFI


The OP is about a separate boot partition, which is normally where the kernel and associated data live (on Linux, an initramfs; obviously Windows would differ a bit).

The "Reserved" partition on Windows machines isn't really a boot partition, for any meaningful definition of it. It's just … reserved, and MS being MS. On my machine, it's empty (unformatted, all 0s). It is lightly documented here: https://learn.microsoft.com/en-us/windows-hardware/manufactu...

(I'd expect your typical GPT Windows install to have about four partitions: the ESP, the empty "reserved" partition, a recovery partition, and the main NTFS partition.)


> The "Reserved" partition on Windows machines isn't really a boot partition, for any meaningful definition of it. It's just … reserved, and MS being MS. On my machine, it's empty (unformatted, all 0s). It is lightly documented here: https://learn.microsoft.com/en-us/windows-hardware/manufactu...

IIRC, that "reserved" partition is to allow converting the data partition which follows it to a "dynamic disk" (which AFAIK, is Microsoft's equivalent to a Linux LVM PV). That conversion needs to grow the partition backwards to prepend some headers, and that extra space comes from shrinking the reserved partition just before it.


On various BIOS-based systems, the reserved partition would contain files necessary for booting Windows from its system volume, bridging the gap between what could be accessed by simplistic MBR boot code, the NTFS boot code block, and the NTFS-understanding, ARC-emulating (for NT5) or EFI-emulating (NT6) boot system that would load the target system.

Details of whether a reserved partition is created and what ends up on it depend on the hardware you're installing on, and if a separate boot partition is necessary, the Windows installer informs you about the need to create an extra partition.


I yearn for a world where hard disk partitions are a memory of primitive times past confined to the minds of retrocomputing enthusiasts. I want my system board to have a firmware that can detect common variants of volume management including LVM and zpools, perhaps that thing Windows has that's kinda like software RAID for people into that kind of thing, perhaps even stuff like OpenBSD's softraid stuff.

1. Find physical disks that may be a part of a managed volume group/pool

2. Find logical volumes and file systems on the volume pools

3. Mount a filesystem by some configurable logic

4. Load and execute a kernel from the filesystem

Having to allocate the first "blocks" on the imaginary "sectors" of my SSD (or even worse, a virtual disk drive) for some arbitrary amount of space formatted in possibly the most barebones filesystem still in mainstream use feels quaint and irritating and limits my ability to use that disk in a larger storage pool.

UEFI is an overcomplicated specification with lots of wintel baggage, but most of it doesn't personally offend me. What does is that UEFI had the chance to abolish disk partitioning, but instead enshrined it. And added mandatory FAT32 to add insult to injury.

My main laptop and desktop each have a separate disk for the EFI system partition. The former uses systemd-boot and the latter ZFSBootMenu. This way I have a maximum of one partition per disk. It's not ideal, but I like it better than the usual solution.

The disks in my zpool show up as having partitions 1 and 9, but I consider that an implementation detail since I never need to treat the disks as anything other than entire disks in a pool


For many years now I have not created partitions on any SSD or HDD, because I believe that this serves no useful purpose and just wastes a part of the SSD/HDD.

I format the raw, unpartitioned SSD/HDD directly with a file system that uses 100% of the capacity, with no wasted sectors. At least on Linux and FreeBSD, there is no need for partitions.

For booting the computers, I either boot them from Ethernet or I boot them from a small USB memory that uses a FAT file system for storing the OS kernel, either in the format required by UEFI booting, or, when booting Linux in legacy BIOS mode, together with syslinux, which loads the kernel.


So your solution to not use partitions is to use multiple disks? You do understand that people invented partitions precisely because they wanted to use a single disk, right?

I am glad this setup works for you, but many people will not want to need a USB drive to boot their desktop, laptop, tablet, phone, et cetera.


For a desktop, this can be transparent for the user, because it can be booted via Ethernet or from a USB drive that is attached all the time to the computer, possibly on one of the internal USB type A connectors that exist on many motherboards precisely for this purpose.

Using a single disk with multiple partitions is less convenient than using a separate boot drive, because the separation makes it easier to reuse both the boot drives and the root drives in other computers, or to copy them onto drives of different sizes when migrating or cloning operating systems.

For a laptop that contains encrypted SSDs/HDDs, using a USB drive to boot it, which is not normally kept with the laptop, can improve security, because only in this case do no secret keys need to be stored on the encrypted drive. Even if the secret keys are also encrypted, they must be encrypted with a key derived from a password, which can make their decryption much easier than the decryption of a drive encrypted with a random key.


> For booting the computers, I either boot them from Ethernet or I boot them from a small USB memory that uses a FAT file system for storing the OS kernel, either in the format required by UEFI booting, or, when booting Linux in legacy BIOS mode, together with syslinux, which loads the kernel.

That certainly works, but I'm pretty sure that moving booting off of your main disk is the only reason you can go without partitions, and I'm also pretty sure that most people don't want to deal with that.


You're talking about saving _at most_ 200MBish. That's a lot of work to maintain for little gain...


There is less work, not more work.


This sounds like work to me

> For booting the computers, I either boot them from Ethernet or I boot them from a small USB memory that uses a FAT file system for storing the OS kernel, either in the format required by UEFI booting, or, when booting Linux in legacy BIOS mode, together with syslinux, which loads the kernel.

Creating boot USB drives (which I think need partitions, don't they?) or setting up a PXE boot server would take me a lot more effort than an extra minute with gdisk to create partitions before formatting the disk.


If the USB drives were bought formatted as FAT, which is always true for those smaller than 32 GB, they already have the required partition.

For booting with UEFI, you just need to create the directories with the names expected by the firmware. For legacy booting, you just need to install syslinux, which takes a second.
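
In practice that amounts to roughly this (a sketch; device names and paths are illustrative, and for the UEFI case the kernel must have the EFI stub plus a built-in command line):

    mount /dev/sdX1 /mnt/usb
    # UEFI: the firmware's fallback path on a FAT filesystem
    mkdir -p /mnt/usb/EFI/BOOT
    cp /boot/vmlinuz /mnt/usb/EFI/BOOT/BOOTX64.EFI
    cp /boot/initrd.img /mnt/usb/EFI/BOOT/
    # legacy BIOS: install syslinux (plus a syslinux.cfg naming kernel and initrd)
    syslinux --install /dev/sdX1
    umount /mnt/usb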

Then the USB drive can be used to boot any computer, without any other work, for many years.

When you change the kernel, you just mount the USB drive (which is not mounted otherwise), copy the new kernel to it (possibly together with an initrd file), renaming it during the copy, unmount the USB drive, and that is all.

You can keep around a few USB drives with different kernel versions, and if an update does not go well, you just replace the USB drive with one having an older version.

Configuring a DHCP/TFTP server for Ethernet booting is done only once.
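
A minimal dnsmasq setup of that kind looks roughly like this (a sketch; interface, subnet, paths and filenames are illustrative):

    # /etc/dnsmasq.conf -- act as proxy DHCP + TFTP next to the existing DHCP server
    interface=eth0
    dhcp-range=192.168.1.0,proxy
    enable-tftp
    tftp-root=/srv/tftp
    dhcp-boot=pxelinux.0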

Adding extra computers may need an extra directory copy on the TFTP server only when the new computers have different hardware that requires different OS kernels.

Updating a kernel requires just a file copy into the directory of the TFTP server, replacing the old kernel.

None of these operations requires more work than when using a boot partition on the root device.

There is less work because you set up the boot USB drives or the DHCP/TFTP server only once for many years or even decades, while you need to partition the SSD/HDD whenever you buy a new one that will be used as the root device.


Yeah, I don't see it. I don't really have any investment in how you boot your machines, but I can't see this being anything but significantly more work than just using the tooling that's already there for you. When I buy a computer I set it up once and then it lasts 3-6 years. Even setting up the system you've described once would likely take me more time than I've spent adding partitions to disks in the last 20 years. Heck, that's probably true even if you include all the servers I've administered in that time as well as my personal machines, especially since those all ran Ubuntu or RedHat where the installer just does it for me, vs my personal Arch machines.

I partition a new computer once every few years. I upgrade the kernel a few times a month. With the normal way that's a simple `pacman -Syu` or `apt-get dist-upgrade` and it's handled, no mounting thumb drives or sftp needed.


It's amazing what lengths people go to to justify their convictions and not realize the silliness. You've just described a convoluted setup with many drives and computers on the network and claim there is no more work.

That is more work for everybody except those who want a) a completely encrypted main disk and booting from a portable, guarded USB drive, or b) a set of computers in a school or internet cafe with the boot process managed/updated efficiently via the network.

This system does provide new special capabilities, but it is not free. Meanwhile, most users are happy with defaults like a 1G partition with bootloader and kernel, which allows easy updates without worrying about having the right USB drive, having it in the right port, having it mounted at the right path, or losing it.


When I install a new OS I typically just use the guided installer which makes the partitions automatically. This is usually the default too. I would actually have to go out of my way to set it up so that the drive is a single partition then on top of that create a USB drive that I'd need to always have on hand which sounds like a tremendous PITA with a laptop.

If it works for you, great, but that is a LOT of extra work to regain less than 1% of the storage on a drive.


That is risky, since without a partition table, some operating systems and disk management tools will treat the disk as empty, making it easy to accidentally overwrite data.

> 100% of the capacity, with no wasted sectors.

You will never have that. SSDs have a large amount of reserved space, and even on HDDs, there are some reserved tracks for defect management.


By "some operating systems and disk management tools" you mean MS Windows and Windows tools.

Obviously, I do not use unpartitioned SSDs/HDDs with Windows. On the other hand, with Linux and *BSD systems they work perfectly fine, regardless of whether they are internal or removable.

For interchange with Windows, I use only USB drives or SSDs that are partitioned and formatted as exFAT. On the unpartitioned SSDs/HDDs I use file systems like XFS, UFS or ZFS, which could not be used with Windows anyway.

Any SSD/HDD that uses non-Windows file systems should never be inserted in a Windows computer, even when it is partitioned. When a SSD/HDD is partitioned, it may be hoped that Windows will not alter a partition marked as type 0x83 (Linux), but Windows might still destroy the partition table and the boot sector of a FAT partition. It happens frequently that a bootable Linux USB drive is damaged when it is inserted in a Windows computer, so the boot loader must be reinstalled. So partitioning a USB drive or SSD does not protect them from Windows.

>> 100% of the capacity, with no wasted sectors.

> You will never have that.

I thought it was obvious that I meant 100% of the capacity available to users, because there is no way to access the extra storage used by the drive controller, and also no reason to want to access it, because it has a different purpose than storing user data, so your "correction" is pointless.


Also, some dumb firmware may write to such disks; ASRock boards were reported in the past to do that.

EFI+boot partitions usually take less than 2 GB of space, and can be made as small as ~200 MB total, while mainstream disk capacity is hundreds of GB nowadays.

This "loss of useful space" is immaterial in most cases. Maybe if you have something like a 2GB drive from 1990s that you want to use (why?) then it makes sense to shave off 1G off that. But it is more work, as you have to buy, prepare and manage the USB drive.


There are more reasons to set it up more or less like that.

Think of an expensive, super fast but considerably small SSD, and some cheap big mass storage (maybe even on spinning rust) alongside it.

You'll likely try to use the expensive SSD as efficiently as possible. Every GB counts if you have "only", say, 0.5 TB.

A boot partition on such expensive and small (but fast) media is pure wastage.

Also this kind of setup seems not so uncommon, as I can claim that I've done something similar. :-)

There are even more reasons. It makes things even simpler and less error-prone:

The argument that you can swap disks more easily was mentioned already. But that's not everything one gains.

SSDs are very prone to wearing out much quicker and losing at least half of their performance when you mess up the data alignment on them. In the case of FDE with partitions (maybe even on top of LVM), the alignment issue isn't trivial. It's quite easy to mess up the alignment by mistake. You can read a lot of docs, try to find out details about the chips on your SSD, do calculations, yada yada, or you just encrypt the raw device and use the whole disk without partitions. That's considerably simpler; nothing can go wrong.


> Every GB counts if you have "only", say, 0.5 TB.

It doesn't. You can make the overhead partition take 200 MB. That's an immaterial fraction of 0.5 TB. You aren't going to see the impact of this loss. Additionally, by partitioning the drive, you protect it from dumb programs that like to create partition tables.

Yes, there are reasons for not partitioning your OS disk, like full disk encryption. But it is more work.

> when you mess up the data alignment on them. In case of FDE with partitions (maybe on top of LVM even) the alignment issues isn't trivial.

This sounds interesting. What are these alignment issues? Why do you think they are present on a disk with partitions (I never had those issues) and why do you think they are not present on a disk without partitions (maybe they are, due to compression/encryption)?


If you would anyway use only one partition (because boot is elsewhere), having no partitions at all is not more but less work.

Alignment issues are only really relevant in the case of SSDs. The FS blocks need to align with the "physical" blocks of the chips used. (Actually these are also "only logical blocks", presented to you by the SSD controller, but at least this is fully transparent). If the alignment is messed up, the SSD needs to touch at least 2 "physical" blocks (as presented by the controller to the OS) when accessing a single FS block. This doubles the wear and halves the performance. (At least; in really unhappy scenarios this can even triple the access effort).

Where exactly an FS block starts and ends in relation to the underlying "physical" block(s) depends on all the "headers" that are "in front" of the FS blocks (or logically sometimes "a layer up"; even "physically" this also only means "in front"). Partition tables are headers. LUKS headers are obviously also headers that need to be taken into account. LVM headers (and blocks, groups, volumes) are one more layer to consider.

To make things more fun, like I said, the "physical" blocks are only an abstraction presented by the controller. In some cases their size is configurable through the SSD controller firmware. (But this shouldn't be done without looking at the chips themselves). The more interesting part is: the "physical" blocks can have "funny" sizes… (something with some factor of 3, for example). Documentation on this is frankly sparse…

The usual tools just assume some values that "work most of the time". But this whole problem area is actually quite new. Older versions of all the related tools didn't know anything about SSD block alignment. (Like I said, they still don't know anything for sure; there is no way to know without looking at the docs and specs of the concrete device, but now at least they try to guess some "safe values", with a large margin).

If you use partitions you'll end up with those "funny" few-MiB offsets, which you have seen for sure. (If you don't use such offsets it's very likely that the alignment is wrong).

Without partitions the other storage layers are much easier to align. You don't need to waste a few MiBs around your partitions, and especially don't need to remember (and maybe even recalculate) this stuff when changing something.

Not many people know about this whole dance, as misalignment isn't a fatal problem. It will just kill your SSD much quicker and halve the performance (at least). But SSDs are so fast that most people would not notice without doing benchmarks… (Benchmarking the different storage layers is actually the only way to test whether you got the alignment right).

If you don't look into this yourself you can only pray that all the tools used were aware of these issues and guessed values that happen to work properly with your hardware. But if you created partitions without the "safe" offsets (usually by setting values yourself and not letting the tool choose its "best guess"), the alignment is quite likely wrong.

I came across this issue because I was wondering why Windows' fdisk always added seemingly "random" offsets around partitions it created. It turns out it's a safety measure. Newer Unix tools will do the same when using the proposed defaults.

TL;DR: If you don't create a partition table on an NVM device you can just start your block layer directly on block zero and don't have to care about much, as long as you also set the logical block size of that layer to the exact same value as the (probably firmware-configurable) "physical" block size of the device. If you have a (GPT) partition table in front (which is, by the way, of varying size, to make things even more funny) you need to add "safety offsets" to your partitions. Otherwise you're torturing your NVM device, resulting in severely crippled performance and lifetime.

I hope further details are now easy to google in case anybody likes to know more about this issue.
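
If you want to sanity-check your own layout, the kernel and parted expose the relevant values (a sketch; the device name is illustrative):

    cat /sys/block/nvme0n1/queue/physical_block_size
    cat /sys/block/nvme0n1/queue/optimal_io_size
    # ask parted whether partition 1 starts on an "optimal" boundary
    parted /dev/nvme0n1 align-check optimal 1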

---

> Additionally, by partitioning the drive, you protect it from dumb programs who like to create partition tables.

The better protection would be to keep drives far away from operating systems and their tools that are known to randomly shred data… ;-)


Thanks for the effort, but this is not very convincing. Is there any documented case where physical blocks have size that in bytes is not some power of 2? I suspect if that exists, it is quite a rare device. Blocks of size 512B, 4K, 8K are the most common case, and correct alignment is completely taken care of by the 1MiB offset which is standard and default in fdisk and similar tools on Linux. You mention "random" offsets with newer Unix tools - I have never encountered this. Any examples?


> Thanks for the effort, but this is not very convincing.

I've written this to shed light on the alignment issue, as I was under the impression that this would be something completely new to you. ("This sounds interesting. What are these alignment issues?")

> Is there any documented case where physical blocks have size that in bytes is not some power of 2?

Yes, there are examples online. I did not make this up!

It was in fact some major WTF as I came across it…

> I suspect if that exists, it is quite a rare device.

Yep, that's for sure.

Also the documentation is very sparse on this, like already mentioned.

I think it was the early triple-cell chips that had such crazy layouts. (Did not look it up again; maybe this was only a temporary quirk; but maybe it still exists, no clue).

> Blocks of size 512B, 4K, 8K are the most common case, and correct alignment is completely taken care of by the 1MiB offset which is standard and default in fdisk and similar tools on Linux.

Well, it depends.

Those thingies I've read about with some factor of 3 in their block size would need at least a 1.5 MiB offset… (And the default 1 MiB offset would torture them to a quicker death; but most people would likely never find out).

There are devices with much bigger (optimal) block sizes, I think in the MiB ballpark (I don't remember the details off the top of my head, would need to look it up again myself). Also in such cases the 1 MiB offset would not suffice.

Those devices are usually in some compatibility mode in factory settings, with much smaller blocks than optimal for maximal performance and least wear. You need to tell the firmware explicitly to switch the block size to get best results (which is of course not possible after the fact as it obviously shreds all data on the device).

Also it's not only the offset around the partitions. You also need to take the block sizes into account for the block layers "inside" the partitions. Which was actually my point: this makes things more complicated than strictly needed.

> You mention "random" offsets with newer Unix tools - I have never encountered this. Any examples?

By "random" I've meant that the offsets appear seemingly random when you don't know the underlying issue. It's not only the one offset after the partition table. Depending how large the partitions are there may be or may not be additional offsets around the partitions themself.

Of course all this is not rocket science. We're talking about simple calculations. But that's just one more thing to be aware of.

My conclusion from that journey back then was: just screw it, and don't add partitions to the mix if you don't strictly need them. One thing less to care about!

For example, the laptop I'm writing this on has two NVM devices. The bigger and faster one is used as the (encrypted) root FS, without any partitions on it; the other, smaller and slower one carries the EFI partition and an (encrypted) data partition. If I had partitions on the root disk this would not give me any advantages, just additional stuff to think about. So why should I do that? OTOH I need an EFI partition to boot. So I have created one on the other disk. I think this is a very pragmatic solution. Just don't add anything that you don't need. <insert Saint-Exupéry quote about perfection here>


Alright that makes sense.


I have taken this approach for secondary drives where I want to use the entire drive as a big filesystem for data.

For the system disk I have always partitioned it though. I generally create at least /, /var, /home, and /usr. That way it's less likely that a runaway process can fill up the entire disk, at worst it might fill up /home or /var.

And unless I'm really space-constrained, I'll leave some unpartitioned space as well, for later flexibility.


That is an excellent application of partitions but it's better done with LVM so you can change the partitions' sizes easily. You should be able to install LVM on the whole disk.
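
Something like this, as a sketch (device names and sizes are illustrative):

    pvcreate /dev/sdb
    vgcreate vg0 /dev/sdb
    lvcreate -L 30G  -n root vg0
    lvcreate -L 20G  -n var  vg0
    lvcreate -L 100G -n home vg0
    # leave the rest of the VG unallocated; grow a volume (and its fs) later with:
    lvextend -r -L +10G /dev/vg0/home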


You can still do BIOS boot on disks without partitions. One huge advantage of "legacy" boot is that it can work filesystem-agnostic, avoiding any secondary FS implementations in the firmware or in the bootloader.

And if you go that far, you can throw out any FS kmod from the initrd except for what you need for your root partition. Including vfat.


This blog post talks about systemd a lot. Why would I want to use systemd to set up boot partitions? I don't even have it installed on my Linux distribution.


It's a P.R. effort to prepare the "community" for more bad and harmful solutions from Mr. Poettering and his employer Microsoft. It does not solve any real problem users have.


> I personally believe that making use of features in the boot file systems that the firmware environment cannot really make sense of is very clearly not advisable. The UEFI file system APIs know no symlinks, and what is SELinux to UEFI anyway? Moreover, putting more than the absolute minimum of simple data files into such file systems immediately raises questions about how to authenticate them comprehensively (including all fancy metadata) cryptographically on use

It makes all the sense if you want the firmware to load your bootloader binary and then get lost.

Firmware is not to be trusted to read, write, interpret, analyze or send your boot file system elsewhere. It's a closed, proprietary part of your hardware, ffs, and unless that changes, it should be kept in the dark.

When there is hardware with free software firmware we can fix, then we can talk about accommodating firmware vendor needs.


EFI must be a separate partition, right. But why should /boot? Mine is just a directory in the root filesystem.


It does not always have to, especially if your rootfs is simple. But if your rootfs is on LVM on MD RAID... then it's best to have a separate, simple boot fs.


> Consider removing any mention of ESP/XBOOTLDR from /etc/fstab, and just let systemd-gpt-auto-generator do its thing.

TIL! My NixOS configuration just got a little bit simpler, and more uniform between machines.
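
(If anyone else tries this: a quick, non-authoritative way to check what the generator decided; the exact unit names depend on where it chooses to mount things:)

    systemctl list-units --type=mount   # look for the /boot and/or /efi mounts
    bootctl status                      # shows which partition systemd treats as the ESP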


As they say the more things change, the more they stay the same.

Back in the day, NetWare would boot off a small DOS partition located at the front of the boot disk, and was started from autoexec.bat. Blew my mind at the time.


> In a trusted boot world, the two file systems for the ESP and the /boot/ partition should be considered untrusted: any code or essential data read from them must be authenticated cryptographically before use.

In the real world, my ESP and boot partition are trusted, since I've installed them and control them and can check them for malware, and the firmware is not trusted since there is no possibility of control; it's vendor blobs.

Mr. Poettering is clearly arguing for interests of the secure boot/DRM complex, not competent owners/users.


How often do you check your unauthenticated ESP and /boot for malware? Only when you suspect malware is already there? If so, it's too late. Locking them down as suggested helps to prevent the issue altogether. Yes, firmware should not be trusted, but reducing attack surfaces is a good thing.

As far as your last point, the vast majority of people aren't competent computer owners or users. They don't know what to do on a technical level with malware in their boot process.


> How often do you check your unauthenticated ESP and /boot for malware?

I don't, because it's not something I so far cared about. But if I started caring about these sorts of attacks, I would not rely on board firmware to check software on my disks:)

> vast majority of people aren't competent computer owners or users.

Vast majority of people don't read HN and are not interested in the boot process. Those that do are more relevant, and I suspect many of them don't welcome Mr. Poettering's agenda.

> They don't know what to do on a technical level with malware in their boot process.

I'm not suggesting they should. But Mr. Poettering does not discuss origins and possible ways to solve this problem, e.g. using free software from reliable sources only and auditing the software and hardware we use. He presents his (and the secure boot complex's) preferred solution to a problem that is complex, and will probably try to "gently push it" on distributions like he did with systemd (his words).


I still use MBR. No need for extra boot partitions or all that bloated UEFI jazz. MBR on GPT disks works just fine in compatibility mode.


Indeed. UEFI brings "solutions" to problems we don't have, but corporations would like those solutions for some reason.


Hmm, I wonder what to do if you want to be able to boot the same system via both BIOS and UEFI.


You can have both MBR and GPT on the same disk. GRUB explicitly supports it too: if it finds an EE00 partition type, it will parse the disk again as GPT and boot normally, loading the second stage from the ESP. A UEFI system will ignore the MBR and use the GPT table to boot directly into the ESP. The Gentoo wiki has a description of how to set this up.
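
A related (arguably simpler) approach, rather than a hybrid MBR, is a plain GPT disk with both a BIOS boot partition and an ESP, so GRUB can be installed for both firmware types; a sketch, with device name and sizes illustrative:

    sgdisk -n 1:0:+1M   -t 1:ef02 /dev/sdX   # BIOS boot partition (GRUB core.img)
    sgdisk -n 2:0:+512M -t 2:ef00 /dev/sdX   # EFI system partition
    sgdisk -n 3:0:0     -t 3:8300 /dev/sdX   # root
    grub-install --target=i386-pc /dev/sdX                       # legacy BIOS path
    grub-install --target=x86_64-efi --efi-directory=/boot/efi   # UEFI path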


I’m curious why that would be useful?


Live images. Also if you still use BIOS-era hardware you found in the trash because you can't afford to buy hardware, and, if someone gave you a UEFI device, you can't afford the additional storage devices needed to migrate data, and/or you don't want to have to deal with migrating a system or reinstalling it.


>what to do if you want to be able to boot the same system via both BIOS and UEFI.

UEFI is simply supposed to find an EFI folder containing boot files, on a FAT32 partition.

A proper UEFI firmware will find the EFI folder regardless of whether the layout is for MBR or if it is GPT.

So you just use conventional MBR layout, format with FAT32 and throw on a (carefully crafted) EFI folder.

This is how Windows live (startup) media is normally done, so you can boot to either type and install Windows to the PC. On recent Windows versions the install.wim file has crept up in size beyond what FAT32 will support, so NTFS is needed for them.

Linux live distros do really well booting from FAT32 using Syslinux (NTFS is also an advanced option), just like the monolithic ISOs do using Isolinux. But to install the distro on the HDD I want an additional EXTx partition to hold Linux, and still boot to it from the FAT32 boot volume.


No mention of zfs, btrfs, xfs, lvm, or fde volumes from the makers of excessive-complexity cancers of systemd and pulseaudio.


Didn't you read the article? We're supposed to use vfat only and prepare for firmware checking our disks and revoking our boot rights on keyring and denylist updates. (LOL)


I'm saving this for the future.


Using ZFS makes this all a lot simpler.


I fail to see how? The ESP partition must be vFAT on GPT, in order for the BIOS to find it. Your BIOS doesn't speak ZFS.

The main partition can be whatever, but that's not typically available until after the kernel & initramfs are loaded. (As it is typically initramfs that does the prompt for the password, to decrypt it.)


ZFS is the "crypto solves this" of filesystems.

Adding out of tree ZFS to the boot mix sounds hella complicated.


Interestingly, GRUB actually supports ZFS; it has the dubious distinction of being the only extant implementation of ZFS that's GPL licensed, but... probably because of that... it's separate from the main OpenZFS implementation and is extremely feature-poor. This results in fun things like Ubuntu's root-on-ZFS layout creating 2 pools: a boot pool (bpool) that GRUB can read, and a root pool (rpool) with the OS. It's not that complicated, but it's not nice.
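
The two-pool layout looks roughly like this (a sketch for OpenZFS 2.1+; device names are illustrative and most of Ubuntu's actual dataset layout is omitted):

    zpool create -R /mnt -o compatibility=grub2 -O mountpoint=/boot bpool /dev/sdX3
    zpool create -R /mnt -O mountpoint=/ rpool /dev/sdX4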


That's because in the mid-oughts Solaris used GRUB for booting.

It's really nice because you get to boot from multiple datasets in the same pool so upgrades and downgrades can be very smooth.


Interesting. I've never really wanted /boot on ZFS, and I definitely don't see the point if I'd need a dedicated pool for it.


It could be really cool; it would let you snapshot your boot filesystem and roll back to a previous configuration.

...I say, as someone who does in fact leave my boot filesystems on VFAT:)


Ubuntu did snapshot the bpool; unfortunately, it did a poor job of garbage-collecting the snapshots, meaning that eventually you would have failing kernel updates due to lack of space and have to clean it up manually.

Since 22.04, zsys (the tool that did the snapshotting) is not installed by default.


> Since 22.04, zsys (the tool that did the snapshotting) is not installed by default.

Er, are they not snapshotting the root filesystem or e.g. /home by default then?


No, you have to install and enable zsys yourself.


It's a lot easier to manage one pool occupying one large partition on all the disks in the pool than it is to manage any number of partitions without pooling.


It isn't really. For instance, I have 3 things installed on my EFI partition:

- Windows

- A Linux kernel with ZFS support sufficient to boot up and load zfsbootmenu

- rEFInd boot menu

The default target is rEFInd, which shows this snazzy menu to choose between Windows and Linux:

https://i.redd.it/n51sxvv8xfw61.jpg

If you choose Linux, the zfsbootmenu kernel boots up, imports available pools, and lets you boot your current system or a prior snapshot of same. If you have regular snapshots, especially automated ones before upgrades, it's pretty hard to mess it up.

The whole works is one small partition and is fairly easy to understand.


Ubuntu does it.


No, it does not.


> As growing the size of an existing ESP is problematic (for example, because there’s no space available immediately after the ESP, or because some low-quality firmware reacts badly to the ESP changing size)

> Code quality of the firmware in typical systems is known to not always be great. When relying on the file system driver included in the firmware it’s hence a good idea to limit use to operations that have a better chance to be correctly implemented.

Remind me again why I should cater to people who insist on writing and running terrible code.


Because that's everyone? What computer are you using that has really high quality firmware?


I guess that is fair. Most of my non-Android computers use U-Boot. One is some UEFI implementation. I don't know how it copes with growing the ESP. I don't see why it would freak out, though.


Many of us have been burned by making the assumption that UEFI implementations behave sensibly. Most of the pain points have been ironed out by now, yes, but the lesson I've taken away is: don't assume.



