Extending the Linux Kernel with Built-In Kernel Headers (linuxjournal.com)
91 points by jrepinc 87 days ago | 59 comments

It’s a simple and practical solution. It irks me for some reason but I can’t think of something better.

It's too bad the squashfs idea was dropped. Having to download and unpack the archive somewhere seems like an extra step that should be unnecessary vs just loading the module and having the tree appear in /sys for you to point your compiler directly at.

Also a bummer that it looks like you have to be running the kernel to get access to this— perhaps the archive could be marked off in the binary somehow so that there's a way to access it for non-running kernels (eg, the dkms use case of wanting to build the module for that kernel you just installed ahead of booting it for the first time).

Overall binary size is a legit concern, but if this is going to be in an optionally-loaded module anyway, it seems weird to make that call just based on memory usage. I imagine most distro kernel builds will just disable this, since they already ship the headers in separate packages.

squashfs was dropped because Greg doesn't like squashfs, which I can understand somewhat (it doesn't have an active maintainer really, there's no userspace library, just binaries, etc.).

However, we're doing a project where I work that depends heavily on squashfs, so it worries me that it was dropped from this, even though it would have been the nicer option, because people dislike its lack of an active maintainer. Hopefully someone picks it up :)

Oh, gosh. Since when has it been without a maintainer? I would think it's quite an important piece of most embedded Linux projects. Those companies really should sponsor someone to take care of it.

My personal project also depends on squashfs. Are there any other options for read-only compressed rootfs?

Squashfs-tools are a bit rough around the edges, that's for sure. I patched in the ability to produce an image without root for my own use, but I think it would be generally useful. The sort list file format is funky. It's "filename_string priority_integer", so it will not work for files with spaces in their names. Nor with newlines, but those are far less common. It has not been a problem for me yet, and I can always patch it more. One could say that people who have files with spaces deserve it, though that's an entirely different issue.
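
A minimal sketch of one way around the space problem, assuming the "filename priority" layout described above: split each line at its last run of whitespace instead of its first. The name `parse_sort_line` is hypothetical and this is not how squashfs-tools parses the file; it only illustrates the idea, and it still cannot cope with newlines in filenames.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: parses one "filename priority" line in place.
 * The split happens at the LAST run of whitespace, so spaces inside
 * the filename are kept. Returns false on a malformed line. */
bool parse_sort_line(char *line, char **name_out, long *priority_out)
{
    assert(line != NULL && name_out != NULL && priority_out != NULL);

    /* Drop the trailing newline and any other trailing whitespace. */
    size_t len = strlen(line);
    while (len > 0 && isspace((unsigned char)line[len - 1]))
        line[--len] = '\0';

    /* Remember the last whitespace character left in the line. */
    char *sep = NULL;
    for (char *p = line; *p != '\0'; p++)
        if (isspace((unsigned char)*p))
            sep = p;
    if (sep == NULL)
        return false; /* no separator, so no priority field */

    /* Everything after the separator must be a complete integer. */
    char *end;
    long priority = strtol(sep + 1, &end, 10);
    if (end == sep + 1 || *end != '\0')
        return false;

    /* Cut the line where the filename ends. */
    while (sep > line && isspace((unsigned char)sep[-1]))
        sep--;
    *sep = '\0';
    if (*line == '\0')
        return false; /* empty filename */

    *name_out = line;
    *priority_out = priority;
    return true;
}
```

With this split, the line `my file.txt 10` yields the name `my file.txt` and the priority 10, while a line with no priority field is rejected.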

Btrfs supports compression, and a "seed" flag makes it read-only. If you 'btrfs device add' a rw second device, it can be read-write with all writes directed to the second device, e.g. /dev/zram device for volatile overlay for e.g. a LiveOS boot, or you can add a blank partition and then remove the seed which causes the seed data to be replicated to the rw partition. Plus all metadata and data are checksummed.

zstd support since linux 4.14, and a mount time option for compression level since linux 5.1. So you could get very good compression ratios equivalent to squashfs, but squashfs will still come out slightly ahead because it also compresses its own metadata, where Btrfs doesn't.

My best guess is squashfs is a sufficiently successful project that it's allowed most distributions that depend on it (quite a few) to rest on their laurels.

Looks like he might (?) be coming back: https://github.com/plougher/squashfs-tools/issues/54#issueco... although the release he's talking about didn't happen.

The last commit that wasn't an API refactoring in the kernel tree is a3f94cb99a854fa381fe7fadd97c4f61633717a5, which is from Aug 2, 2018.

It is not without a maintainer. But you make a good point that the companies that use it should sponsor it. I still have to maintain it for free in my spare time, and spare time gets less and less each year.

In fact it wouldn't surprise me if the parent commenter is one of the people working for a multi-billion dollar company that got offended when I told them I wouldn't do some work for them, for free. Hence the suspicion there is an axe to grind.

squashfs has horrible performance. All requests to the block layer are 512 Bytes. Other filesystems like ext4 make much bigger requests and perform much better in the end despite the compression of squashfs leading to lower overall data volume.

Disclaimer: Measured 2 years ago on ARM32, emmc, with a 4.1(?) kernel.

Try using the SQUASHFS_4K_DEVBLK_SIZE config option next time

By default Squashfs sets the dev block size (sb_min_blocksize) to 1K or the smallest block size supported by the block device (if larger). This, because blocks are packed together and unaligned in Squashfs, should reduce latency.

This, however, gives poor performance on MTD NAND devices where the optimal I/O size is 4K (even though the devices can support smaller block sizes).

Using a 4K device block size may also improve overall I/O performance for some file access patterns (e.g. sequential accesses of files in filesystem order) on all media.

Setting this option will force Squashfs to use a 4K device block size by default.

If unsure, say N.

I'm quite sure I tried all the options available in the kernel I used back then without achieving performance comparable to ext4. My project manager was convinced that squashfs makes things faster (and so I hoped initially, because the overall data volume is smaller), so I had a hard time convincing him that we should just drop that "optimization" from the project plan. (He was one of those who prefer checkmarks over technical merit.) I don't remember the 4K option for sure, but if it existed, we tried and measured it. What is the size ext4 is reading from the block device? I'm reading this on holiday on my phone, so I cannot easily fire up blktrace. But I would guess it's 128K or even 256K. So still far from 4K.

eMMC is not an MTD device. I measured only on eMMC.

A big part of the problem might be xz decompression, that's been my discovery anyway.


I tried various compression algorithms. I don't think there was a CPU bottleneck, at least not with the less aggressive compressions. blktrace showed the difference compared to ext4: squashfs does all reads one block at a time.

It was in a previous job. I don't have access to the details anymore. And the kernel was not the newest. But squashfs looked unmaintained already then and that's what they're saying elsewhere in the discussion. So I fear nothing has changed.

I'm the author and maintainer of Squashfs and I can assure you that Squashfs is not unmaintained. Over the last couple of years Squashfs has been stable, without requests for new features, and so work has mostly been limited to security improvements and bug fixes. But there is a big difference between that and claiming Squashfs is unmaintained.

In fact I don't know why you're claiming it is unmaintained. Got an axe to grind perhaps?

I'm not claiming it's unmaintained. I'm saying that the maintainer is not active. You yourself mention elsewhere in this thread that you have little free time to maintain squashfs, so that seems fair.

No axe to grind; it's a cool project and that's why we decided to use it. Thanks for your work on it, and we'll send patches if/when we have them :)

It irks you because storing source code directly in the final binary is bonkers, but unfortunately it is the only way to reveal all the complexity that can be exposed in a C header file.

This isn't really just a C problem though. Even Rust has a similar problem with propagating macros. The reason for this limitation is really that making an ABI for metaprograms (macros) is exceedingly difficult.

Are there other languages that provide interfaces as cleanly as exposing C header files?

java interfaces.

I think the concept of coding to an interface (or an API specification) goes beyond what Java did. In my professional life, I constantly see people downplaying and ignoring the importance of a formal and stable API specification, only to suffer the consequences later.

If you just give me a .jar do I get those interfaces in a consumable form?

Said differently, can I download your .jar and write my own code to interface with it while on a desert island without any other resources?


If you have a random JAR file, an IDE can introspect it to see what classes are in it and what methods are in those classes.

The only exception is if you run it through some kind of obfuscator first.

Yes but JAR files are basically equivalent to source code parsed and serialised as bytecode with only the most rudimentary optimisations applied. I don't really see a difference between that and carrying gzipped headers with the kernel. Yes, the .class format is cleaner, but it's still very close to what the programmer wrote. Another plus to carrying the header with the kernel is that you can carry the preprocessed header to make sure the user doesn't set the wrong defines, and you can add compiler checks to check that the user isn't using a too old or too new compiler with incompatible ABI.

I suppose the real difference is that with the Java platform, you get

1. 1/10 or 1/100 of the compile time

2. full introspection of the interfaces (the topic of this discussion). The implementation classes could be obfuscated, but the point of an interface (and an API in general) is that it is not.

3. the possibility of having multiple versions of the interface present on the same machine at the same time. For example, when SomeInterfaceV1 is replaced (or extended) by SomeInterfaceV2, it might be possible to provide an adapter that publishes the older interface for backward compatibility.

these are just random thoughts, because the difference between current systems written in c/c++ and a new architecture that uses interface-based approach is too great, I don't think I can name a real system that uses it in practice.

done the API versioning myself several times with complex applications though.

Doesn't necessarily work for the kernel, but GIR and typelibs provide machine-readable descriptions of C APIs: https://github.com/GNOME/gobject-introspection

From what I know about it, gobject-introspection has some nice properties, but one killer drawback: it's incompatible with cross compilation. This is apparently because it requires running binaries compiled for the target system as part of the build process. You actually can cross compile if you have an emulator for that system handy [1], but that's horrible.

Apparently the reason it requires running the compiled binaries is that GObject types are only registered at runtime, within so-called "_get_type" functions. For more typical systems, everything needed can be determined at compile time. Too bad there's no portable way to ask a C compiler to dump what it knows about a source file, but if you just want things like struct sizes and field offsets, you can compile a C file that embeds them as global variables, and then extract the variable values. For more advanced introspection there are many less-portable options including Clang's API, GCC-XML, parsing debug info, or writing your own compiler (easier for C; it seems that parts of gobject-introspection work like this).
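
A minimal sketch of the "embed them as global variables" approach mentioned above. The struct and the variable names are made up for illustration; a real use would include the header being inspected instead of defining a toy struct.

```c
#include <assert.h>
#include <stddef.h>

/* Toy struct standing in for a type from the header being inspected. */
struct example_record {
    char  tag;
    long  value;
    short flags;
};

/* The first member of a struct always sits at offset 0. */
static_assert(offsetof(struct example_record, tag) == 0,
              "first member is at offset 0");

/* Layout facts stored as ordinary initialized globals, so the values
 * end up in the compiled object file's data and can be read back from
 * there without running anything on the target. */
const size_t layout_sizeof_example_record = sizeof(struct example_record);
const size_t layout_offsetof_value = offsetof(struct example_record, value);
const size_t layout_offsetof_flags = offsetof(struct example_record, flags);
```

The file only needs to be compiled, not linked or run, which is what keeps this compatible with cross compilation. Reading the values back means looking up each symbol and its bytes in the object file with whatever binary tooling is available for the target.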

Anyway, another interesting comparison is DTrace's CTF (Compact Type Format) [2], a simple binary format that describes the kernel's C struct layouts, function signatures, etc. This information is simply converted from compiler-generated debug info [3], but it's stripped down enough that it can be embedded into every kernel without too much size overhead. When the DTrace compiler is invoked to compile a user hook, it parses the CTF data and exposes the types and functions to the user's code (which is written in a custom C-like language).

Ironically, BPF has BTF, which is a very similar-looking format that encodes very similar kinds of data – but is used for a completely different purpose. Specifically, it's only used to encode types and functions defined by BPF programs, to allow the kernel to pretty-print things. But in theory BTF could be repurposed to work like CTF: you would need to generate BTF information for the kernel itself, and then Clang could be extended to support "including" BTF files in place of C headers. However, this option was apparently discussed and rejected [4]. I haven't read the original threads to find out why, but I suspect it might involve:

- Lack of existing tooling to do the above;

- Lower expressivity compared to C headers, e.g. the inability to encode macros (although this could be fixed);

- Desire to use the information for building not just BPF hooks but also full-fledged kernel modules.

[1] https://maxice8.github.io/8-cross-the-gir/

[2] https://github.com/oracle/libdtrace-ctf/blob/master/include/...

[3] https://www.freebsd.org/cgi/man.cgi?query=ctfconvert&sektion...

[4] https://lwn.net/Articles/783832/

Yeah, I'd love to have something like BTF or CTF used widely for machine-readable type information. (https://facebookmicrosites.github.io/bpf/blog/2018/11/14/btf... gives some further information there.)

The limitations regarding macros sound like the biggest issue to me (both code-like macros and just simple defined names for values via #define). I'd love to see solutions for that. What do you think that would look like?

Interesting writeup! I didn't realize there was an active attempt to generate BTF for the kernel.

Regarding macros...

Well, to start with, there's the brute-force approach of simply embedding textual macro definitions. That might be good enough for most use cases in practice: as far as I know, most BPF hooks are written in either C or the C-like bpftrace language, so expanding macros as text would probably give a sensible result for the majority of macros that aren't particularly complex. And macro definitions are already included in the DWARF info, so the DWARF-to-BTF approach from your link could be easily extended to embed them.

But it would be nice to describe macros in a more structured format, which could allow use from non-C-like languages and would probably save on file size. Some prior art I'm familiar with is rust-bindgen, which generates Rust bindings for C headers using libclang, and supports translating C macros that expand to constants. Basically it checks each macro that's defined without arguments and uses libclang to try to evaluate it as a C constant expression; this will fail for macros that expand to things other than constant expressions, but it just ignores those. If evaluation succeeds, it translates the macro to a typed Rust constant declaration.

It might be possible to do something similar for BTF. As output format, either add a new 'constant integer' node, or translate such macros as if they were enum definitions. For Linux it would probably be best to avoid a dependency on libclang, but a custom parser might work, or maybe a hackier approach based on feeding things to the C compiler like:

    enum { value_of_SOME_MACRO = ((((((((( SOME_MACRO ))))))))) };
and sorting through the resulting morass of compiler errors :)
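
A minimal sketch of the enum trick above for the case where it succeeds. SOME_MACRO is a stand-in name and its value here is made up.

```c
#include <assert.h>

/* Stand-in macro; the value is made up for illustration. */
#define SOME_MACRO (1 << 4)

/* An enumerator's value must be an integer constant expression, so
 * this declaration compiles only if the macro expands to one. */
enum { value_of_SOME_MACRO = (SOME_MACRO) };

/* The compiler has now evaluated the macro to a plain integer. */
static_assert(value_of_SOME_MACRO == 16, "macro evaluated at compile time");

/* A macro that expands to something else, such as a function call or
 * a statement, would make the same enum declaration fail to compile,
 * which is the signal to skip that macro. */
```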

Edit: Forgot to mention – functional macros would be nice too, but of course they're much harder to translate. And heck, what about inline functions? Convert them to BPF?

> there's the brute-force approach of simply embedding textual macro definitions. That might be good enough for most use cases in practice

I very much want this for usage from Rust, so that doesn't suffice.

> It might be possible to do something similar for BTF. As output format, either add a new 'constant integer' node

That sounds promising to me, for the common case.

> Edit: Forgot to mention – functional macros would be nice too, but of course they're much harder to translate. And heck, what about inline functions? Convert them to BPF?

In an ideal world, 1) emit a symbol for them so they can be used from any language, albeit not "inline", and 2) compile them to bytecode that LTO can incorporate and optimize, for languages using the same linker.

Neither of those would work for macros designed for especially unusual usages that can't possibly work as functions. (The two most common cases I can think of: macros that accept names and use them as lvalues, and macros that emit partial syntax such as paired macros emitting unmatched braces.) But honestly, flagging those and handling all the common cases via BTF information would still be a huge improvement.

Perhaps we should continue this on an IRLO thread?

Debug symbols, which can be resolved at load time. This would also make ebpf bytecode more or less kernel version independent.

Why /sys and not /proc? After all, the kernel binary itself is under /proc.

There is a very long thread on LKML discussing this. But I'm pretty sure the primary reason is that procfs has been slowly accumulating more and more random knobs, while sysfs actually has a structure that can be used reasonably by userspace.

The file started out in proc. There was a very long lkml discussion on the subject.

Too bad there's no mention of the new BTF file format, which is supposed to resolve this issue for the eBPF side:


Why can't they build the eBPF bytecode offline using the correct kernel API and ship the bytecode to the Android device?

The problem is that different kernel configurations end up changing the layout of important data structures. Basically, you need to change the offsets that the eBPF uses.
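
A toy illustration of the layout problem, not real kernel code: the struct, its fields, and the debug option are all made up. It only shows why an offset baked into precompiled bytecode can be wrong for a differently configured build.

```c
#include <assert.h>
#include <stddef.h>

/* Made-up struct as built with a hypothetical debug option disabled. */
struct toy_task_plain {
    int  pid;
    long start_time;
};

/* The same struct with the option enabled: the extra field pushes
 * start_time to a later offset. */
struct toy_task_debug {
    int  pid;
    long debug_cookie;
    long start_time;
};

static_assert(offsetof(struct toy_task_debug, start_time) >
              offsetof(struct toy_task_plain, start_time),
              "enabling the option moves start_time");

/* Offset that code reading start_time would have to use for each build. */
size_t toy_start_time_offset(int debug_enabled)
{
    return debug_enabled ? offsetof(struct toy_task_debug, start_time)
                         : offsetof(struct toy_task_plain, start_time);
}
```

Bytecode compiled against one variant and loaded on the other would read the wrong bytes, which is why the offsets have to come from headers (or type information) that match the running kernel.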

There is another proposal that uses BTF to do the fixup at load (not compile) time, but that ends up having its own complications.

If it’s anything like GPU shaders, it’s because the bytecode format itself is Turing-complete and not-so-sandboxed, and the actual security/fault-tolerance comes from a static analysis pass done during compilation that ensures the source code being compiled isn’t doing anything crazy.

If you were able to load bytecode directly, you’d skip this verification step.

(I’ve always thought VM runtimes should have signing keys, sign build artifacts—e.g. bytecode—as they create them, and then have the VM’s module loader check the signature. This way, you could still rely on build-time static verification for your security, while also being able to share compiled artifacts among any set of runtimes that trust one-another’s signing keys.)

I think Singularity did at least a piece of what you're describing, signing the Bartok compiler output after install time and post verification so that it could be saved to disk and retrieved later without the compilation and verification steps.

eBPF is not Turing complete, and from my understanding is made to be sandboxed.

eBPF is specifically not Turing-complete because you can not prove any property of Turing-complete language code.

The bytecode format of eBPF is Turing complete; it's the verification pass that makes sure, for instance, that the control flow graph is a DAG.

cBPF was different, and only allowed you to jump forward (guaranteeing termination), but eBPF is a much more general VM.
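
A minimal sketch of why forward-only jumps guarantee termination, using a toy instruction format rather than the real BPF encoding. If every jump lands strictly after the jumping instruction, the program counter only ever increases, so a program of n instructions executes at most n of them.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy instruction; this is not the real BPF instruction layout. */
struct toy_insn {
    bool is_jump;
    int  jump_offset; /* relative to the instruction after this one */
};

/* Returns true if every jump goes forward and lands inside the program
 * or exactly at its end (index n, treated here as "exit"). */
bool toy_all_jumps_forward(const struct toy_insn *prog, size_t n)
{
    assert(prog != NULL || n == 0);

    for (size_t i = 0; i < n; i++) {
        if (!prog[i].is_jump)
            continue;
        if (prog[i].jump_offset < 0)
            return false; /* backward jump: a possible loop */
        /* The target is i + 1 + jump_offset and must not exceed n. */
        if ((size_t)prog[i].jump_offset > n - i - 1)
            return false; /* jump past the end of the program */
    }
    return true;
}
```

A format with unsigned jump offsets gets the first check for free, since a backward jump cannot even be encoded. A more general VM that permits backward jumps needs a separate analysis pass, like the control flow check mentioned above, to rule out loops.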

eBPF is not Turing-complete, but static analysis can absolutely prove things about individual programs in Turing-complete languages.

You can prove properties of some code in Turing-complete languages.

You can construct non-Turing-complete languages in which all code can be proven, and that is the point the parent is making.

Forbidding loops in a language does make all code written in that language provably terminating. But there’s not much else it lets you prove for all code, at least not if you want to get results before the heat death of the universe. For example, you can’t determine what inputs to a program will yield a given output: even classic BPF is probably expressive enough to implement a cryptographic hash, and eBPF definitely is. You can’t enumerate all possible paths through the program: the number of paths is exponential in the size of the program.

On the other hand, forbidding loops does make some properties easier to prove for restricted classes of programs. For instance, Linux’s BPF verifier tracks, for each instruction, the minimum and maximum value of each register at that point in the program. It uses that to determine whether array accesses in the program are bounds checked, and complain if not: that way, it doesn’t have to insert its own bounds check, which might be redundant. Doing the same in the presence of loops would require a more expensive algorithm, so forbidding them is a benefit. Yet... Linux’s verifier is sound: it will forbid all programs that could possibly index out of bounds, at least barring bugs. But it is not fully precise: it does not pass all programs that have the property of never indexing out of bounds for any input. For example, you could have a program that takes a 256-bit input and indexes out of bounds only if its SHA-256 hash is a specific value. That program is safe iff there happen to be no 256-bit strings that hash to that value, something that you could theoretically verify – but only by going through all 2^256 possible strings and hashing each of them. Linux does not.

Nor would it be reasonable to. Why does that hypothetical matter? Because the ability to prove arbitrary properties about all programs, at the cost of arbitrarily long analysis time, is sort of the main mathematical benefit to non-Turing-completeness. But from a practical standpoint that’s useless. And if you don’t need that – if you only care about the ability to prove things about restricted classes of programs – well, you can achieve that even with loops. After all, that’s what a type system does, and there are plenty of Turing-complete languages with type systems. As I said, disallowing loops makes the job easier in some cases. But that’s more of a matter of degree, not the kind of hard theoretical barrier that “non-Turing-complete = analyzable” makes it sound like. That makes it a less convincing rationale for disallowing them.
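
A toy sketch of the min/max tracking described above, to make the "sound but not fully precise" point concrete. All names are hypothetical and this is far simpler than the real Linux verifier: it tracks one unsigned range per value and accepts an array access only when the largest possible index is in bounds.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy stand-in for the per-register bounds a verifier might track. */
struct toy_range {
    uint64_t min;
    uint64_t max;
};

/* A register about which nothing is known yet. */
struct toy_range toy_unknown(void)
{
    struct toy_range r = { 0, UINT64_MAX };
    return r;
}

/* Range on the branch where `reg < limit` was found true.
 * Toy simplification: an unreachable branch (min greater than the
 * new max) is not detected. */
struct toy_range toy_assume_less_than(struct toy_range r, uint64_t limit)
{
    assert(limit > 0); /* `reg < 0` never holds for an unsigned value */
    if (r.max > limit - 1)
        r.max = limit - 1;
    return r;
}

/* Range after `reg &= mask`: the result never exceeds either operand. */
struct toy_range toy_and_const(struct toy_range r, uint64_t mask)
{
    struct toy_range out = { 0, r.max < mask ? r.max : mask };
    return out;
}

/* An access is accepted only if every possible index is in bounds. */
bool toy_access_is_safe(struct toy_range index, uint64_t array_len)
{
    return index.max < array_len;
}
```

An index that starts as `toy_unknown()` is rejected for a 16-element array and accepted after `toy_assume_less_than(index, 16)`. The check is conservative in the same way as described above: a program whose index happens to stay in bounds for reasons the ranges cannot express is still rejected.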

Of course one can make proofs of programs written in Turing complete languages.

Lol Joel (the author) and I discussed that this week. Sounds like bpftools will need some modifications to reload a precompiled bpf program. We'd like to avoid having another copy of LLVM on device (renderscript has an old version of LLVM).

What's wrong with shipping headers and using DKMS for building like in GNU/Linux?

> embed the kernel headers within the kernel image itself and make it available through the sysfs virtual filesystem (usually mounted at /sys) as a compressed archive file (/sys/kernel/kheaders.tar.xz)


"Please don't post shallow dismissals, especially of other people's work. A good critical comment teaches us something."


Why tarred and compressed though? If it wasn't, one could point a compiler directly at it.

It goes into that, at length, in the article.

Why eww?

File this one under: kludges to get around openly user-hostile userland.

How so?

Seems like an elegant enough solution to ensure you always have the right headers to build modules against the currently running kernel.

An elegant solution would be not having that problem in the first place.

The article goes into detail as to why this actually can't be used to build modules against the currently running kernel.

Isn't kernel header availability a solved problem on any linux system that isn't busy pretending not to be built on linux?

The nice thing about this is that you can keep kernel headers around in a much smaller format on-disk, while still making them available separately as a package if you want.

A few problems this solves:

1. Neither the user nor the tool knows how to install the headers for this specific version of the kernel.

2. Manually installing a header package on e.g. Ubuntu marks it as manually selected, meaning it doesn't get cleaned up with an 'autoremove'; selecting a generic kernel headers package (like linux-headers-4.14-generic or something) means that every time your system auto-updates the kernel package version (which happens by default on, for example, AWS IIRC, and happens a lot) you end up with yet another copy of the kernel headers, and then you blow your I/O budget uninstalling them.

3. It removes one extra step from running, e.g., an eBPF program, or building, e.g., an out-of-tree kernel module, so make scripts can now start to take advantage of that. For example:

    # Get headers
    if modprobe kheaders; then
      # archive path as quoted from the article
      <unpack /sys/kernel/kheaders.tar.xz>
    else
      <scan through /usr/src, /usr/local/src, etc. for what looks like the right version, or fail>
    fi

Did you read the article? Or even the title? How is userland involved in any of this?

Shipping the kernel headers is complicated, but somehow a whole C compiler is not? Or doesn't BPF compilation need a C compiler?

BPF is compiled in user-space and the bytecode is provided to the kernel -- the compiler definitely isn't shipped as part of the kernel.
