Hunting a Linux kernel bug (blog.twitter.com)
155 points by luu 5 months ago | 13 comments

Very good in-depth analysis and low-level bug hunting. It's always interesting to see blog posts that start from a bug found in a userland application and end up digging into the internals of the Linux kernel. The technical background in this post really helps set up the context for understanding the data structures and concepts involved in fixing this bug.

> There are a number of places where the packet could be dropped, too many to inspect every possible line of Linux kernel code. To find the offending code, we turned on all routing trace points via /sys/kernel/debug/tracing/events/fib/enable.

I wonder whether this is just one of many sets of trace points, with similar ones available for other components or subsystems in the kernel, if that makes sense? That would be very interesting to know.
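For anyone curious about the same question, here is a minimal Python sketch that lists which subsystems expose an `enable` switch next to the `fib` one quoted above. It assumes tracefs is mounted at /sys/kernel/debug/tracing (the path from the article) and that you have permission to read it; the function names `list_event_subsystems` and `set_subsystem_enabled` are made up for illustration.

```python
from pathlib import Path

# Path taken from the article's quote; this is an assumption about your system,
# since tracefs can be mounted elsewhere.
TRACEFS_EVENTS = Path("/sys/kernel/debug/tracing/events")


def list_event_subsystems(events_dir=TRACEFS_EVENTS):
    """Return the names of event subsystems that have an `enable` file."""
    events_dir = Path(events_dir)
    return sorted(
        entry.name
        for entry in events_dir.iterdir()
        if entry.is_dir() and (entry / "enable").exists()
    )


def set_subsystem_enabled(subsystem, enabled, events_dir=TRACEFS_EVENTS):
    """Write 1 or 0 to <events_dir>/<subsystem>/enable (needs root on a real system)."""
    enable_file = Path(events_dir) / subsystem / "enable"
    enable_file.write_text("1" if enabled else "0")
    return enable_file


if __name__ == "__main__":
    try:
        for name in list_event_subsystems():
            print(name)
    except (FileNotFoundError, PermissionError) as exc:
        print(f"cannot read {TRACEFS_EVENTS}: {exc}")
```

Both functions take the directory as a parameter, so you can try them against a scratch directory before pointing them at the real tracefs.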

It's also great to see that the fix was upstreamed as an actual contribution, as linked in the article. [0]


[0] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/lin...

Routing and filtering are areas of the Linux kernel replete with bugs. When they added iptables coverage to syzkaller, it filed a bunch of issues in the first day. It's still uncovering them, for example https://syzkaller.appspot.com/bug?id=2a96216b5facf781474795f...

A lot of people I've worked with treat the kernel as if it were the compiler: just as "it's never the compiler," it's also "never the kernel." For me, I've seen so many bugs in Linux that it's the first place I look when anything suspicious starts happening.

I didn't know Dan Luu was at Twitter. And I didn't know Twitter had an engineering blog. Good on them both though, this is a great read.

Linux kernel bugs seem to have the greatest required level of expertise to debug. I've been programming for a few years, and kernel debugging is definitely not in my toolkit, and I know very few people who would know how. Currently, I'm facing a wifi kernel bug that keeps spitting out call traces to the system log, that I have absolutely no idea what to do with.

Wifi is even worse than, e.g., the bug in the article, because often it's some registers that are off, or some firmware bug, or even a chip bug - so you are basically helpless without intimate knowledge of the chip in question (available only under NDA, if at all). Much respect to anyone hacking on wifi (or anything else) in the kernel!

And if you're lucky, then the wifi chip will helpfully corrupt your RAM with its terrible DMA mismanagement in a way that allows you to notice that the memory you were expecting to contain your own values instead contains the ESSID of the hospital across the road from you: https://mjg59.dreamwidth.org/11235.html

I remember back in 2005 or so Intel was dicking around with ipw3945, shipping a userspace regulatory blob. Linux kernel devs didn't like this and, I guess, wouldn't merge the driver. Intel rewrote the driver into iwlwifi (iwl3945), which just loaded microcode out of the firmware directory.

However, the NDA compartmentalization issue would cause bugs. For instance, I guess the wifi microcode devs implemented hardware queuing but the Intel kernel driver devs at the time didn't know this, so the driver devs just wrote their own queues in software, and the users dealt with weird errors for some time because the driver devs literally couldn't ask the division that wrote the microcode how the hell it worked. I'm amazed anything with wifi works. For a time around then I just threw up my hands and used OpenBSD's reverse-engineered driver because it was more reliable and simpler to understand.

"Firmware" in a lot of cases is a full-blown proprietary black box operating system of its own.

I once tried to chase a bug in a wifi driver on NetBSD and soon gave up, as I was just completely lost. I also remember buying a book on driver development, hoping there would be some decent info on the 802.11 stack, and was mildly amused to find a brief mention at the end of a chapter on Ethernet saying "sorry wifi is too complicated to cover".

Since that and about half a dozen other wifi kernel issues, I've been kinda fascinated by wifi and wifi stacks in the kernel, but it appears to be some black magic power not available to the masses.

For what it's worth, a counterpoint from an ex-kernel maintainer who now does web stuff and distributed systems: kernel programming was usually easier for me than your standard full-stack web developer stuff. The API surface of a web stack is so large -- juggling things like a JS framework, CSS, nuances of HTTP sessions, web security, databases, load balancers, etc.

Most single-machine kernel bugs are going to be easier to figure out, once you learn some kernel APIs, of which there aren't so many.

> Linux kernel bugs seem to have the greatest required level of expertise to debug.

Some of it yeah. Honestly, I spent a lot of time looking at it and figured out how to fix/tweak minor crap.

The main thing I've been working on, on and off over the years, is CPU scheduler issues. Back when I was a high schooler my music kept skipping whenever my desktop box came under any substantial load. Turns out desktop workloads are way different from the sorts of workloads most of the devs are paid to work on. The situation's improved but honestly still sucks with the stock scheduler.

> Currently, I'm facing a wifi kernel bug that keeps spitting out call traces to the system log, that I have absolutely no idea what to do with.

iwlwifi? If you're on 5.6.0, it shipped broken. I think the fix landed in 5.6.1, so anything newer should work.

A lot of the kernel bugs are pretty mundane.

Traces are nice. At least you have some pointer where to start with.

1) You can google for the functions in the trace (to see if someone had a similar issue and it's solved already). You can add site:lkml.org to narrow it down to the linux kernel mailing list.

2) You can go to http://lxr.free-electrons.com/ and search for the functions in the trace and look around the code to see what the issue might be. The trace will have some more info about the kind of the issue (WARN, NULL pointer dereference, etc.)

And from there your options branch a lot... :)
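As a rough illustration of step 1, here is a small Python sketch that pulls function names out of a call trace and builds a search URL restricted to lkml.org. The regex assumes the usual `symbol+0xOFFSET/0xSIZE` form of trace lines, the sample trace and its `example_*` symbols are invented for the demo, and the Google URL format is simply the common `?q=` query form; treat all of it as a starting point rather than a finished tool.

```python
import re
from urllib.parse import quote_plus

# Matches entries such as "some_function+0x1a/0x2b0" (assumed trace line format).
SYMBOL_RE = re.compile(r"([A-Za-z_][\w.]*)\+0x[0-9a-f]+/0x[0-9a-f]+")

# Invented sample; the symbols and module name are hypothetical.
SAMPLE_TRACE = """\
[  102.334455] WARNING: CPU: 1 PID: 42 at drivers/example/rx.c:123 example_rx_handler+0x1a/0x2b0 [example_wifi]
[  102.334460] Call Trace:
[  102.334462]  ? example_rx_handler+0x1a/0x2b0 [example_wifi]
[  102.334465]  example_irq_thread+0x4c/0x90 [example_wifi]
[  102.334468]  example_dispatch.isra.0+0x20/0x60
"""


def functions_in_trace(trace_text):
    """Return the function names in the trace, in order, without duplicates."""
    seen = []
    for match in SYMBOL_RE.finditer(trace_text):
        name = match.group(1)
        if name not in seen:
            seen.append(name)
    return seen


def lkml_search_url(function_name):
    """Build a search URL for the function name, narrowed to lkml.org."""
    query = f'"{function_name}" site:lkml.org'
    return "https://www.google.com/search?q=" + quote_plus(query)


if __name__ == "__main__":
    for name in functions_in_trace(SAMPLE_TRACE):
        print(name, lkml_search_url(name))
```

Paste your own trace from the system log in place of `SAMPLE_TRACE`; the functions nearest the top of the trace are usually the most useful ones to search for first.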

Congrats to the team for tracking this down, it was a great write-up!

Twitter has some great networking/kernel engineers. When I was working at Twitter a few years back we isolated and fixed another insidious kernel bug; a large group was critical to making it happen (including Cong, who worked on this bug): https://tech.vijayp.ca/linux-kernel-bug-delivers-corrupt-tcp...

I'm always shocked at how the kernel seems to mostly work, with such meagre test coverage. I guess testing in production does kind of work at scale?
