
Hunting a Linux kernel bug - luu
https://blog.twitter.com/engineering/en_us/topics/open-source/2020/hunting-a-linux-kernel-bug.html
======
rvz
Very good in-depth analysis and low-level bug hunting. It's always interesting
to see blog posts that start from a bug found in a userland application and
end up digging into the internals of the Linux kernel. The technical
background in this post really helps set up the context needed to understand
the data structures and concepts involved in fixing this bug.

> There are a number of places where the packet could be dropped, too many to
> inspect every possible line of Linux kernel code. To find the offending
> code, we turned on all routing trace points via
> /sys/kernel/debug/tracing/events/fib/enable.

I wonder if this is just one of many sets of trace points, and whether similar
ones exist for other components or subsystems in the kernel, if that makes
sense? That would be very interesting to know.
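
For what it's worth, the events directory from the quote has one subdirectory
per subsystem, each with its own `enable` file, so one way to explore this is
to list them. Below is a minimal Python sketch, not anything from the article:
the function names are made up for illustration, and it assumes tracefs is
reachable at the path quoted above (it may also be mounted at
/sys/kernel/tracing) and that you are running as root.

```python
import os

# Path taken from the article's quote. Assumption: tracefs is mounted here;
# on some systems it is also available at /sys/kernel/tracing/events.
EVENTS_ROOT = "/sys/kernel/debug/tracing/events"


def list_event_subsystems(events_root=EVENTS_ROOT):
    """Return sorted names of subdirectories that have their own 'enable' file."""
    subsystems = []
    for name in os.listdir(events_root):
        path = os.path.join(events_root, name)
        # Skip plain files at the top level and directories without a switch.
        if os.path.isdir(path) and os.path.isfile(os.path.join(path, "enable")):
            subsystems.append(name)
    return sorted(subsystems)


def enable_subsystem(name, events_root=EVENTS_ROOT):
    """Write '1' to <events_root>/<name>/enable, as the article did for 'fib'."""
    with open(os.path.join(events_root, name, "enable"), "w") as f:
        f.write("1")


if __name__ == "__main__":
    # Reading tracefs usually requires root privileges.
    for subsystem in list_event_subsystems():
        print(subsystem)
```

Calling `enable_subsystem("fib")` would be the equivalent of what the quoted
passage describes; the other names printed are candidates for the same
treatment on whatever kernel you happen to be running.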

It's also great to see that the fix was upstreamed as an actual contribution,
as linked in the article. [0]

Impressive!

[0]
[https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/lin...](https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=66f8209547cc11d8e139d45cb7c937c1bbcce182)

------
jeffbee
Routing and filtering are areas of the Linux kernel replete with bugs. When
they added iptables coverage to syzkaller, it filed a bunch of issues in the
first day. It's still uncovering them, for example
[https://syzkaller.appspot.com/bug?id=2a96216b5facf781474795f...](https://syzkaller.appspot.com/bug?id=2a96216b5facf781474795fb518880e9ce87363b)

A lot of people I've worked with treat the kernel as if it were the compiler,
saying that, just as it's never the compiler, it's also never the kernel. For
me, I've seen so many bugs in Linux that it's the first place I look when
anything suspicious starts happening.

------
eatonphil
I didn't know Dan Luu was at Twitter. And I didn't know Twitter had an
engineering blog. Good on them both though, this is a great read.

------
rubatuga
Linux kernel bugs seem to have the greatest required level of expertise to
debug. I've been programming for a few years, and kernel debugging is
definitely not in my toolkit, and I know very few people who would know how.
Currently, I'm facing a wifi kernel bug that keeps spitting out call traces to
the system log, and I have absolutely no idea what to do with them.

~~~
nisa
Wifi is even worse than e.g. the bug in the article, because often it's some
registers that are off, or some firmware bug, or even a chip bug. So you are
basically helpless without intimate knowledge of the chip in question (which
is available only under NDA, if at all). Much respect to anyone hacking on
wifi (or anything else) in the kernel!

~~~
cjbprime
And if you're _lucky_, then the wifi chip will helpfully corrupt your RAM
with its terrible DMA mismanagement in a way that allows you to notice that
the memory you were expecting to contain your own values instead contains the
ESSID of the hospital across the road from you:
[https://mjg59.dreamwidth.org/11235.html](https://mjg59.dreamwidth.org/11235.html)

------
vijayp
Congrats to the team for tracking this down. It was a great write-up!

Twitter has some great networking/kernel engineers. When I was working at
Twitter a few years back we isolated and fixed another insidious kernel bug; a
large group was critical to making it happen (including Cong, who worked on
this bug):
[https://tech.vijayp.ca/linux-kernel-bug-delivers-corrupt-tcp...](https://tech.vijayp.ca/linux-kernel-bug-delivers-corrupt-tcp-ip-data-to-mesos-kubernetes-docker-containers-4986f88f7a19?gi=ef00c52ee0c2)

I'm always shocked at how the kernel seems to mostly work, with such meagre
test coverage. I guess testing in production does kind of work at scale?

