> There are a number of places where the packet could be dropped, too many to inspect every possible line of Linux kernel code. To find the offending code, we turned on all routing trace points via /sys/kernel/debug/tracing/events/fib/enable.
I wonder if this is just one of many sets of trace points, with others covering other components or subsystems in the kernel? That would be very interesting to know.
It's also great to see that the fix was upstreamed as an actual contribution, as linked in the article.
A lot of people I've worked with treat the kernel as if it were the compiler: just like it's never the compiler, it's also never the kernel. For me, I've seen so many bugs in Linux that it's the first place I look when anything suspicious starts happening.
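On the question of other subsystems: the events directory quoted above has one subdirectory per subsystem, and each has its own `enable` file, so you can look for yourself. Here is a rough sketch in Python, assuming tracefs is reachable at the same debugfs path the article uses (on some systems it is mounted at /sys/kernel/tracing instead) and that you are root. The helper names are made up for illustration.

```python
from pathlib import Path

# Assumption: same debugfs location as the path quoted from the article.
# Some systems expose tracefs at /sys/kernel/tracing/events instead.
TRACING_EVENTS = Path("/sys/kernel/debug/tracing/events")


def list_subsystems(events_dir=TRACING_EVENTS):
    """Return the event subsystem names (one directory per subsystem)."""
    return sorted(p.name for p in Path(events_dir).iterdir() if p.is_dir())


def list_events(subsystem, events_dir=TRACING_EVENTS):
    """Return the individual trace points inside one subsystem."""
    subsystem_dir = Path(events_dir) / subsystem
    return sorted(p.name for p in subsystem_dir.iterdir() if p.is_dir())


def set_subsystem(subsystem, enabled, events_dir=TRACING_EVENTS):
    """Write 1 or 0 to <subsystem>/enable, like `echo 1 > .../fib/enable`."""
    enable_file = Path(events_dir) / subsystem / "enable"
    enable_file.write_text("1" if enabled else "0")
```

Calling `list_subsystems()` as root should show what your kernel offers, and `set_subsystem("fib", True)` would be the equivalent of what the article did. What you actually see depends on kernel version and config.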
However, the NDA compartmentalization issue would cause bugs. For instance, I guess the wifi microcode devs implemented hardware queuing but the Intel kernel driver devs at the time didn't know this, so the driver devs just wrote their own queues in software and the users dealt with weird errors for some time because they literally couldn't ask the division that wrote the microcode how the hell it worked. I'm amazed anything with wifi works. For a time around then I just threw up my hands and used OpenBSD's reverse-engineered driver because it was more reliable and simpler to understand.
Since that and about half a dozen other wifi kernel issues, I've been kinda fascinated by wifi and wifi stacks in the kernel, but it appears to be some black magic power not available to the masses.
Most single-machine kernel bugs are going to be easier to figure out, once you learn some kernel APIs, of which there aren't so many.
Some of it yeah. Honestly, I spent a lot of time looking at it and figured out how to fix/tweak minor crap.
Main thing I've been working on, on and off over the years, is CPU scheduler issues. Back when I was a high schooler my music kept skipping whenever my desktop box came under any substantial load. Turns out desktop workloads are way different from the sorts of workloads most of the devs are paid to work on. The situation's improved but honestly still sucks with the stock scheduler.
>Currently, I'm facing a wifi kernel bug that keeps spitting out call traces to the system log, that I have absolutely no idea what to do with.
iwlwifi? If you're on 5.6.0, it shipped broken. I think the fix landed in 5.6.1, so anything newer should work.
Traces are nice. At least you have a pointer to where to start.
1) You can google for the functions in the trace (to see if someone had a similar issue and it's solved already). You can add site:lkml.org to narrow it down to the Linux kernel mailing list.
2) You can go to http://lxr.free-electrons.com/ and search for the functions in the trace and look around the code to see what the issue might be. The trace will have some more info about the kind of issue (WARN, NULL pointer dereference, etc.)
And from there your options branch a lot... :)
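If the trace is long, step 1 can be scripted. This is a minimal sketch in Python, assuming the usual `function+0xOFFSET/0xSIZE` format for call trace entries in the system log. The helper names and the Google search URL prefix are my own choices for illustration, not something from a kernel tool.

```python
import re
from urllib.parse import quote_plus

# Matches entries such as "some_function+0x4c/0xd0"; names may contain dots
# (compiler-generated suffixes like ".isra.0").
SYMBOL_RE = re.compile(r"([A-Za-z_][\w.]*)\+0x[0-9a-f]+/0x[0-9a-f]+")


def functions_in_trace(trace_text):
    """Return the function names in a call trace, in order, without repeats."""
    seen = []
    for name in SYMBOL_RE.findall(trace_text):
        if name not in seen:
            seen.append(name)
    return seen


def lkml_query(function_name):
    """Build the search string with the site:lkml.org filter from step 1."""
    return f"{function_name} site:lkml.org"


def search_url(function_name):
    """Turn the query into a URL (assumes Google's /search?q= endpoint)."""
    return "https://www.google.com/search?q=" + quote_plus(lkml_query(function_name))
```

Paste the trace from your log into `functions_in_trace()` and feed each name to `search_url()`. The functions nearest the top of the trace are usually the most useful ones to search for first.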
Twitter has some great networking/kernel engineers. When I was working at Twitter a few years back we isolated and fixed another insidious kernel bug; a large group was critical to making it happen (including Cong, who worked on this bug): https://tech.vijayp.ca/linux-kernel-bug-delivers-corrupt-tcp...
I'm always shocked at how the kernel seems to mostly work, with such meagre test coverage. I guess testing in production does kind of work at scale?