
To me, the biggest part of this story is:

1. Over two years ago, this was apparently detected automatically by the syzkaller kernel fuzzer, and automatically reported on its public mailing list. [1]

2. Over a year and a half ago, it was apparently fixed in the upstream kernel. [2]

3. It was apparently never merged back to various "stable" kernels, leading to the recent CVE. [3]

So you might read that and think "Ok, probably a rare mistake"...

...but instead:

4. This is apparently a _super_ common sequence of events, with kernel vulnerabilities getting lost in the shuffle, or otherwise not backported to "stable" kernels for a variety of reasons, like the patch no longer applying cleanly.

Dmitry Vyukov (original author of the syzkaller fuzzer that found this two years ago) gave a very interesting talk a couple of weeks ago at the Linux Maintainers Summit on how frequently this happens, along with some discussion of how to change kernel dev processes to try to dramatically improve things:

slides: https://linuxplumbersconf.org/event/4/contributions/554/atta...

video: https://youtu.be/a2Nv-KJyqPk?t=5239


[1] https://twitter.com/dvyukov/status/1180195777680986113

[2] https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux...

[3] https://mobile.twitter.com/grsecurity/status/118005953923380...

The failure of the Linux core team to properly prioritize security is quite well known. A lot of people have poked the bear by bringing this up, also with specific real examples, gotten a tongue-lashing from the team, and moved on to other things.

I'm amazed the GRSecurity people have managed to do it for so long. Even if merging their stuff mainline legitimately wasn't practical, I've seen plenty of snark and dismissiveness from the Linux team towards them and others. And GRSEC does actively bring CVE fixes into their kernel patches all the time, and gets paid via sponsors to do so.

I'm sure going through old CVEs is a great way to find "zero days" and/or regressions after old patches. Or even just following the work GRSec does, there's probably plenty of stuff for a highly motivated company like NSO to exploit.

>The failures of the Linux core team to properly prioritize security

Why even post this, when it has nothing to do with the case GP & OP described? It's misleading at best.

The failure here is in the way Google has set up its Android development process. They keep a separate "stable" kernel, and manually select certain patches to backport to it. In the process they skip all kinds of patches - performance, features, and yes, security ones. Given that only selected patches are backported, the process is best described as insecure by default. It was Google's decision to favor a stable API over security here.

This is compounded by the fact that other Android phone vendors are pretty slow at releasing OS upgrades - and tend to stop releasing them altogether shortly after the phone is no longer manufactured.

The mainline kernel, as released by the Linux core team, is up to date with security fixes. Hold to account the people who decided to skip patches as a matter of course, resulting in the insecure-by-default process.

> The mainline kernel, as released by the Linux core team is up to date with security. Hold to account people that decided to skip patches as a matter of course

To me, the particulars of this exact case are not as interesting as the fact that the entire Linux patching and backporting of security issues seems _very_ fragile, with things frequently getting "lost" for mundane reasons, and a key part of the "why" it is so fragile is due to many of the core Linux development processes.

This particular CVE is apparently one small example that happened to catch some headlines out of _thousands_ of similar problems.

That talk linked above by Dmitry Vyukov is worthwhile for getting a sense of the magnitude of the problem.

I think there is a lot to be said about the shortcomings of the Android world in how Linux kernel updates are trickling down the whole foodchain (or rather: are not). BUT: this particular episode is a very bad choice for your argument, because in this case the bugfix in question was never ported back to any regular Linux LTS version, while it actually was cherry-picked for the Android Common Kernel (branches 3.18, 4.4 and 4.9).

Now why that fix never made it into most vendor kernels (besides a few, like the one in the Pixel 3, which is based on 4.9) is a good question. But at the same time there is the reality that everyone focusing on the upstream LTS kernel would never have gotten the fix.

> The failure here is in the way Google has set its Android development process.

I think it's still the wrong layer. Google may be able to put some pressure to change things, but it's more accurate to describe the issue as "the failure is in the way SoC manufacturers have set up their kernel porting process". You often get chips which work with one kernel version and a dump of specific drivers. Beyond pressuring the company to upstream their changes, or writing clean-room versions, I don't see many solutions.

> failures of the Linux core team

But isn't the issue that Android didn't merge it? Linux patched it.

GRSec is the primary reason the Android devices from BlackBerry have never, to my knowledge, been rooted (despite their many flaws). It's crazy that it's not more accepted.

I personally wouldn't trust a company that openly bragged it built a system to provide local police and intelligence agencies with real-time access to BlackBerry messaging flowing across an entire city for the 2010 G20, in addition to sharing their "master" encryption key for a number of years:


Also, AFAIK BlackBerry only shipped a hardened kernel with a single device, the Priv, in 2015. I haven't heard anything from them since... maybe someone could correct me here.

The new Android devices also have hardened kernels, but it doesn't really matter when phones are insecure as fuck in other ways.

Indeed. Who the hell thought it was a great idea for the modem baseband device to have unlimited direct memory access to the host processor memory space? I mean, especially when the baseband firmware can usually be remotely updated by the network with zero user interaction?!

> have unlimited direct memory access to the host processor memory space

Can you give some reference for that claim?


If you want more, literally google “baseband attack host processor memory” or “baseband exploits DMA” or “baseband exploits memory”.

Is this all Android? Or just Blackberry?

There is no 'Linux core team'

There is one, but they have nothing to do with this.

This falls pretty much on Google and people in charge of the backports

Sure there is, the ones with commit rights to validate pull requests.

That is too large to be considered 'core' and too unstructured to be considered a 'team'.

Hence these issues, arguably.

If you submitted a huge patch to Linux that would really improve support for real-time audio, but break many other things like IO throughput, etc. but your argument for accepting it anyway was "yes, but it's for real-time audio support, so it's really important and more important than what everyone else is working on", you'd be laughed out of the mailing list.

The fact that you use the phrase "prioritize security" is indeed telling. Security is just one aspect of a system. There is no particular reason for Linux to prioritize it above everything else, no matter what twitter infosec drama queens believe.

Obviously the fact that infosec people are quite often insufferable does not help their case.

When you're developing such a crucial part of an operating system, wouldn't you want to put security pretty close to the top of the priorities list?

If I were an average consumer, I would care much more about my device being secure, than having real time audio.

You act as if we don't have an experiment about exactly this going on for decades that says the exact opposite, but we do. It's called OpenBSD. The average consumer doesn't care, because it's less performant in some cases and certain software doesn't work or work as well.

It's not even a choice between better security and real-time audio, since the average consumer doesn't even know about that unless specifically called out by marketing. For phones it's about what looks better, both physically and digitally. It's about how the emojis look, how good the pictures the camera takes look (or how good you're told they look), and how responsive and smooth the screen movements are.

The average consumer goes off what they can immediately see and what they're told by marketing, and by what they feel social pressure to buy. The discerning technical expert goes off marketing (but a different set of claims), and a bit more of a discerning eye, and while far more knowledgeable than the average consumer, is still mostly driven by hearsay.

The number of people with enough knowledge to actually make a real data driven choice is probably much less than 0.0001% of people, and that's far from average. I'm not one of them, but I can look at the systems often affected, make some assumptions about how many people know enough about them to speak usefully on risks they actually have, and do some napkin statistics to know almost nobody else is either, even here.

It's easy to call out the average consumer, but truthfully, the last time you bought a phone or computer, how deeply did you analyze the actual security considerations to do with the different aspects of the system, and how much did you rely on what some site told you, trusted recommendations, what you already preferred, and your hunch was which was better? How many millions of lines of code are involved in these systems now? How could you, or any of us actually do anything other than that?

>If I were an average consumer

If you were, you would behave like one, i.e. not care that much (if at all) about security. What you are saying is "I do care about my device being secure".

Yes, I suppose that's true. I should have worded it differently.

Perhaps the average person would be more upset, or more likely to notice, if they were negatively impacted as the result of a security issue than if some feature, e.g. real-time audio, were missing - which I'm sure no one would even notice.

You’d think no one would notice real-time audio being missing on Android devices, but that discounts the huge amount of music creation that’s done on mobile platforms these days. It’s a way bigger market than you’d think!

Security should indeed come before anything else.

If your system gets p0wned there is hardly any audio to play.

The security improvements in macOS, iOS and Windows - the musicians' choice for real-time audio - show it is possible to put security first while offering a good audio stack.

Security is always in service of something, not the other way around. The highest point for security is coming together with something side by side, but not before it.

That is how languages like C or JavaScript get adoption.

It’s “telling” of what exactly?

Linux distributions make up the majority of public web and database servers and approximately none of the real-time audio players, is that not a “particular reason” to prioritise security over real-time audio?

I'm not saying that Linux should neglect security, but there is no reason for it to be above anything else. People who care about security can look at systems that are more focused, like OpenBSD.

Agreed on fair prioritization, but most people care about Linux security because that is the major platform. OpenBSD is an obscure platform that is not an option for the vast majority of users, and no phone uses it.

The Linux kernel is a glaring example of software malfunction, due to its combination of moderate defect density and incredible extent, along with a culture intolerant of competence. People who became subsystem maintainers because they happened to be hanging around a mailing list in the 90s are still gatekeepers of important subsystems despite their now-decades-long records of continuous malfeasance. Patches that demonstrably improve the health of the project are rejected if they would reduce the powers of these gatekeepers. We should look at the whole project as a cautionary tale of the kind of leveraged destruction that some programmers of modest ability but extreme confidence can wreak on our industry.

It's bad enough that syzbot finds fifty serious bugs per hour, but I'll relay a personal anecdote. Earlier this year I wagered a colleague that I could open up the source of the 4.10 kernel (the one that was once current in Ubuntu 16) and find an obvious defect in less than an hour. It actually only took me about 15 minutes, to find a deadlock in the squashfs that was triggered by kmalloc failure and an error path via goto, which of course nobody should ever use. And while I'm reading it I'm just thinking to myself that this is the worst program I've ever seen and it would never pass a code review at my workplace, but it's out there right now running on billions of computers.

You’re wrong that no one should ever use goto. Goto is a perfectly fine control flow operator IF AND WHEN you use it in a highly structured, well-understood way. This is how systems programming is done. A “goto cleanup” section at the end of a function is the best way to do exit-on-error in C, hands down.

I hate that people keep peddling this nonsense because they wrote a little C and read a headline about “goto considered harmful”, which is a gross oversimplification and a BAD piece of “wisdom” that for some reason won’t die. This is how serious error handling in C is done. Please stop repeating this tired trope.

Goto is only de rigueur in terrible languages that are unable to release function-local resources automatically. It is an intentional choice by kernel authors to ignore all progress in our field after 1988 and insist on writing everything in C. It’s not the goto statements that are the problem, rather it is the culture that necessitates them.

Do you realize how big the kernel source is? No one wants to rewrite it in C++. Also, C++ compiles slower and can be more difficult to read, depending on code style and features used. C++ is also far more fragile regarding compiler compatibility and sometimes spits out rather confusing error messages.

I'm sure some of the people parroting that got it from their professors in college. I know mine sure did.

Care to name and shame with supporting evidence?

It's fairly well known that small kmallocs do not fail, and that there are many, many instances in the filesystem code which assume that small kmallocs do not fail. There have been two LWN articles on this exact subject.

And goto being used for error handling is pretty stock standard across most C codebases I’ve worked on or seen over the years, so I’m not sure what the particular gripe is there

Since when is "we do it all the time" the same as "it's a good thing"?

goto for error handling is not just "freeform anything goes goto". It's a very specific idiom: an "error" label and a bunch of "if (resource) free(resource)" statements at the end of the function. It is essentially analogous to a common use case of Go's defer, and typically an accepted pattern when dealing with many resources and possible exit points. Prevalent in I/O-heavy code.

Different ballgame from the subject of Dijkstra's manifesto.

> It is essentially analogous to a common use case of Go

Go, which is also known for its terrible error-handling.


Really? I have never seen a Go program misbehave without printing some meaningful output. It is possible, but almost no one ignores the error return value. Yet I have seen many, many Java and Python programs which quit after the first unhandled exception - so often that I consider exceptions the bad error-handling mechanism.

It's both common and a good idea, when used purely internally to a function in order to redirect error paths to a standard set of cleanup code. It helps prevent memory errors and other problems.

This is the sense in which the Linux kernel uses goto.

I'm not experienced in C or kernel code, but from my understanding they use it almost entirely like a catch/finally clause in many higher-level languages, which is a widely accepted and successful pattern.

Properly used goto is essentially an inline exception handler. That is a good thing to eliminate repetitive fault checks and fragile cleanup code.

Since when are standard practices the same as "it's a good thing"?

Well-written code is the best code. Some well-written code uses goto. Some poorly written code doesn't use goto.

Yeah, that’s the rumor, but allocations of any size can fail when kmem cgroup accounting is enabled and the container is out of space.

Well, if the allocation is done with the GFP_ACCOUNT bit set, i.e. with kmemcg accounting enabled, isn't it the intended behaviour that the allocation can fail?

Yes that is the point. In the relatively rare case of failure this particular function uses a goto to return without releasing a held mutex. Error paths within the kernel are a rich vein of malfunction.

"We should look at the whole project as a cautionary tale of the kind of leveraged destruction that some programmers of modest ability but extreme confidence can wreak on our industry."

Oh man, can I put that on a t-shirt?

The biggest problem is that instead of vendors submitting drivers for their devices to the mainline kernel and profiting from all the fixes being done there, everyone makes a fork and continues working there. Obviously, when merging updates from the mainline kernel into the forked one, something is discarded or lost.

> The biggest problem is that instead of vendors submitting drivers for their devices to the mainline kernel and profiting from all the fixes being done there, everyone makes a fork

The other half of the problem is the companies that actually use these garbage dump forks and build products on top of them.

For me, getting the SoCs and chips we use running on latest upstream kernels was a high priority in platform bringup.

I only used SoC vendors' garbage dump SDKs for quick testing and some reference. And chip vendors' drivers I ported straight to the upstream git version.

Of course this isn't how it goes in companies where "shit to market" is top priority.

Could this be an issue of not appropriately identifying the impact of the bug? If it was reported by an automated tool and was easy to fix, perhaps the developer failed to fully investigate the problem, failing to realize it was a critical vulnerability and have the fix backported.

KASAN identified it as a use after free bug. That makes it very plausibly a security vuln. What more do you want to automate?

The problem is there are so many of those.

So should we be fuzzing the stable branches separately?

syzbot is already fuzzing the latest two stable kernels and has found hundreds of bugs, including lots of use-after-frees. All these bugs are listed here:

- https://syzkaller.appspot.com/linux-4.14

- https://syzkaller.appspot.com/linux-4.19

As far as I know, no one is doing anything with the syzbot bugs against stable kernels directly, since no company using Linux is paying anyone to do it as their job. But some are getting fixed; e.g., some get reported against mainline too, then fixed and backported.

How complex would it be to test all known issues against all current kernels?

A weekly report with some easy to understand graphs would probably convince more people to work on these bugs.

Time and cost, same as it would be to do it across all kernel versions, not just current ones. Theoretically it could be done pretty simply via a CI/CD pipeline if someone wrote solid test cases for the issues found by the fuzzer.

That's my point! Make this part of the kernel regression suite and also run it on the old kernels.
