Linus Torvalds: “Do No Harm” (lkml.org)
566 points by ekianjo on Nov 22, 2017 | 231 comments



I wrote the email that prompted this quite civil response. I'm very pleased with the outcome, because I think this clear statement of his position is a lot more useful for people to work with, rather than just assuming Linus hates security or something.

I interpreted his response in practical terms as essentially being the following. Patch set merge 1 has "report" as default and "kill" as a non-default option. Patch set merge 2 has "kill" as default and "report" as a non-default option. Patch set merge 3 removes support for "report". This way we have the best of both worlds: we eventually reach the thing that actually adds real security benefit, which makes security folks happy. And we don't break everybody's computers immediately, allowing time for the more obvious bugs to surface via "report", which makes users and developers happy. Seems like a reasonable process to me.


Though, I think the time between PSM1 and PSM2 will be significant. Usually defaults are changed only once basically all distros have been compiling with the other option without widespread breakage. And once no LTS kernel with PSM1 is still supported, you merge PSM3.

It might take years, but at least the airplanes keep flying instead of crashing their computers and, consequently, themselves.


> Though, I think that the time between PSM1 and PSM2 will be significant.

Indeed you're probably right there. Fortunately security-focused distributions and individuals would be able to change the defaults in the interim.


I think the most important point here is that with those different patch sets, the more security-conscious users/companies get to put the properly hardened version into use immediately, rather than running the "only report" versions for what might be years, as you say.

Granted, I'm looking at this from a perspective where our company compiles its own kernel for use in embedded devices, so we can apply whatever patch sets we want. But I think that's much better than everyone having to use the "report only" patches for years, or even worse, the features never getting into the kernel in the first place.


Depending on the bug class there may be users who never want "kill" to be the default option.


The other important "users" are the developers and drive-by-developers of those user space processes which may accidentally trigger these bugs. These folks are highly likely to be able to resolve these issues if they are reported [in a way which is visible to them].

Counterpoint: most developers and users are not actively following their logs at any level, and maybe something should be done to make it more common. Possibly stderr logging of such errors in libc (or some other commonly used library which already sometimes logs errors on its own, like glib). 3:- )


It would be awesome if this could be set by a sysctl, instead of setting a kernel option and recompiling the kernel.
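
For what it's worth, the kernel already exposes a very similar "how fatal should a detected error be" policy as runtime sysctls - kernel.panic_on_oops is the classic example - so a report-vs-kill knob for this hardening could plausibly take the same shape. A minimal illustrative C snippet flipping that existing knob (not any new hardening sysctl) from userspace:

    /* Illustrative only: toggling an existing, analogous sysctl from C.
     * kernel.panic_on_oops is a real knob (0 = try to keep running after
     * an oops, 1 = panic); a hypothetical report-vs-kill sysctl for the
     * new hardening could be set the same way via /proc/sys. Needs root. */
    #include <stdio.h>

    int main(void)
    {
        FILE *f = fopen("/proc/sys/kernel/panic_on_oops", "w");
        if (!f) {
            perror("fopen /proc/sys/kernel/panic_on_oops");
            return 1;
        }
        fputs("0\n", f);  /* 0 = report and keep going; 1 = panic on oops */
        fclose(f);
        return 0;
    }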


Patch Set Merge 1.5?


I interpreted the mail as (among other things) "killing processes is not acceptable behavior", or at least not acceptable default behavior.


Is that really a position Linus has maintained for a long time? Because I got the feeling Linus really just hated anything to do with security.

It's only in more recent years, when the automotive and IoT industries started to get involved in the Linux Foundation and to ask for more security features, that he seems to have tried to find ways to "compromise" with security people.


You're going to get a very different perspective depending on whether you actually lurk on lkml or just read it when something gets linked by social media or "news" that needs to wrap it in a hot take to make lookie-loos interested in lkml.


It's only civil because it's a follow-up; usually it's only his first email in a thread that follows the classic (notorious?) Torvalds style.

For those who want it, here's his first email in the thread, profanity and all: https://lkml.org/lkml/2017/11/17/767


Nope, his first email in the thread was this:

https://lkml.org/lkml/2017/11/17/423

Where he's quite civil and explains quite clearly why he won't accept the patch, and what should happen for the patch to be accepted. Kees's reply insisting on the merge is what led to the profanity-laden email, and frankly I understand that (not condone, but understand from a human-reaction point of view) - how many times does one have to reiterate his viewpoint to others to make himself heard?



Normally I'm not a fan of Linus, but this:

> IT IS NOT ACCEPTABLE when security people set magical new rules, and then make the kernel panic when those new rules are violated.

> That is pure and utter bullshit. We've had more than a quarter century _without_ those rules, you don't then suddenly walz in and say "oh, everbody must do this, and if you haven't, we will kill the kernel".

> The fact that you "introduced the fallback mode" late in that series just shows HOW INCREDIBLY BROKEN the series started out.

This makes perfect sense and outlines a real problem with security patching in general.


Sadly this kind of magical security thinking has many proponents higher up in the Linux stack, and they have the backing/support of GKH. Thus I worry about what will happen the day Linus gives up the reins.


I honestly don't think Linus will give it up until he's in a box. He lives and breathes the kernel.


Background: the Kernel Self Protection Project (KSPP) recently upstreamed the Grsecurity/PAX reference counting implementation, which prevents a certain class of security bugs from being exploited.
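
For context, the upstream kernel's hardened reference counting primitive is the refcount_t API (inspired by PAX_REFCOUNT): it saturates and warns on overflow instead of wrapping. A rough kernel-flavoured sketch of its use - only refcount_inc(), refcount_dec_and_test() and kfree() are real API here; the surrounding struct and functions are made up for illustration:

    /* Hardened reference counting with refcount_t (<linux/refcount.h>).
     * refcount_inc() saturates and WARNs on overflow rather than wrapping
     * to zero, closing the classic overflow -> use-after-free pattern.
     * struct session and session_get/put are illustrative placeholders. */
    #include <linux/refcount.h>
    #include <linux/slab.h>

    struct session {
        refcount_t users;
        /* ... */
    };

    static void session_get(struct session *s)
    {
        refcount_inc(&s->users);            /* saturates instead of wrapping */
    }

    static void session_put(struct session *s)
    {
        if (refcount_dec_and_test(&s->users))
            kfree(s);                       /* last reference dropped */
    }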

Grsecurity is a security hardening patchset for Linux that makes deliberate trade-offs in favor of security, sacrificing availability if necessary. This, aside from the political issue, is the main reason why it's hard to upstream it. Linus has called some of their mitigations "insane" before precisely for that reason. Grsecurity will rather terminate userland programs or, in some rare cases, panic the kernel if it finds itself in an undefined state. This is exactly what you want if you care about security, but it's not a trade-off everyone is happy with (including Linus).

Unfortunately, Grsecurity/PAX is not (and probably won't ever be) involved in the KSPP project, and the KSPP developers do not understand the code nearly as well as the Grsecurity team does. This led to a situation where the new code caused a crash that they weren't able to fix in time, so they disabled the feature at the last minute.

I used Grsecurity for years, until they stopped making it publicly available, and I remember many bugs that were uncovered by PAX_REFCOUNT and yes, occasionally panicked the kernel where a vanilla kernel would run just fine. They usually found and fixed those within hours.

Grsecurity/PAX have invented many of the modern exploit mitigations - in that respect they are probably second to none. Some have even been implemented in hardware. Their expertise in building modern defenses is astonishing (their latest invention, the control flow integrity mechanism RAP, is a work of art).

Linux could be the most secure kernel; instead, it's fallen way behind Windows, which has much better defenses than Linux nowadays thanks to Microsoft's ongoing battle with rootkit writers. Go figure.

If the large companies who use Linux really want to improve kernel security, they need to work with Grsecurity and not against them. It's beyond me how this isn't happening already.


>> This is exactly what you want if you care about security, but it's not a trade-off everyone is happy with (including Linus).

Let's rephrase that. This is exactly what you want if you care only about security, or care about security above everything else - including your system running at all. People run software for reasons, and they need it to keep running for those reasons. The security folks are not really qualified to evaluate the security risks against all the reasons for all the people running Linux.


No, they really are. Don't confuse doing it poorly with the ability to do it.


> From a security standpoint, when you find an invalid access, and you mitigate it, you've done a great job, and your hardening was successful and you're done. "Look ma, it's not a security issue any more", and you can basically ignore it as "just another bug" that is now in a class that is no longer your problem. So to you, the big win is when the access is _stopped_. That's the end of the story from a security standpoint - at least if you are one of those bad security people who don't care about anything else. But from a developer standpoint, things _really_ are not done. Not even close. From a developer standpoint, the bad access was just a symptom, and it needs to be reported, and debugged, and fixed, so that the bug actually gets corrected.

As a developer, I do want the report. But if you killed the user program in the process, I'm actually _less_ likely to get the report, because the latent access was most likely in some really rare and nasty case, or we would have found it already.

I don't think Linus has an invalid point there. Taken from his previous thread - https://www.spinics.net/lists/kernel/msg2540934.html

> Don't bother with grsecurity. Their approach has always been "we don't care if we break anything, we'll just claim it's because we're extra secure".

He is worried that grsecurity does not play nice with the kernel's approach of "let's make security hardening obsolete by fixing bugs in the kernel".


Fixing all memory corruption bugs is infeasible without fundamentally changing the way Linux is developed. There is so much code (and it’s being added to, changed, etc.) written by humans that make mistakes.

There will always be some bugs that are in between being discovered (by someone, maybe malicious, maybe not), and being fixed. How else do you prevent against vulnerabilities in that stage?


Linus' response is that calling it an infeasible problem is a cop-out. The right way to go about it is to fix them all, incrementally if need be, and not break userland in the process.


These comments sound analogous to real-world security and societal issues - like the tension between increasing the size of the army and addressing the underlying issues.

One is a short term solution, the other long term.


I think, given how much of our planetary computing infrastructure runs on Linux, it's very much a real-world issue.


>If the large companies who use Linux really want to improve kernel security, they need to work with Grsecurity and not against them. It's beyond me how this isn't happening already.

It's more that Grsecurity is working against everyone else. They want to pretend the GPL works in a way that it doesn't so that they can sell their patches. Then they make threats to people who say "that's not how the GPL works" and distribute their patches in accordance with how the GPL actually works. That's not how kernel development is done. I'd rather have an insecure kernel than their bullshit.


It's their work. We're not entitled to it.


Bruce Perens, however, is entitled to not be the victim of Spender's legal harassment for exercising his First Amendment rights to disagree [1].

I have no dog in this fight whatsoever; I don't know anybody involved. But in general I have little sympathy for people, however talented, who waste taxpayers' money with bogus legal action.

[1]: https://thenewstack.io/open-source-pioneer-bruce-perens-sued...


Actually, we are. That's how the GPL works.


No, it's not. The GPL on the Linux kernel means that grsec can't distribute a new Linux kernel with their patches while withholding code. That's not what they're doing. If I write a Linux kernel patch on a consulting project, I am absolutely not required to publish it.


> If I write a Linux kernel patch on a consulting project, I am absolutely not required to publish it.

That would be a work for hire, and it is not the same thing as developing patches independently and distributing them with extra terms, because there is no distribution involved.

> The GPL on the Linux kernel means that grsec can't distribute a new Linux kernel with their patches while withholding code. That's not what they're doing.

That would be true if the patches were not derivative works of the Linux kernel in a legal sense. I'm no lawyer, but that seems contrary to the plain meaning of "derivative work".

Bruce Perens' argument is persuasive to me: https://perens.com/2017/06/28/warning-grsecurity-potential-c...

As is Linus Torvalds' ("kernel patches clearly _are_ derived works"): http://yarchive.net/comp/linux/gpl_modules.html


> That would be a work for hire ....

Not necessarily, in fact perhaps not even usually.

1. The consulting contract might or might not provide for the client to own any work product created. Many such contracts provide that the client will own only the specific end product, while the consultant retains ownership of any reusable "Toolkit Items."

But what if the contract is silent about ownership of consulting work product?

2. As to copyright: Under U.S. copyright law, the default mode is that IF: An original work of authorship is created outside an employer-employee relationship, THEN: The copyright is owned by the individual author (or jointly by multiple co-authors) UNLESS: A) the work of authorship falls into one of nine specific statutory categories, and B) the parties have expressly agreed in writing, before the work was created, that it would be a work made for hire. [0] [1]

3. Any patentable inventions would be owned by the inventor(s) unless they were employees who were "hired to invent" or "set to experimenting," in which case the inventions would be owned by the employer; so far as I recall, this doesn't apply in the case of outside-contractor consulting projects — the client would not own any resulting inventions unless the contract specifically said otherwise. [2]

[0] https://www.law.cornell.edu/uscode/text/17/201 (ownership of copyright)

[1] "A 'work made for hire' is—(1) a work prepared by an employee within the scope of his or her employment; or (2) a work specially ordered or commissioned [A] for use as a contribution to a collective work, [B] as a part of a motion picture or other audiovisual work, [C] as a translation, [D] as a supplementary work, [E] as a compilation, [F] as an instructional text, [G] as a test, [H] as answer material for a test, or [I] as an atlas, if the parties expressly agree in a written instrument signed by them that the work shall be considered a work made for hire. [¶] For the purpose of the foregoing sentence, a 'supplementary work' is a work prepared for publication as a secondary adjunct to a work by another author for the purpose of introducing, concluding, illustrating, explaining, revising, commenting upon, or assisting in the use of the other work, such as forewords, afterwords, pictorial illustrations, maps, charts, tables, editorial notes, musical arrangements, answer material for tests, bibliographies, appendixes, and indexes, and an 'instructional text' is a literary, pictorial, or graphic work prepared for publication and with the purpose of use in systematic instructional activities." From https://www.law.cornell.edu/uscode/text/17/101

[2] See the annotated flowchart at http://www.oncontracts.com/docs/Who-owns-an-employee-inventi... (self-cite).


Right, but they do distribute modified kernels, and we are therefore entitled to their work.


You're spreading FUD.

I suggest you educate yourself on the reasons grsecurity patches are no longer public.


You're welcome to present an argument for your case.


> Grsecurity will rather terminate userland programs or, in some rare cases, panic the kernel if it finds itself in an undefined state. This is exactly what you want if you care about security, but it's not a trade-off everyone is happy with (including Linus).

I'd also like my kernel to halt whenever an assertion does not hold, for the sake of keeping my sanity; not just for security.

Why would you not want this?


For the same reason people drive with their “check engine” light on: It’s frequently better to have a working system (i.e. “I’m late for work”), than to chase an indicator that may not represent a real problem (an actual security intrusion).


I can't think of a single useful piece of software nowadays that is exposed to the public and can't run in an active-active load-balanced or clustered scenario. If your kernel/system/userland app misbehaves, it simply needs to be shut down, reported and examined. It might have been some random memory block the last time your app hit a buffer overflow, but it could just as well be the stack pointer next time...


Remember, we're not necessarily just talking about servers here; every single hospital has mission-critical client machines that cannot go down, and obviously those aren't load balanced or clustered. (Though mostly they seem to be running Windows.)


For safety-critical systems, resetting on a fault is very much factored into the worst-case response time and expected behaviour.

PANIC on fault is exactly what you design into the systems.


So why is the world not running on C64s?


The world does run on microcontrollers that have roughly the same processing capability as a C64...

The vast majority of processors sold are not i5 or i7 level but microcontrollers.


Web browsers are, for practical purposes, exposed to the public. Linux doesn't run only on servers.


So what happens when your browser crashes? I experience that on a regular basis. I'd rather have my browser crash/killed instead of slowly overwriting my filesystem buffers or corrupting my stack pointer... Other than that, browsers are multi-threaded/multi-process applications. Usually only a single tab or a plugin crashes unless the core browser process is affected. Most users would accept the trade off between crashed browser and infected/corrupted system.


> Most users would accept the trade off between crashed browser and infected/corrupted system.

Most users are using computing devices as a means of getting stuff done. They don't want to spend any energy thinking about how their software works; they want their devices to be invisible tools they use to run their Apps uninterrupted. The trade-off is whether to let Apps continue running vs. hard crashing and taking down all the work they've done and all the mental energy and focus invested up to that point. If their Apps frequently crash, most users aren't thinking, well, I'm super glad the hours I spent on this paper I'm working on are now lost, or that the phone calls to my loved ones or the movie I'm watching are abruptly terminated, because someone's policy of hard crashing when a bug is found has been triggered. Their preferences and purchasing power are going to go towards non-user-hostile devices they perceive to provide the best experience for using their preferred Apps, without any need for prerequisite knowledge of OS internals.

There's not a single computing device that frequently crashes as a result of security hardening that will be able to retain any meaningful market share. Users are never going to tolerate anything that requires extraneous effort on their part to research and manually apply what needs to be done to get their device running without crashing.


Apps are supposed to keep their state, either by saving your work regularly to persistent media or by keeping your data off-client. We're living in the 21st century, in a cloud era, FFS.

Keeping your app running after an integrity corruption has happened within the application puts user data at risk. IMHO an application that corrupts the saved file of a three-day-long presentation is more frustrating to every user than one that crashes on an error and leaves you with five minutes of unsaved changes lost.

Microsoft have invented "Application Recovery and Restart" exactly for this purpose.


> Keeping your app running after an integrity corruption has happened within the application puts user data at risk.

If user data is continually backed up to a remote site, it's not going to be at risk from a local bug, is it? Bugs exist in all software. Users are going to be more visibly frustrated by their Apps frequently crashing than by the extremely unlikely scenario where a detected bug corrupts their "three-day-long presentation". They're going to be very unhappy if the cause of their frequent data loss was a user-hostile setting to hard crash on the first detectable bug.

> Microsoft have invented "Application Recovery and Restart" exactly for this purpose.

From Microsoft website:

> An application can use Application Recovery and Restart (ARR) to save data and state information before the application exits due to an unhandled exception or when the application stops responding.

- https://msdn.microsoft.com/en-us/library/windows/desktop/cc9...

i.e. restarting Apps due to "unhandled exception or when the application stops responding", in which case the App is in an unusable state and ARR kicks in to try to auto-recover it with minimal user disruption. The focus is on providing a good UX, not a miserable crash-prone experience where users use their devices in fear that at any time anything they're working on can be terminated abruptly without warning.


You clearly have a limited view of application bugs. Let me elaborate a bit on bugs that cause application dissatisfaction and UX frustration without crashing - much, much worse than a simple error message along the lines of: "OS has terminated application X because it has performed an illegal operation."

Data corruption - reading or writing corrupted data: files cannot be read, saved files get corrupted, API calls from/to external applications/systems fail or pass incorrect data

Rendering problems - corrupted images, incorrect colors, improper content encoding, visual stuttering, audio deformation, audio skipping

Input/output lag - unregistered keystrokes, missed actions and responses to external events, mouse stuttering and misbehavior

Improper operation - inconsistent results: repeated rendering yields different results (HTML), formula/calculation results in data are inconsistent (Excel, DWH)

Access violation - access gained to invalid or protected areas: unprivileged access, license violations, access to areas protected by AAA, data theft (SQL injection, database dumps)

and others. If I figure out that the application I'm using (a web browser) allowed a hacker to steal data he would not otherwise have access to, I would be more pissed off than if it had crashed and I had found an error about it in the system log.


The standard Windows user will not read the system log.

Some people just use computers to do stuff; to them there is little difference between "I lost my work because of a bug" and "I lost my work because of a security policy". From a UX point of view, both are the developer's fault for releasing inadequate software.


> You clearly have a limited view of application bugs.

Please leave out the uncivil swipes.

https://news.ycombinator.com/newsguidelines.html


Meta: who flagged this comment, and why? What rule exactly does Slavius break here?

On topic: I can't recall the details now, but I once read a paper about a system which had no shutdown procedure at all; the only way to exit it was to crash it somehow or just shut down the computer. The system made sure to save everything often enough and made sure to store the data in ways which allowed for restoring possibly corrupted parts of it on the next startup. This design produced a very resilient architecture which worked well for that use case.

The paper was from the '80s or '90s, so it's not like we need to be in the 21st century to design that way. I'll try searching for the paper later.


You might be thinking of KeyKOS, and of the anecdote which can be found at https://lists.inf.ethz.ch/pipermail/oberon/2010/005734.html (it should also be at the EROS homepage, but it's down for me at the moment).

See also: "Crash-only software" https://lwn.net/Articles/191059/


Yes, exactly this! Thank you.


The flagger probably was uncomfortable with "FFS". After all colorful expression is bad for HN. b^)

What you're talking about seems like crash-only with Erlang/OTP.


It's similar in effect, but Erlang's ultimate response to errors is redundancy instead of trying to salvage whatever was left by the process that crashed. I think the transparent distribution of Erlang nodes over the network is what enables Erlang's "let it crash and forget it ever ran" approach. Joe Armstrong said that they want Erlang to handle all kinds of problems, up to and including "being hit by lightning" - so I think hardware redundancy is the right path here.

The OS[1] I've been talking about was primarily concerned with a single-machine environment, which resulted in slightly different design.

[1] https://en.wikipedia.org/wiki/EROS_%28microkernel%29


> or corrupting my stack pointer...

in that case, it will crash with a SIGSEGV sooner or later anyway


...or is being remotely exploited and it silently succeeds. Who wants that?


That is very unlikely. Crashing would happen 100% of the time though. Most people want that trade-off (meaning: if their browser crashed, they would switch to another one, even if it was less secure).


Stack pointer manipulation is the entry point for an extremely large subset of security issues.


Corrupting the SP is part of almost every exploit, and I can guarantee you that it is very likely (going to cause harm on your system). Try pulling the Metasploit Git repo to get some idea of the thousands of payloads that corrupt the SP without crashing the host...


Yes, but how many of all cases of corrupted stack pointers are exploits?


Why would that matter? We're not trying to be secure against random cosmic rays. We're trying to be secure against attackers.

http://wondermark.com/406/


It matters because we're talking about letting the browser crash on all cases.

> We're trying to be secure against attackers.

We also want a browser that doesn't crash.


Never had a single problem take down all of your instances at once, eh?


Say there's a minor error in a network driver. Yes, it might be exploitable by a smart person. But the error only triggers once a day when a counter rolls over. Do you really want your box to lock up and panic when this error is encountered, or do you just want your box to keep working.

I'm firmly in the first camp (I'll take lock up and freeze thanks) but 99% of users don't care about a bug like that and just want the box to keep working.


But do you want your box to send silently corrupted data for the next two years? Or would you rather reboot every night, and maybe escalate to your Red Hat support contract, where someone will then fix the underlying bug (for which you now have crash dumps)?


I'd want it to log that it's going wrong, and report that so that it can be fixed.


What data exactly is being corrupted?

Fail fast is a great philosophy for end-user software. But it is not that strictly good for middleware, and is almost certainly wrong for a kernel.


If you're a desktop user, or a sysadmin without said support contract, you want the former.


That Redhat support contract won't save you from a bug in a binary blob network driver.

Crashing the whole kernel at the drop of a hat seems like a pretty extreme stance to take as a general policy IMHO. Killing and restarting the driver will usually suffice, although some data may be lost and have to be retransmitted.


I want both. Panic in a test/development kernel, do not panic in a production environment.


It's the opposite...

you should panic in a production environment and reset the state of the machine (which has become indeterminate).

The correctness and validity of the data >> uptime.


As so often, it really depends. Let's say you've just detected that you're going to send incorrect data because you've ended up in an indeterminate state.

If the remote end is going to ignore that data anyways, would it really be such a bad idea to keep running? Do you really want to go down in order to ensure that a remote who's ignoring your data can get correct data to ignore?

Of course you never know what sort of effect the corrupt data is going to have, so it's always hard to make that decision.

Like that issue with libraries linking against Objective-C frameworks that cropped up with High Sierra and broke most of the Ruby world: Yes, the usage was incorrect. Yes, forking after threads are launched leads to undefined behavior. Yes, knowing about it is a good thing.

But: So far the crashes have been rare in the common use-cases (or they would have been fixed), so High Sierra's change to blow up loudly when it detects the misuse has actually caused a lot of trouble for people where things worked fine before.

To the point where many Ruby developers were complaining about High Sierra "breaking" their workflow and recommending against upgrading.

The new check is totally justified though. The existing forking behavior was wrong and it could have lead to crashes down the line. It didn't though. And now people are forced to fix something that was never an issue to begin with.

It's a fine line to walk and while I generally prefer things to blow up as they go wrong, sometimes I catch myself wishing for stuff to just continuing to work.

On some self-reflection, I come to the conclusion that I want my cake and eat it too.


The rule I work to when I design these types of systems is that if the source of the error is internal, you should reset and avoid propagating the error. Conversely, if you receive an error from an external source, you should handle it gracefully and reject the bad message.
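
A minimal sketch of that rule in plain C (all names here are made up for illustration): reject bad external input gracefully, but treat a broken internal invariant as grounds to stop rather than carry on with corrupted state.

    #include <assert.h>
    #include <errno.h>
    #include <stddef.h>
    #include <stdint.h>

    #define MAX_MSG_LEN 512

    /* External source: never crash on a peer's bytes, just reject them. */
    int handle_message(const uint8_t *buf, size_t len)
    {
        if (buf == NULL || len == 0 || len > MAX_MSG_LEN)
            return -EINVAL;            /* graceful rejection of bad input */
        /* ... process the message ... */
        return 0;
    }

    /* Internal source: inconsistent bookkeeping means our own state is
     * already corrupt, so stop (reset) instead of propagating the error. */
    void account_commit(size_t written, size_t committed)
    {
        assert(committed <= written);  /* internal invariant */
        /* ... */
    }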


On the other hand, the mere act of panicking may corrupt data (by virtue of stopping processes). I learned this the hard way when my kernel panicked while I was shrinking a large ext4 volume (the panic was unrelated to the shrinking). It's not just a simple equation like you've claimed.


A panic should stop the processor dead; no data should be corrupted as a result. Data in flight should not be used if you use transactional I/O, and therefore will not be used if a write does not complete.


That's fine in theory, but it didn't stop my disk from being corrupted. If the computer hadn't panicked, my data would still be available.


It's not theory... your system wasn't designed to be fault-tolerant in that manner.

Systems that matter, are.


The linux kernel needs to work with both tolerant and non-tolerant systems. Saying it needs to work a specific way that completely breaks real world things is completely naive, and exactly what Linus was railing against.


I use Linux mainly to write LaTeX these days; I don't want my kernel to panic, I do want my machine to stay operational and not corrupt my work.

The kernel (for my intended usage) should intentionally panic only if there is a risk of corrupting my .tex files.


Absolutely panic in a production environment.

Potential data corruption is far worse, so is a potential security compromise.


It really depends on the use case!

E.g. performing industrial control automation or airplane rudder control in a completely segregated network.

You want control over these tradeoffs, not hardcoded behaviors.


Linux is ill suited for that.

Look at seL4, minix3, eChronos instead.


Well, only that most devices were never tested for months on a development kernel ... and it is not possible to do so, with all those millions of different devices around.


> Why would you not want this?

    I was writing paper, on a PC, that was like "pip pip pip pip pip" 
    and then... like half of my paper was gone.. and I was like... 

    It devoured my paper.

    It was really good paper. And then I had to write it again and 
    had to do it fast so it wasn’t as good. It’s kind of... a bummer.
https://www.youtube.com/watch?v=VMt2MK67-Qw


Linux is used in so many critical systems.

What happens when a security bug stops the ventilation machine of a person lying in a hospital bed, or halts the screen of a surgeon?

Not to mention voting machines, ISP's, telecoms.

For me having all those stopped, when properly exploited, looks more like a very scary DoS attack vector.

Imagine a security f*ck up, like Heartbleed, but this time with an option to halt kernels / systems.


First things first: Kernels panic and processes crash. If your medical equipment or telco/ISP system can't recover from that then you're in trouble anyway. Why they crash doesn't really matter in that context.

As far as voting machines go, kernel panic sounds waaay better than executing malicious code.

> Imagine a security f*ck up, like Heartbleed, but this time with an option to halt kernels / systems.

IIRC heartbleed didn't allow you to execute code (it allowed you to read more memory than you should have been able to). A better example is every flash player bug ever. Would you rather that thing crashes or executes malicious code? Keep in mind that the malicious code can also shut down your system.

Also keep in mind that we're talking about userspace programs right now. This thread is about kernel bugs. Userspace programs already have the option to ask the kernel to kill them if they misbehave. A lot of them do that (using features like seccomp filter) and many more should. (Chrome and Firefox both use seccomp filter I think.)
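
As a concrete userspace illustration of that opt-in (strict mode rather than the BPF filter mode real browsers use, purely to keep the sketch short):

    /* Minimal sketch: a process asking the kernel to kill it on unexpected
     * behaviour via seccomp strict mode. After the prctl() call only
     * read/write/_exit/sigreturn are allowed; anything else gets SIGKILL.
     * Real sandboxes use SECCOMP_MODE_FILTER with a syscall whitelist. */
    #include <stdio.h>
    #include <unistd.h>
    #include <sys/prctl.h>
    #include <linux/seccomp.h>

    int main(void)
    {
        if (prctl(PR_SET_SECCOMP, SECCOMP_MODE_STRICT) != 0) {
            perror("prctl(PR_SET_SECCOMP)");
            return 1;
        }

        write(STDOUT_FILENO, "sandboxed\n", 10);

        /* fopen() needs openat(), which is not on the strict whitelist,
         * so the kernel kills the process right here. */
        fopen("/etc/hostname", "r");

        write(STDOUT_FILENO, "never reached\n", 14);
        return 0;
    }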


I see. It's ok because we'll just pass the buck and make it someone else's problem.


Let's say somebody gives you an USB stick and you plug it into your laptop. Which of the following scenarios would you like to see?

1. 0-day in the kernel's USB code. You're part of stuxnet now.

2. 0-day in the kernel's USB code. You're part of stuxnet now. You also get a message that tells you how and where to report the bug that was exploited.

3. 0-day in the kernel's USB code. Your computer crashes. You're not part of stuxnet. You also get a message that tells you how and where to report the bug that was exploited.

Linus is wrong (it happens). Exploit mitigation techniques aren't debugging tools. They're exploit mitigation techniques. The fact that they also produce useful debugging information is secondary.


This is exactly the kind of thinking Linus is talking about. In 3, I lost my work. Possibly very important work. To most people, being a part of stuxnet, while undesirable, is preferable to losing their work.

And you neglected a scenario 4: nobody is attempting to compromise my machine, but a buggy bit of USB code just crashed my system and took all my work with it.


You never learned at school to save what you are working on often? It's crazy; you are either too old to have needed a computer for school work or too young to have lived through years of constant bluescreens.

Both 3 and 4 are mitigated by you saving your document often... it's not so bad, considering it can happen whatever you do.

Nowadays, Word is made to keep saving your changes for that reason... They learned and designed it for the worst situation, which is a whole-system crash. If you can't handle that, well, you aren't doing your work well.


Again, passing the buck. You know what I do instead of use your software that crashes all the goddamn time? I use someone else's software that doesn't.

Yes, of course we should save often, have decent backups, etc. But nobody is perfect and shit happens, and it'd be nice if the software you use didn't intentionally make it worse.


The problem is, what actually happened (in a previous commit) was:

The IPv6 stack does a perfectly sensible and legal thing. The hardening code misunderstands the legal code, and causes a reboot.

That is what Linus is worried about -- often it is hard to tell the difference between "naughty" code which can never be a security hole, and genuine security holes.

They should all be fixed ASAP, but making previously working code reboot a user's computer, when it is perfectly fine, is not a way to make friends.


Bugs in the hardening code are obviously bad and annoying, but that's beside the point. All bugs are bad and annoying, especially ones that cause a kernel panic. I don't think anybody is going to argue with that.

That's not what Linus said though. What he said is:

    > when adding hardening features, the first step should *ALWAYS* be
    > "just report it". Not killing things, not even stopping the access.
    > Report it. Nothing else.
and:

    > All I need is that the whole "let's kill processes" mentality goes
    > away, and that people acknowledge that the first step is always "just
    > report".
"Not killing things, not even stopping the access." Oh boy.


Step back a bit: when developing a new SELinux policy, won't you develop first in permissive mode, and only after it's working without warnings, enable enforcing mode? It's the same thing here: the hardening should be developed first in a "permissive" mode which only warns, and then, after it's shown to be working without warnings, changed to be "enforcing" (in this case, however, after some time the "permissive" mode can be removed, since new code should be written with that hardening in mind).


I didn't mean that to sound like I'm in favor of turning the thing on right away.

(Also, the quotes I chose don't really help me make my case but I don't want to edit now since you've already commented on it. His first mail is way worse: https://lkml.org/lkml/2017/11/17/767)

Basically what I'm disagreeing with is that exploit mitigation's primary purpose is finding and fixing bugs. That's just not true. Its primary purpose is to protect users from exploitable bugs that we haven't found yet (but someone else might have).


By first step, Linus just means "for a year or two". Yes it would be nice to put super high security on today, but instead we slowly turn up the setting, from opt in to opt out to forced on, to ensure we don't break anything.


4. 0-day in the kernel's USB code. Your computer crashes. You're not part of stuxnet. You also get a message that tells you how and where to report the bug that was exploited, but the part of your computer that was supposed to log the message died with the rest of the system, so you never see it and the bug never actually gets reported. Your computer continues to crash randomly for the next few days as an infected computer keeps trying to spread.


> What happens when a security bug stops the ventilation machine of a person lying in a hospital bed, or halts the screen of a surgeon?

Linux is not a kernel for these kinds of uses. Whoever does that is doing a disservice to the people.

Operating systems like INTEGRITY RTOS or similar are the only ones able to match the security and quality requirements for such deployments.

https://www.ghs.com/products/rtos/integrity.html


"Linux is used in so many critical systems"

It is? I learned that in those areas you use different, much simpler and therefore more stable operating systems.


> Linux is used in so many critical systems.

It shouldn't be. It's ill suited for that.

Look at seL4, minix3, echronos instead.


Well, apparently even minix3 is not free from critical vulnerabilities ;)

https://security-center.intel.com/advisory.aspx?intelid=INTE...


Intel's shitty apps running on Minix != Minix itself


Minix 3?


Hooold it. Some of those things are not like the others.

--

I pity the engineers working on ventilation machines and the like. Medical devices are insanely hard to get right; that's neck and neck with aviation testing. I'm reminded of SQLite3's "aviation-grade" TH3 testsuite, which apparently has 100% code coverage. Let's be honest; Linux's monolithic design can't really attain that.

I would never use Linux for a medical device. I say this as someone who just happens to only be running Linux on every machine in the house right now (and I have for years, it's just how things have worked out, it's not at all novel or whatever, my point is that I'm totally comfortable with it). I'd use L4 or something instead. In a pinch I'd use a commercial kernel with tons of testing. Maybe I'd even use Minix; I'm quite sure a lot of people in industry are seriously looking at it now Intel have pretty much unofficially greenlit it as a good kernel (lmao).

--

Voting machines, on the other hand; I'd totally use Linux for that, because the security/usage model is worlds apart. Here, I WOULD ABSOLUTELY LIKE FOR THE TINIEST GLITCH TO CRASH THE MACHINE, because that glitch could be malware trying to get in.

The user experience of a voting machine is such that you walk up to it, identify yourself, and push a button. Worst case scenario in this situation is that you do some involved process to ID yourself and then the unit locks up, so you have to redo the ID effort on another unit. That is, for all use cases, not going to be a problem.

(I think that's the first time I've used all caps in years!)

--

Telecom systems... those are also a totally different world. See also: Erlang. In this situation you would likely want a vulnerability to literally sound a klaxon on a wall, but have the system still keep going.

I'm reminded here of an incident where a country's national 3G system was compromised (not the US, somewhere else) by hackers and the firmware of the backend systems was hot-patched (think replacing running binary code - the OS allowed it, it was REALLY hard to even notice this was happening) to exfiltrate SMS messages and cause calls to certain numbers to generate a shadow call (which ignored mic input) to an attacker-controlled number as well.

Telecoms is a classic case of massive scale; nowadays a single telecom switch might be routing thousands of calls through at a time. Yeah you don't want even a single machine to go down. But you DO want VERY thorough debugging, auditing and metrics.

(Which apparently don't exist.)

--

As for a Heartbleed-esque catastrophe, apparently one is going to be announced for Intel ME at the upcoming Blackhat(?) conference in December. I can't wait to hear about it myself.


Erlang's error-handling model is good (and interesting). The motto is: "Let it crash".

Each process does not handle errors at all, but PANICs on a fault. It is up to the supervisor (with global knowledge and state) to handle the fault appropriately.


It's good for uptime, but not good for correctness. The main problem is that it is hard to differentiate expected from unexpected crashes. Something like a missing pattern match can lead to a crash and it is very hard to know if the programmer "intended" for a crash to occur in that case or if the missing pattern is a bug.

You can have processes that reboot once every few minutes running for years because people didn't realize they were bugged.


>As for a Heartbleed-esque catastrophe, apparently one is going to be announced for Intel ME at the upcoming Blackhat(?) conference in December. I can't wait to hear about it myself.

Light on details, but the vulnerabilities are disclosed and fixed [0]. ME updates are already available from many OEMs.

[0] https://security-center.intel.com/advisory.aspx?intelid=INTE...


Right. But "don't apply the patch!" is sort of circulating as well, because (presuming the Blackhat disclosure is workable - it sounds like it will be, but fingers crossed) we might be able to play with our MEs.


Many medical devices run Linux. Most (AFAIK) patient monitors run Linux; GE and Philips (the biggest in the business) both run on Linux. Those are the devices that keep you alive during surgery, make sure that those who are born too early (I don't know the English term here) are doing ok, monitor your state while you are in an ambulance, etc.


No...

Many medical devices run Linux as a User-Interface... (or Windows for that matter).

The actual safety-critical portion of these systems is rarely running Linux, but rather on a bare-metal micro.


That makes a lot of sense.

I'm reminded of a UAV doing the same thing. It ran L4 for low-level control, realtime scheduling, and security, and then virtualized Linux on top of that.

Sounds unbelievably clunky on the surface, then you realize it's a remarkably useful way to abstract everything cleanly.


Born prematurely..


For safety-critical systems, resetting on a fault is very much factored into the worst-case response time and expected behaviour.

PANIC on fault is exactly what you design into the systems.

What you find is that the truly safety-critical portion of the system is running on a microcontroller and the UI (which is not safety-related) can run on Windows or Linux.


Thank you for that!

Seems like so many in security fail to see the DoS implication.

There is no solution to bugs other than fixing them. And that's what Torvalds and others have been saying: for a security researcher, finding the bug is the end of the job. For developers, that's just the start.


A better way, of course, is not to halt on assertion, but to limit the scope of any potential problem in such a way that an assertion can only crash a tiny isolated thing and trigger its restart, possibly not impacting availability whatsoever. You still get your sanity, but also get users happy with a rock-solid thing that just works even in the presence of errors.

The idea is known as supervision trees.


The Erlang VM works this way, if someone's looking for a program that does this in practice. They have the mantra "let it crash". Something higher than you is in a better position to handle your error and restart you to a known good state.


I think a good tradeoff could be that with containers, the individual containers are hardened, whereas the kernel's host OS is not. The host OS doesn't do much except keeping the containers running.


AFAICS, Linus also wants it, but he wants a panic to be preceded by a rather lengthy span with just a warning, allowing the concerned dev to actually fix the error. Essentially he's saying: "take it slow, and don't break user experience."


Due to the aggressive nature of grsecurity, a lot of the assertions it trips on are bogus; either they didn't understand the code they were securing or they changed the rules without properly updating all the affected code. For example, there was a particularly obnoxious panic in the tty layer a few versions back that was entirely the result of this.


Servers: yes, phone/home PC: no. Maybe my dev machine but my wife wouldn't be happy with a kernel panic while writing an email. Like Linus said, this would go unreported with the average user because they just reboot in annoyance.


More and more it feels like the _sec world wants to go back to C64s dialing into big irons...


Because you'd rather just do your work and not have to deal with unnecessary kernel panics?

If the system can continue running, it should do.


When the invalid write overwrites some piece of data your application doesn't care about (or more likely some feature in some driver you don't care about). Especially when the trade-off is the web site goes down.


For the cases where it was a false-positive.


A.k.a. 99.9999% of the time.


> Why would you not want this?

You don't want your machine to start crashing after installing the latest kernel. Or at least, that's the golden rule of Linux development.

If phones start crashing after installing the latest Android update, people won't see this as a security/stability improvement. They'll simply see the new version as buggy and of poor quality.


It's a deliberate trade off. Not panicking results in uncaught exploitation attempts, and panicking will result in crashes where a vanilla kernel would happen to survive.

It should have been made a sysctl toggle.


If an assertion does not hold, you have a real problem. The kernel has in-memory data corruption. Either from buggy code, bad memory, solar radiation, etc.

So if that assertion is in the file system, maybe your kernel should die before it corrupts your data permanently.


Depends. How attached are you to getting work done today?

I too would like to live in a world where sanity preserving assertions have rational consequences. But that world is not this world, and pretending it is won't help you get there.


> This is exactly what you want if you care about security,

A deliberate panic could be the basis of a denial-of-service exploit.


"Grsecurity will rather terminate userland programs or, in some rare cases, panic the kernel if it finds itself in an undefined state. This is exactly what you want if you care about security, but it's not a trade-off everyone is happy with (including Linus)."

If you really cared about security, you'd leave the box unplugged.

"I remember many bugs that were uncovered by PAX_REFCOUNT and yes, occasionally panicked the kernel where a vanilla kernel would run just fine. They usually found and fixed those within hours."

Speaking as someone who has done middling large scale production administration, that's not reassuring.


Why isn't Grsecurity publicly available anymore?


https://grsecurity.net/announce.php

Reading between the lines, it seems to be a money thing.


Sigh. Tragedy of the commons.


Because Linus hates them? Make Brad Spengler the primary maintainer of the Linux kernel and we may actually get self-driving cars that don't kill us when they get hacked in 5 years.


> Make Brad Spengler the primary maintainer of the Linux kernel and we may actually get self-driving cars that don't kill us when they get hacked in 5 years.

But only because every self-driving car project out there will avoid Linux like the plague.


Instead of self-driving cars maybe crashing from being hacked in some possible future, we'll get kernel panics leading to crashes in all possible futures, because we'll trigger car crashes on every false positive - because crashing in the face of the unknown is a seemingly acceptable solution to a security risk, even when the software is running self-driving cars and crashing may mean crashing. In practice, false positives are way more common than exploits, and in many use cases people would rather have 1 computer exploit than 1,000 or 10,000 crashes.


You would rather have a self driving car in an undefined state, rather than having it shut down? A random glitch could be just as bad as an exploit; if some chunk of memory gets overwritten and your car decides that the brick wall doesn't actually exist any more, I don't think whether it was an an exploit or not really matters. The occupants end up injured either way.


The undefined state might be in the GPU driver handling the heads-up display, or maybe in the sound subsystem. No need to shut down the system at the kernel level for that. Report the issue to the userland, so that it can decide whether to initiate a safe halt at the sidewalk or emergency lane.

After telling the kernel to shut down immediately you don't have that option any longer.


It's interesting to see this laser focus on a particular kind of user. If you're running Linux on a server, you're a user, but unless you're very irresponsible you would probably rather your programs crash than give away private information. Your interface is to a cluster of machines where individual crashes are probably not that big a deal.

If you're running Linux via Android, you're a user, but mostly you're a user of actively developed apps on top of an actively developed OS, usually pegged to specific kernel versions. Your interface is to that layer on top, and given that its code is written by app developers and hardware vendors who will ship anything that doesn't crash, you probably want security bugs to crash.

It seems to me that the kind of user Linus means when he talks about "the new kernel didn't work for me" is a user of Linux without any substantial layers on top, where kernel updates happen more often than userland software updates, and where individual crashes have a significant impact. In other words, users of desktop Linux.

But I wonder if that focus on desktop Linux really reflects the majority of users. And, if not, perhaps it might make sense to have "hardening the Linux kernel" as the first step if it makes "raise the standard for the layers built on top" the endpoint.


>you would probably rather your programs crash than give away private information

Crashing on a security issue is a good thing for every kind of user. Crashing on a latent bug that COULD be exploited (maybe not exploitable at all) is a totally undesirable situation. The problem here is that hardening methods lack the ability to make that distinction.


> Crashing on a latent bug that COULD be exploited (maybe not exploitable at all) is a totally undesirable situation.

How do you square this with the reality that "keep on truckin" is generally the path from bugs to security exploits, and has been shown to be over and over in the wild?


Bugs will happen; that's a natural law of computer science. If you keep on trucking over them, you will be delivering buggy software that is likely to cause problems. Even if you chase them down and correct them all, your software is still going to have bugs; that's a fact of life.

Should code containing bugs be allowed to run? If the answer is no we must ask ourselves how much software we have today that is completely bug free (that will be 0%).

I still think these proactive approaches are good to disclose possible exploits, but killing processes just because they might be exploitable is a very long shot.


And yet one of the big complaints about Windows of old was how often it crashes.


Crashing randomly for no good reason isn't the same thing as crashing on a security exception.


One of Linus's points is that most of the crashes will not be because of an active attack but because of a possibly latent bug that sometimes appears and might very well be non-exploitable. So people running servers would probably like to have everything working instead of having random crashes in processes or drivers that are not the core of your service but can affect it. And given the size and complexity of the kernel, it would not be strange to have these crashes appear only on certain setups and not necessarily on the ones of the people testing it first.


Sadly, fault-tolerant clusters where you can tolerate the loss of a single machine aren't the norm.

There are many (MANY) server applications or industrial use cases that do not handle random kernel panics very well.

I still prefer "crashing" over "silently ignoring critical errors", but you cannot generalize it like that.


I don't think this is a fair characterization.

Errors are less likely to be actual exploits on servers too. When a kernel panic is caused by a faulty driver, failing network hardware, or userland software failures, it can take down multiple servers or all of them at once.

Most of the information on servers is private but not sensitive. You don't want just anyone to have access, but correct functioning and security warnings are more important than maximum information lockdown.

BTW, I don't see a reason for not having a kernel option to turn warnings into kernel panics.


It’s not even desktop users; most desktop users download Ubuntu and never touch anything, on reasonably common PC hardware. Kernel regressions mostly get caught in the Ubuntu betas or testing tracks (e.g. Debian Sid).

The typical user the kernel developers focus on here is a kernel developer: always running the latest kernel with a stable user space. I find it extremely narcissistic that they reject security improvements for billions of devices for what essentially just makes developers' lives easier.

