The Linux kernel will fix some peculiar argv usage in execve(2) (utcc.utoronto.ca)
53 points by ingve on June 6, 2023 | 54 comments



There are three things I expect to find in mailing list discussions before going off and reading them:

* Someone suggests that the kernel return EINVAL for null or zero-length argument vector. Someone else then comes up with a mad but very real mainstream program that relies upon the system call succeeding.

* Someone points out that the SUS requires that argv[0] be non-null. Someone else tries to weasel a difference between "shall" and "should", overlooking the SUS rationale that the leeway given is in the string contents, not that it is permitted to be outright null.

* Someone suggests in all seriousness that this behaviour be retained for historical compatibility.

For reference:

* FreeBSD just returns EINVAL for a zero-length argument vector, https://github.com/freebsd/freebsd-src/commit/773fa8cd136a57... . This came from OpenBSD.

* A null argument vector has been EFAULT in FreeBSD since 2004, when someone noticed that the manual disallowed this, https://github.com/freebsd/freebsd-src/commit/7700eb86e7740c... .

* DragonFly BSD has been fixing up a zero-length argument vector by adding in a dummy non-null argv[0] since 2005, https://github.com/DragonFlyBSD/DragonFlyBSD/commit/66be6566... . It introduced EFAULT for a null argument vector at the same time.

* Illumos has returned EFAULT for a null argument vector since at least the point when it went open-source in 2005.
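
As a rough illustration of the FreeBSD-style outcome in the list above, here is a user-space sketch of the same two checks. The function name `check_exec_argv` is hypothetical; it is not a kernel or libc interface, just a pre-check one could run before calling execve().

```c
#include <errno.h>
#include <stddef.h>

/*
 * Hypothetical user-space pre-check, not a kernel or libc API.
 * Returns 0 if the vector is acceptable, otherwise the errno value
 * that the FreeBSD behaviour described above would report.
 */
int check_exec_argv(char *const argv[])
{
    if (argv == NULL)
        return EFAULT;   /* null argument vector */
    if (argv[0] == NULL)
        return EINVAL;   /* zero-length argument vector */
    return 0;
}
```

A caller would run this first and only invoke execve() when it returns 0.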


> Someone else tries to weasel a difference between "shall" and "should",

I find this sentiment surprising coming from someone pedantic enough to have written:

https://jdebp.uk/FGA/han-unification.html

https://jdebp.uk/FGA/fga-not-faq.html

https://jdebp.uk/FGA/questions-with-yes-or-no-answers.html


That "Only ask questions with yes/no answers if you want "yes" or "no" as the answer" is so tilting... by the way, did you know that in e.g. Russian there are no separate words for "can" and "may", so the taunts of the "Can I please go to the bathroom? — I don't know, can you?" kind are even more infuriating than in English?


> in e.g. Russian there are no separate words for "can" and "may"

I disagree. For "can", in Russian you would use the verb мочь (in the case of your example: я могу сходить в туалет?, lit. "I can / am able to go to the bathroom?"); for "may", you would use the predicative можно (in the case of your example: можно сходить в туалет?, lit. "is one allowed / is it possible to go to the bathroom?").


>This came from OpenBSD

Thanks, you saved me some looking, I thought OpenBSD did something in this case and was going to poke around.

Too bad the BSDs and Linux could not agree on what to do :) But really, for what I develop this is a non-issue.


The last point ("we don't break userspace") has always been a core tenet of Linux. Linus essentially refuses to alter kernel behaviour just to be "correct" according to POSIX lawyers


Linux does break userspace; either when no one would notice (in which case it's arguably not really "breakage") or when there's a serious enough issue that really can't be resolved in any other way. "Don't break userspace" is a hard rule, except when it's not.

The question here is: will anyone notice the breakage? And is this a serious enough issue to risk it?


> either when no one would notice (in which case it's arguably not really "breakage")

This is the idea, but the Kernel maintainers aren’t omniscient, so “no one would notice” really means “the maintainers don’t think anyone would notice”, and this ends up being somewhat arbitrary.

A breaking change is a breaking change: just because you don’t think anyone is relying on the existing behaviour doesn’t make it any less of a breaking change.

I really don’t like it when Linus says the Kernel never breaks userspace (he says this a lot and uses it as an excuse to abuse people on the mailing list) because it’s just factually inaccurate, and “no one will notice” isn’t really good enough if you expect people to actually rely on this property.


> “no one would notice” really means “the maintainers don’t think anyone would notice”, and this ends up being somewhat arbitrary.

Yeah, that's pretty much what I meant, but this is phrased better than I did.

I do think that "Kernel never breaks userspace" is accurate in the common-understanding sense and a good policy to have, even though it's indeed not accurate in the sense that it's "100% technically accurate" (Linux also just breaks things by accident for example).

You could use a qualified statement like "Linux takes great care not to break userspace, except when there is an overwhelming case in favour of it, or when we think no real applications will break with it, and we don't always roll back accidental breakage if we don't think it's worth it" and that would be more accurate, but that's rather vague, and a clear "never break userspace" is much better as a policy for everyone.

In interviews and such Linus is of course much more nuanced than "never break userspace".


Yes, I’m splitting hairs over the semantics perhaps.

I think given some of Linus’s LKML rants directed at people who have broken userspace, it’s reasonable to take the claim at face value and assume that there is a hard guarantee of backwards compatibility (modulo bugs).


I think people sometimes pay a bit too much attention to "random Linus posts/rants on LKML". As far as I've seen, these kinds of outbursts were never the default/standard behaviour but the exception, and I believe Linus has toned it down a bit in recent years.

I'm not a fan of his outbursts by any means, but you have to see these things in context. Imagine everything you post on the internet will be read by heaps of people and still be quoted years from now. Often my posts on e.g. HN don't express my full nuanced views either, because it's not germane to the topic at hand, because I phrased things badly, and sometimes I'm just dead wrong. No one quotes all the stupid stuff I said 15 years ago though, or the stupid stuff I said yesterday.


Linus' rants were usually directed at high-ranking maintainers who really should have known better than to accept those.


Well, adding a two-paragraph-long disclaimer when people competent enough know what he means anyway would be counter-productive.

The fact that it sometimes needs to be broken if it is unfixably insecure is an obvious conclusion for anyone who thinks about the problem instead of looking for reasons to complain.


> The question here is: will anyone notice the breakage?

A few months ago I wrote a feature request and patch to GNU coreutils that resulted in some discussion of this NULL argv issue.

https://lists.gnu.org/archive/html/coreutils/2023-03/msg0000...

https://lists.gnu.org/archive/html/coreutils/2023-03/msg0000...

https://lists.gnu.org/archive/html/coreutils/2023-03/msg0001...

> An edge case is we may want to support allowing to specify a NULL argv[0].

So the GNU developers are aware of this issue. Perhaps they make use of it in other GNU software. Not sure.

It seems my patch wasn't accepted into coreutils. I offered to add support for NULL argv, they said they were deciding what to do but I never got a reply. So env doesn't have this functionality yet.


No, that would be the first point. The last point is retaining the status quo simply for the sake of it, which sometimes happens but quite often does not, or out of a misinformed idea of what historically has actually been the case, which happens sometimes too.

It's worth noting that being correct according to the SUS would have avoided the problem, since the SUS calls out the applications-code assumptions that cause the vulnerabilities in its rationale.


> Someone else then comes up with a mad but very real mainstream program that relies upon the system call succeeding.

Mandatory xkcd reference https://xkcd.com/1172/


In this case, it was a kernel filesystem driver test suite, and the test suite isn't in fact relying at all upon argv being a zero-length array. It's not part of any workflow.

Indeed, the test suite is, rather, this very bug waiting to happen. Still. Two years later.

https://git.kernel.org/pub/scm/fs/xfs/xfstests-dev.git/tree/...

Easily solved by:

    static char *argv[] = {
        FILE1,
        NULL,
    };

The xfs test program invoked by the execve() is calling getopt_long_only() passing the zero argc and a zero-length argv that execve() was given. It is quite ironic that that test program is thus exceeding the argument vector bounds and going off into some adjacent part of memory, because one of the first things that getopt_long_only() does is:

    optind = 1; /* Don't scan ARGV[0], the program name.  */


In modern programming you would reject these arguments as invalid and refuse to perform the execve syscall with something like EINVAL. Instead we now have this additional quirk that's going to be part of the Linux lore forever and people may start relying on as a feature.


At the end of the day, any system used in the real world is going to have "lore", odd quirks that only exist because of historical reasons. Sure we can try to prevent it with good programming practices, but some things are usually only obvious in hindsight, and the real world is messy.


It's all easy if you can throw backward compatibility away after every release and tell users to fuck themselves.


I, for one, would love for the Kernel to tell users to f themselves sometimes and break compatibility. In that order.


Then nobody will want to write software for your OS. But yeah, supporting shit software that connects with your software sucks.


In modern programming: Someone a dozen dependency-of-dependency-of-dependency steps removed from you reads an RFC tome they found under the sofa, promptly changes the API to return EINVAL for a call that previously succeeded and breaks half the world. But at least it's modern, quirk-free and pure of thought.


I think this happened recently on macOS: the libc string parsing functions started complying with the C spec and broke some code (so you now need to check errno and the return value if you want to know whether the result was valid or not)

I don't think it was deliberate, I suspect they may have switched libc or something.


>First, you can pass in a zero-length argv (ie, where argv[0] is NULL), in which case the executed program will have an argv[0] that is NULL. A variety of programs will then be unhappy with you, as people discovered in CVE-2021-4034. This option exists more or less because this API was easy back in the old days of Unix.

Input validation, unsound assumptions, relying on magic values (NULL termination)

Boring stuff still causes damage


Classic array bounds error. Obviously, if argc == 0, you should not dereference anything in argv.

It also shows why C is such a hard programming language. There is nothing to help the programmer in this case. The compiler doesn't know that argc is the length of argv, so there is not even a warning if you do it wrong.


> Obviously, if argc == 0, you should not dereference anything in argv.

Quoting ISO C17:

> argv[argc] shall be a null pointer.

Not constrained on argc being greater than zero (unlike the points that follow).

So argv should always be dereferenceable within the program. It's another question if this should be enforced by the Linux kernel or the C runtime.

https://cigix.me/c17#5.1.2.2.1.p2


You are looking at the wrong standard. Look at the one for the operating systems' APIs.


If you are writing a C program, then you should consult the C standard, and possibly POSIX. The Linux kernel + C runtime + C compiler should form a conforming implementation of the C language.

Now if the Linux kernel doesn't promise to pass a non-NULL argv (and documents so), then it should be the responsibility of the C runtime to fix that once the program reaches "main". But as I wrote earlier, it's a whole other question.
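
To make the "C runtime fixes it up" idea concrete, here is a sketch of what such a shim could look like. This is purely hypothetical: `normalize_main_args` and `empty_vector` are made-up names, and I am not claiming any existing C runtime does this.

```c
#include <stddef.h>

/* Hypothetical start-up shim; no real C runtime is being quoted here. */
static char *empty_vector[] = { NULL };

void normalize_main_args(int *argc, char ***argv)
{
    if (*argv == NULL) {
        /* Substitute an empty but properly terminated vector. */
        *argv = empty_vector;
        *argc = 0;
    }
}
```

After this runs, argv[argc] is a null pointer as ISO C requires, although argc can still be 0.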


So now go and actually consult the POSIX standard, as I just said. You clearly haven't.


Why does it even matter what the POSIX standard says? The C standard dictates that `argv[argc] == NULL`, and hence argv shall always be dereferenceable. Thus, it is very clear that the C standard was violated by the previous Linux+crt. What POSIX says is entirely irrelevant here - if I write a standard C program, not a POSIX program, and it fails to run on the platform, then that platform has violated the C Standard.


Because you haven't grasped it either. No-one is saying that argv[argc] != NULL. But in the case being discussed by M. Siebenmann, argc is 0.

The Single Unix Specification says, very clearly:

> The argument arg0 should point to a filename string that is associated with the process being started by one of the exec functions.

> The value in argv[0] should point to a filename string that is associated with the process being started by one of the exec functions.

Then it goes on at length in the rationale at the bottom of that same section about "the use of the word should" and even explains how programs are tripped up by argc being 0.

As you can see, both masfuerte and planede have still clearly not read this, despite it being said twice in the description and having two entire paragraphs devoted to its reasoning in the rationale.


I really don't know what you're trying to argue here.

Yes, that spec places additional requirements on top. But looking at just the C spec is apparently enough to find a violation here, and that's relevant! (Especially because the C spec doesn't say "should".)

People wanting to point that error out doesn't mean they're failing to grasp anything.


Think it through. Yes, both you and xe haven't grasped this. M. Siebenmann's scenario is that argc is 0, and argv[0] is NULL. There's nothing in the quoted

> argv[argc] shall be a null pointer.

constraint that that violates. argv[argc] is a null pointer. You have not found a violation of that constraint.

The constraint that is violated is, rather, in the Single Unix Specification. It's the SUS that puts the constraint upon how execve() may be invoked by a strictly conformant application, and requires that (as it says twice) argc be "one or greater", because it constrains the first element of the argument vector to be a pointer to a string, not a null pointer. The _only_ leeway is the wiggle room in "associated with", which it devotes an entire paragraph to explaining.

More than one standard applies. The idea that only the C standard covers the writing of these programs is wholly wrongheaded. After all, the C standard also allows non-POSIX implementations.


There are two scenarios here.

One scenario is that argc is 0 and argv[0] is NULL.

The other scenario is that argc is 0, argv is NULL, and argv[0] does not exist.

The C standard makes the second scenario invalid, but Linux was allowing it.

This is all directly relevant to the first two posts in the comment thread, talking about "relying on magic values (NULL termination)" and "Obviously, if argc == 0, you should not dereference anything in argv. [...] shows why C is such a hard programming language". Even though they were talking about C by itself, that's actually incorrect: C guarantees the null termination, and this rule was being broken by the OS.


While we are at it [0]:

> The meanings specified in POSIX.1-2017 for the words shall, should, and may are mandated by ISO/IEC directives.

By those directives "should" indicates recommendation. [1]

> In POSIX.1-2017, the word should does not usually apply to the implementation, but rather to the application. Thus, the important words regarding implementations are shall, which indicates requirements, and may, which indicates options.

You also have to differentiate between any requirements for the argv argument when passed to one of the exec functions and the argv parameter received in main. Any requirement for the former constrains the application, but requirements for the latter constrain the implementation.

exec is indeed only documented in POSIX and OS-specific documentation, but main is also documented in ISO C. Both POSIX and ISO C constrain the argv parameter of main not to be null. They don't require argc > 0 or argv[0] != null.

From [2]:

> [under exec] The argument argv is an array of character pointers to null-terminated strings.

This is a hard requirement, without "should". This implies that argv must not be null itself when passed to exec. This probably gives implementations the freedom to do whatever in this case, POSIX doesn't define the behavior. The requirement for argv[0] (which implies argc > 0) is merely a recommendation.

[0] https://pubs.opengroup.org/onlinepubs/9699919799/xrat/V4_xbd...

[1] https://www.iso.org/foreword-supplementary-information.html


Conversely, the C parts of the POSIX standard are supposed to be a conforming extension to ISO C. POSIX even spells this out.

https://pubs.opengroup.org/onlinepubs/9699919799/functions/V...


So I just looked at the POSIX standard for the exec functions. It says the argv array is terminated by a null pointer when it arrives at the C main function. But a null argv doesn't cause exec to fail. Which implies that the OS or C runtime must fix it up between exec and main.


I assume you looked up [0].

> [under int main (int argc, char *argv[]);] The argv and environ arrays are each terminated by a null pointer. The null pointer terminating the argv array is not counted in argc.

This confirms that the argv function parameter of main cannot be null.

I couldn't find the behavior specified for passing null for argv in exec*. As other comments pointed out, some BSDs return EINVAL. I think both this and Linux's behavior are compliant.

[0] https://pubs.opengroup.org/onlinepubs/9699919799/functions/e...


You're right. I misread the list of errors as exhaustive.


But you don't have argc, only argv. The C stub will derive argc by traversing argv and then pass both to main().
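
Whoever ends up doing the counting, it boils down to a walk to the terminating null pointer. A sketch, with `derive_argc` as a made-up name and on the assumption that argv itself is non-null:

```c
#include <stddef.h>

/* Hypothetical stand-in for the counting step; assumes argv is non-null. */
int derive_argc(char *const argv[])
{
    int argc = 0;

    while (argv[argc] != NULL)
        argc++;
    return argc;
}
```

A zero-length vector, where argv[0] is the terminator, comes out as argc == 0.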


"Obviously, if argc == 0, you should not dereference anything in argv."

I'd say that's not obvious. Normally argv is null-terminated so you can read argv[argc], and it's zero. There's no need to read argv[argc] if you've already looked at argc but sometimes it's more convenient to rely on the sentinel value.


In my opinion that is like saying it might be more convenient to use sprintf.

Yes, that null has to be there for historical reasons. It doesn't mean it is a good idea to write code that relies on it.


> Yes, that null has to be there for historical reasons. It doesn't mean it is a good idea to write code that relies on it.

That null has to be there because if it isn't then the implementation is broken, not the program.

If you are coding defensively because the implementation might be broken, then you can't rely on argc being positive either


It seems that (some version of) the C standard contains these words:

argv[argc] shall be a null pointer

So I wouldn't feel too bad about relying on it.


Wait what bugs are caused by argv's null termination? Aren't all these bugs caused by the fact that argc can be 0 and programs don't expect that, not that argv is null terminated?


Relying on argc instead of null termination is a bit more robust since the check will still work even if you skip entries for any reason.
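
A sketch of what I mean, using a hypothetical parser for "-key value" pairs (the name `count_complete_pairs` and the pair format are only for illustration). The index advances by two, so a dangling key with no value would make a loop that only tests argv[i] != NULL jump straight over the terminator, while the argc bound still stops it:

```c
#include <stddef.h>

/*
 * Hypothetical parser: counts "-key value" pairs after argv[0].
 * The condition i + 1 < argc guarantees both the key and its value
 * are inside the vector, even though i advances two entries at a time.
 */
int count_complete_pairs(int argc, char *const argv[])
{
    int pairs = 0;

    for (int i = 1; i + 1 < argc; i += 2) {
        if (argv[i][0] == '-')
            pairs++;
    }
    return pairs;
}
```

With { "prog", "-a", "1", "-b", NULL } the trailing "-b" is simply not counted, and with argc == 0 the loop body never runs.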


I agree that using argc is better than null termination, but I don't understand how any of the bugs related to argc==0 are about null termination.

Note that the buggy code behind CVE-2021-4034 literally uses argc, not null termination.
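
The general shape of that kind of bug, as a simplified sketch rather than the actual source of the affected program (both function names here are made up): a loop that starts at index 1 and is bounded by argc, followed by code that uses the resulting index without checking it.

```c
#include <stddef.h>

/*
 * Simplified illustration, not the real code behind the CVE.
 * Skips leading "-option" arguments starting at index 1 and returns
 * the index of the first non-option argument.
 */
int first_operand_index(int argc, char *const argv[])
{
    int n;

    for (n = 1; n < argc; n++) {
        if (argv[n][0] != '-')
            break;
    }
    return n;  /* still 1 when argc == 0, i.e. past the terminating null */
}

/* Guarded variant: reports -1 when there is no argv[0] at all. */
int first_operand_index_checked(int argc, char *const argv[])
{
    if (argc < 1)
        return -1;
    return first_operand_index(argc, argv);
}
```

With argc == 0 the unguarded version hands back 1, and a caller that then reads argv[1] is reading whatever follows the terminator; as I understand the published analyses, on Linux that is the first environment pointer.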


It is ugly as hell

Whenever I see

>while char != NULL

I'm a little bit sad. It doesn't feel robust.


See also: LWN's coverage from last year: https://lwn.net/Articles/882799/


Seems it's come up before: https://bugzilla.kernel.org/show_bug.cgi?id=8408

Rejected as "documented" in 2007. Also mentions a similar issue with envp.


Isn't Linux severely breaking the userspace with this change? What about programs that expect argc to be 0 or argv or argv[0] to be NULL?


They haven't been portable to other operating systems, and have been wrong, for at least 18 years at this point. They will have hit EFAULT or EINVAL on other operating systems. You can see the commit where Matthew Dillon introduced EFAULT and an argv[0] bodge into DragonFly BSD in 2005, for example, on this very page.


Well, when every possible change would cause something to be wrong you have to pick your poison.



