Hacker News new | past | comments | ask | show | jobs | submit login
Banned C standard library functions in Git source code (github.com/git)
502 points by susam on Aug 25, 2019 | hide | past | favorite | 312 comments



I'm glad to see that setjmp() and longjmp() are still allowed.

I'm just kidding by the way. For those C programmers who haven't encountered these before, it is a powerful way to do a "goto" in C. Powerful in the sense that you can jump anywhere, not limited to the same function. If it's used at all these days, it's used for exception handling.

More info: https://en.wikipedia.org/wiki/Setjmp.h


The way I see the list is that it isn't meant to be exhaustive, it's meant to be functions that contributors are likely to try to use. A new contributor might not realize that they should be using Git's internal strbuf.h instead of the libc string functions. It's a way to give them feedback on that from their compiler, before they spend more time on the patch and send it to the mailing list.


Also seems mostly for functions which look innocent and are likely to be used but are very easy to misuse.

setjmp/longjmp is not in this category, it might be argued to be easy to misuse but doesn't look innocent and is not likely to be used in normal codebases.


Might be nice if the error messages suggested such alternatives instead of just saying "banned"?


I believe the git project welcomes suggestions and discussion via the mailing list, and also accepts patches that like you suggested above.


I had quite some fun playing around with it to obfuscate code. I did find that jumping to a different position in the same expression broke on a lot of compilers as soon as optimization was turned on.

Here's a simple program to output a string with the characters in an unusual order.

    #include <stdio.h>
    #include <setjmp.h>

    #define J(x,y) (longjmp(x,y),0)

    int main(int argc, char **argv)
    {
        jmp_buf j[011];
        int x,X=0;
        signed char* i = " eehcef lo u ef'oYurls " +021;

        (x=setjmp(j[X++]))?J(j[x],*++i):((x=setjmp(j[X++]))?J(x[j],*--i):((x=
        setjmp(j[X++]))?J(x[j],(putchar(*i),*i+=128)):(x=setjmp(j[++X]))?((x<
        0)?J(1[j],5):J(2[j],3)):(x=setjmp(j[++X]))?((x>>X)?J(1[j],6):J(1[j],3
        )):(x=setjmp(j[++X]))?((x&0x80)?J(1[j],7):J(1[j],5)):(x=setjmp(j[++X]
        ))?(((X<<5)&x)?J(1[j],4):J(2[j],X)):(x=setjmp(j[++X]))?((x&128)?J(0[j
        ],X):J(2[j],4)):(setjmp(j[3])>=0)?J(j[2],6):0));

        return 0;
    }


> the same expression broke on a lot of compilers as soon as optimization was turned on

C is not a "try it and see" language. If you stray from the spec then things will act differently and break between compilers. Section 7.13.2 specifies that only local variables declared as volatile will have a determinate value after a longjmp.


ISO C says that any local variables that are not declared volatile whose values are mutated after the context is saved with setjmp then have indeterminate values when the context is restored with longjmp. So yes, that interacts with optimization. Any variable that is cached in a register may revert to its old value when that register is restored blindly from jmp_buf.


You, sir, have a crafty and evil mind. Well done.


This was super fun to decipher!

1. The first time through this giant ternary defines 9 different small subroutines. Each subroutine can consider the value of "x" to be their argument. X is used as an index into the jmp_buf (subroutine) array during setup, but after that it's always just equal to 8

2. Subroutines 0, 1, and 2 are special. Their argument is the index of another subroutine to jump to. They mutate the character buffer state and then jump to the specified subroutine

3. Before jumping, subroutine 0 moves back a character, subroutine 1 moves forward a character, and subroutine 2 prints the current character and marks it by setting its high bit. Once a character is marked, it will never be printed again

4. All other subroutines get the current character value as their argument

5. Subroutines 3, 4, 5, 6, 7, and 8 all check, in different ways, whether the current character has been marked: x < 0, x >> 8, x & 0x80, (8 << 5) & x, x & 128, x > 0

6. Execution starts at subroutine 3

So you can annotate the code as:

    // Subroutine 0:
    //   move to next char;
    //   goto x;
    (x=setjmp(j[X++])) ? 
     J(j[x], *++i) :
    
    // Subroutine 1:
    //   move to prev char;
    //   goto x;
    ((x=setjmp(j[X++])) ? 
     J(x[j], *--i) :
    
    // Subroutine 2:
    //   print and mark (set high bit on) curr char;
    //   goto x;
    ((x=setjmp(j[X++])) ?
        J(x[j], (putchar(*i), *i+=128)) :
    
     // Subroutine 4:
     //   if (curr char is marked) {
     //     // goto 1(5);
     //     move to prev char;
     //     goto 5;
     //   } else {
     //     print and mark curr char;
     //     goto 3;
     //   }
     (x=setjmp(j[++X])) ?
        ((x<0) ? J(1[j],5) : J(2[j],3)) :
     
     // Subroutine 5:
     //   if (curr char is marked) {
     //     // goto 1(6);
     //     move to prev char;
     //     goto 6;
     //   } else {
     //     // goto 1(3);
     //     move to prev char;
     //     goto 3;
     //   }
     (x=setjmp(j[++X])) ?
        ((x >> X) ? J(1[j], 6) : J(1[j], 3)) :
     
     // Subroutine 6:
     //   if (curr char is marked) {
     //     // goto 1(7);
     //     move to prev char;
     //     goto 7;
     //   } else {
     //     // goto 1(5);
     //     move to prev char;
     //     goto 5;
     //   }
     (x=setjmp(j[++X])) ?
        ((x & 0x80) ? J(1[j], 7) : J(1[j], 5)) :
     
     // Subroutine 7:
     //   if (curr char is marked) {
     //     // goto 1(4);
     //     move to prev char;
     //     goto 4;
     //   } else {
     //     print and mark curr char;
     //     goto 8;
     //   }
     (x=setjmp(j[++X])) ?
        (((X << 5) & x) ? J(1[j], 4) : J(2[j], X)) :
     
     // Subroutine 8:
     //   if (curr char is marked) {
     //     // goto 0(8);
     //     move to next char;
     //     goto 8;
     //   } else {
     //     print and mark curr char;
     //     goto 4;
     //   }
     (x=setjmp(j[++X])) ?
        ((x & 128) ? J(0[j], X) : J(2[j], 4)) :
     
     // Subroutine 3:
     //   if (curr char is not marked) {
     //     print and mark curr char;
     //     goto 6;
     //   } else {
     //     exit;
     //   }
     (setjmp(j[3])>=0) ?
        J(j[2],6) : 0));
    
    // X = 8
    // goto 3(0);
A more straightforward way of writing this would be

    #define IS_CURR_CHAR_MARKED() (*i < 0)
    #define PRINT_AND_MARK_CURR_CHAR() do { putchar(*i); *i += 128; } while (0)
    int curr_sub = 3;
    
    for (;;) {
        switch (curr_sub) {
            case 3:
                if (!IS_CURR_CHAR_MARKED()) {
                    PRINT_AND_MARK_CURR_CHAR();
                    curr_sub = 6;
                } else {
                    return 0;
                }
                break;
            case 4:
                if (IS_CURR_CHAR_MARKED()) {
                    i--;
                    curr_sub = 5;
                } else {
                    PRINT_AND_MARK_CURR_CHAR();
                    curr_sub = 3;
                }
                break;
            case 5:
                if (IS_CURR_CHAR_MARKED()) {
                    i--;
                    curr_sub = 6;
                } else {
                    i--;
                    curr_sub = 3;
                }
                break;
            case 6:
                if (IS_CURR_CHAR_MARKED()) {
                    i--;
                    curr_sub = 7;
                } else {
                    i--;
                    curr_sub = 5;
                }
                break;
            case 7:
                if (IS_CURR_CHAR_MARKED()) {
                    i--;
                    curr_sub = 4;
                } else {
                    PRINT_AND_MARK_CURR_CHAR();
                    curr_sub = 8;
                }
                break;
            case 8:
                if (IS_CURR_CHAR_MARKED()) {
                    i++;
                    curr_sub = 8;
                } else {
                    PRINT_AND_MARK_CURR_CHAR();
                    curr_sub = 4;
                }
                break;
        }
    }


That's quite impressive work. It becomes a bit clearer if you think of the string as a tape. Then a big case statement with Left,Right,Mark seems rather Turing complete :-)


> you can jump anywhere, not limited to the same function

Outside of the current function, yes. Anywhere, no.

Longjmp allows you to return to places you’ve already been and have saved with setjmp. Also attempting to longjmp to a function that has since returned is undefined behavior. Thus longjmp’s usual use case is like exceptions in other languages, going back up the call chain, skipping intermediate functions.


I think reasonable people can debate this one. setjmp()/longjmp() are very useful when working with libpng [1] and jpeglib [2], as you mentioned, as a crude exception handling mechanism. I have seen these functions used in production safely for that purpose. I can imagine other horrible uses for them though. Unlike things like strcpy() which are security-holes-by-design, setjmp()/longjmp() should be in a "carefully code review the exact way these are used" list rather than a banned list.

1: http://www.libpng.org/pub/png/libpng.html 2: https://www.ijg.org


The trouble is how they compose with the expectations of code you don't write. It's just not normal in C to expect execution of your function to abort halfway through, and so memory leaks and broken state are almost guaranteed when they're mixed with the wrong third party library.

It might work great when you wrote it, but it might not even survive the next change to the codebase, and 10 years out, who is to know what programmers will come along and plug new stuff in without noticing the stack magic, etc

edit: another aspect of the same issue is that it can be difficult or impossible to write generic longjmp-safe code for many kinds of tasks. In C++ you have stack unwinding to e.g. delete large temporary heap allocations during a complex operation, but no such facility exists in C.


longjmp will not interrupt third party code (well maybe if it uses callbacks and you use it from a callback).

setjmp/longjmp will almost certainly require tracking any resource use carefully so that resources can be released at the setjmp point, when execution gets back there.

I mostly use it for terminating the recursive descent parsing on error. Otherwise it gets tedious to check return values everywhere in hand written code.


> very useful when working with libpng

There is no reason libpng needed to use setjmp as part of its API, and that it does so is (I hope) widely regarded as a bad decision. It's "useful" only in so much as that is the API it defines, but there is no reason that this had to be the case, and simply returning an error code would have been very very much preferable.


Didn't glibc use setjmp()/longjmp() for task switching eons ago (maybe even pre-Linux)?


These are used /everywhere/ in PostgreSQL, exactly for exception handling. The result isn't bad at all, but indeed, too powerful a tool for 99% of developers


Fortunately, 99% of developers, like 99% of drivers, are better than average.


> 99% ... are better than average.

I'm thinking there's a flaw in your mathematics...


99% of the people on Earth have a higher than average number of legs.


And a below-average number of heads.


No flaw with math/stats here, that can actually happen. I think you're confusing average/mean with median (50-percentile).

PS: I'm not sure where he's getting his numbers though


I'm 99% sure that was tongue in cheek.


I think that it would have been even funnier to say that 99% of developers and 99% of drivers are better than the median, but I only figured this thanks to your parent's comment.


https://insights.stackoverflow.com/survey/2019#evaluating-co...

> 70% of respondents say they are above average


Depends which average it refers to any of the means, or the media, or the mode can be “average” (most typically, though, it means the arithmetic mean), and it's only impossible in terms of the median.


Exactly. Anything where you have a main clump with a long tail will show this pattern--most samples will be better than the mean.


You've never been on a project that was 80% done and 80% to go?


And after the second 80% is finished, then you only have 80% left.


Not native speaker here, but I think, he just meant that, most programmers and drivers thinks they are better in respective things then, they really are.

It is just like letting programmer build a house. It is just a really bad idea.


And Lua’s exceptions are somewhat odd because they are based on setjmp/longjmp. https://www.lua.org/manual/5.3/manual.html#4.6


Apparently they're also used for error handling in libpng. That's one of the reasons the libspng author started on their project.


And that's the perfect usage for goto. Error catching and jump to fail block...


Goto requires the fail block to be in the same frame. It can be a working error strategy for failing critical errors in such a way from a failing module. You have to be aware that lots of stuff is prone to leak (which can be avoided by i.e. using some memory arena) and locks/mutexes/... might be in unclear state, so code has to be aware of that.


The destination of the goto is usually the clean-up code.


That's nice for the single frame.

Let's take the example of PostgreSQL, which was mentioned in this thread: A potential implementation (I have no knowledge of pgsql internals) might be that for a single user session all memory is from a single arena, all locks and other handles are tied to the session handle, then it can be an efficient strategy to longjmp out on a critical error (i.e. IO error) and clean the arena as well as things tied to the session. instead of bubbling this up through all frames.


It's a different use case. There was a nice article[1] about exceptions in plain C with setjmp and longjmp.

[1] http://www.on-time.com/ddj0011.htm


You're in the 1% though right?

Funny way of saying "too error-prone for anyone sane to consider using".


It is also the only way for an xlib program to survive losing its connection to the X server. xlib will let you register a callback to handle loss of connection with XSetIOErrorHandler, but unconditionally calls exit() after your callback returns. Which is extremely anti-social behavior for a library, but that is what it does.

So the solution is to setjmp before each call to an X function that might notice the connection is closed, and longjmp out of the error callback. Ugly, but it works.

These days, you should really just use XCB, or even Wayland.


That's nuke-from-orbit level stupidity!


You can build coroutines from them, too, which is really cool and useful.


I did! And the ergonomics are dreadful!

https://github.com/jaroslov/coro


I don't know why I still remember these, but if you need to build coroutines in C, ucontext.h might be a better tool:

https://pubs.opengroup.org/onlinepubs/7908799/xsh/ucontext.h...


Those are deprecated.


I had only seen them in the context of userspace threading. I wonder what mainstream libs use them. Do you know of any?


Generators are a subset of coroutines, and coroutines are useful in lexing and parsing.


Whoops, I meant that use set/longjmp() to implement coroutines. My wording in retrospect was super unclear. Sorry!


I think you mean fibers, and no you can't.


Here is an implementation of coroutines on top of setjmp/longjmp: https://fanf.livejournal.com/105413.html


How does it work for arbitrary arguments? What about function pointers. What about calling a coroutine inside a coroutine?

It also has a stack managed in runtime. I'm not sure that would work for any other application than the fun example here.


Looks more like an implementation of undefined behaviour to me.


"Please don't post shallow dismissals, especially of other people's work. A good critical comment teaches us something."

https://news.ycombinator.com/newsguidelines.html


The underlying implementations define the behavior here very well.


Well, I would not go that far. My coroutine hack relies on “unwarranted chumminess with the implementation” (http://c-faq.com/struct/structhack.html) and there are lots of incidental things that can break it. It is best treated as an example of the principles of thread switching and stack allocation.


I don't use setjmp() and longjmp() often, but sometimes I do; I use goto more often than those (but still not all the time). (I want goto in JavaScript too. I figured out a algorithm to do so, but have not implemented it.)

One use of setjmp/longjmp I have used is in ZORKMID, to deal with the debugger. At the beginning of the execute() function I have:

  while(setjmp(exception_buffer));
(The semicolon is correct; the loop body is supposed to be empty.) Normally, if you enter the debugger and then you exit the debugger to continue the execution, then it will continue from where it left off, which is likely in the middle of the execution of some instruction, and may result the same error again. But, if you change the program counter before continuing execution, then the debugger will keep track of that and will use longjmp instead when continuing execution, therefore skipping the rest of the instruction that stopped.


What can you accomplish with goto that you can't accomplish with try/catch and custom errors?


> What can you accomplish with goto that you can't accomplish with try/catch and custom errors?

You can program in C. C doesn't have try/catch...


Goto sounds like a generally bad idea in a more expressive language like Javascript. Can you give an example of some Javascript-with-goto code that would be better than vanilla Javascript?


I have sometimes encountered such thing (but I do not remember right now), but what I do know is that goto is more helpful in C than in JavaScript. That doesn't mean that goto isn't worth anything in JavaScript, but only that there are less circumstances where it is useful compared to the other stuff. (The same is true of macros.)


I remember setjmp() and longjmp() being the only way to do a STREAMS device driver I wrote 30 years ago. Don't know if things have changed since then. Ah, what memories.


Could be wrong but I believe s lot of coroutine implementations use this as well since the call preserves the frame info in the result.


I had really fun implementing coroutines, (almost-zero cost) exceptions, iterators etc with setjmp and longjmp. I think they're too useful to be banned outright (especially since C has no view on stack) since there are cases where you need to either use them or inline assembly, but I agree that they're the evilest of evilest std C functions.


I've always wondered why goto is actually considered harmful. I really should read the original paper by Djikstra...


Read David Tribble's "Go To Statement Considered Harmful: A Retrospective" instead: http://david.tribble.com/text/goto.html . It contains the entire text of Dijkstra's paper and goes over its meaning in a modern context on a paragraph-by-paragraph basis. It's very well written and makes clear the "encourages spaghetti code" argument, while true at the time, applied in a era where basic control flow constructs that we now take for granted did not exist in commonly used languages.


Thank you for the recommendation; that was a very interesting read. I find that I can't quite agree with the last "nontrivial goto" example, though. This version without explicit gotos is actually shorter, and IMHO not really any more difficult to follow:

    int parse(void)
    {   
      Token tok;
      while ((tok = gettoken()) != END)
      {   
        while (!shift(tok))
          if (!reduce(tok))
            return ERROR;
      }
      return ACCEPT;
    }
I will say that any language without goto, or where the use of goto is discouraged, should at least provide guaranteed tail-call elimination as an alternative. In this case the gotos could easily be converted into loops, but not every algorithm is so accommodating.


See also Structured programming with go to statements[0] by Donald E. Knuth.

[0]: https://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.103...


It encourages spaghetti code.


Not using it can also create spaghetti code. For example the "alternatives" often suggesting for exiting a multi-tiered loop (which often add additional boilerplate, additional variables, etc that all need to be maintained/bug free).


Indeed. Despite not having actual goto, Java went out of their way to have labeled loops and the "break label_name" syntax in order to avoid this awkwardness.



libpng uses setjmp and longjump as part of its freakin' API.


There really are use-cases for it, especially when implementing a Forth-style interpreter in C.


> If it's used at all these days, it's used for exception handling.

And in all of the CPS code.


> If it's used at all these days, it's used for exception handling.

What about co-routines?


Yeah, I remember some green threads libraries based on longjmp/setjmp.


I see zero use of those in the codebase, so perhaps they should be banned too. Or maybe they’re not on this list because nobody was using them already?


I have worked in large code bases (RSA NetWitness, Oracle Database, etc.) where setjmp/longjmp were used to create macros for exception handling using TRY-CATCH-ENDTRY blocks in C.


Personally, I feel like wrapping exception handlers in macros makes it easy for people to do unsafe things with it like dynamically allocating memory or calling system functions that can’t handle it correctly.


I see a lot of comments to the effect of "shouldn't XYZ also be banned". The answer is that we're not necessarily trying to be exhaustive. The point is to flag common errors before we even hit review, so we add new functions mostly when somebody tries to misuse them. I don't recall anybody trying to abuse longjmp() in Git's codebase yet (and no, that's not a challenge).


Peff, for the people asking in the thread, is there a place where correct alternatives are suggested or demonstrated?

I know there are a few different places that talk about how to use git's internal machinery, but not sure if any are specific to these banned functions.


The original commits mentions git's strbuf API[0] and its xsnprintf, a variant of snprintf which asserts that the destination buffer was big enough[1] (rather than just return truncation information).

For other codebases, snprintf is the usual recommendation, and careful straight buffer manipulation (mem*) iff performances are a concern.

[0] https://schacon.github.io/git/technical/api-strbuf.html

[1] https://code.forksand.com/linux/git_git/commit/7b03c89ebd103...


The commit messages that add them to banned.h discuss alternatives, though most of the explanations are Git-specific and assume you'll look elsewhere to figure out how to actually use those alternatives.



Like essentially all C programs.


Every C programmer has his/her own standard library...


I wonder about this every time I read C.

There is talk about npm/rust having massive dependency trees (and they do) because it makes taking on dependencies too easy. But I feel like C is on the opposite side of the spectrum, where managing dependencies is difficult so every C code base is rolling it’s own version of everything.

C also has an expansive standard library but it hasn’t offset re-invention of that stdlib in the ecosystem. I just see those implementations trapped inside code bases that don’t export them in a consumable way for other projects...

Personally I love writing C but rarely do because dependency management makes innovation in those code bases so time consuming for me.


It's bad that C devs have rolled their own awful, buggy custom solutions.

However, here's the upshot-- that awful, buggy, custom solution is guaranteed to remain compatible with the codebase that contains it. You won't find a single instance where such a dev "improved", "refactored", "optimized", or "modernized" that awful, buggy code in a way that stopped the rest of the code from working and then shipped that broken blob of junk. And if you did find that, there's no way they'd convince the rest of the devs that this is a useful thing to do in the interest of "moving forward" or whatever.

On the other hand you'll find zillions of dollars spent on systems to keep dependency management tools from causing exactly that problem.

Just to give a real-world example-- a user reported that an abandoned C++ plugin wasn't working. We did a race:

1. Three devs try to get a single C++ dependency of that plugin to work cross platform (OSX, Windows, Linux).

2. I tried to port the C++ plugin to a C plugin with no deps.

By the time I ported, tested, and shipped, those three devs were tracking down a bug with the build script of the C++ dependency on Windows.

Edit: granted the plugin itself is only about 2000 lines of code.


Regarding remaining compatible, I would argue that this is why modern languages implement lock files and version pinning. If you don't like a particular change but need some extra functionality or a security fix you can fork the relevant libraries or extend the functionality with an extra self written library.


It's not a guarantee that it will work, it's not that rare to encounter a minor/patch version in a library that introduced some subtle incompatible change (usually unbeknownst to the author)


That's literally the purpose of the lock file. The lock file locks the entire dependency tree. So unless you're bumping versions or you fail to save the lock file, the entire dependency tree's versions will remain the same.

>some subtle incompatible change

In statically typed languages this normally isn't an issue. Of course I'm aware that logic can also be changed, but in that case it's up to you to write appropriate tests (or just don't bump the versions of your libraries without a good reason).


It still sucks. I have a few Haskell projects from a few years ago that I wanted to compile on another system. So I froze the dependencies, moved the project, and tried to build. Solver failed. I gave up.


As far as I'm aware, Haskell's dependency system doesn't have a lock file. That's the key part to keeping things stable. Version pinning alone isn't enough.


You had the unfair advantage of working on your own :)


This is a C++ dependency manager problem though.


That's the point


From a modern package manager perspective, C++ is almost as archaic as C.


That was very true for a long time. But now with cmake and a package manager like hunter, it became quite easy and pleasant to work with dependencies. ExternalProject_Add() may also work.


Creating a lightweight library would make C programming much more productive, and I'm sure many have tried, but I wonder why none of them see widespread applications, instead of writing some half-baked, ad-hoc helper functions.

One reason, I guess, is the diverse range of applications of C, another reason is the lack of advanced features like generics and templates.

But it would still be useful to create a simple library, intended for Unix-like platforms (beyond the BSD extension of stdlib). So it's more of a cultural problem?


If I had to guess why there are no such lightweight library, it is because when somehow the library API doesn't provide what you want, you cannot hack the library itself easily. If you import the source code you may have trouble building it, and maintening such "patch" is not a lot of fun.


Yes. I guess the fact C doesn't have advanced features like generics and templates probably have made the problem worse, making it difficult to create a "drop-in" API.


There is a powerful C library that covers nearly all use cases, you merely need to write a configuration file for it. Perhaps you’ve already heard of it? Tcl.


All of them are null pointer exception & segfault free I bet.


Due to the language's design, C programs are in fact guaranteed to be free of exceptions.


Except SIGFPE.


I nominate your comment for the HN comment of the day.


Nitpicking but you are right. Null pointer bugs (null pointer dereference to be precise) are all over the place in C though.

https://stackoverflow.com/questions/4007268/what-exactly-is-...


What's a good library for this kind of boilerplate? A lightweight one if possible, i.e. not fucking glib


Salvatore Sanfilippo, from Redis fame, has a nice one https://github.com/antirez/sds.


Thanks for linking this, it's always good to see how very strong engineers solve problems. Never in a million years would this approach have occurred to me. For the interested, the README in the linked repo is done well and clearly explains the approach itself as well as pros and cons. This also reminded me that I've been meaning to watch antirez' "Writing System Software"[0] videos on YouTube for quite a while now, so thanks for that too.

[0]: https://www.youtube.com/playlist?list=PLrEMgOSrS_3fghr8ez63x...


I'd rather make it explicit that returned/passed value is something other than a regular zero terminated string. Otherwise you can easilly make a mess, if you pass zero terminated string to a function that expects some special bytes in the memory before the pointer.


I do this, but I like to just return the actual struct and also avoid typedefing pointer semantics. Can double as a dynamic array too. It's true. We all have string libraries...


I haven't looked much so I can't comment on the quality of the library but that hairdo is glorious. Props to Salvatore!


There are far too many. APR is a solid choice -- very portable, well-tested, and fast. It uses memory pools a lot, which work well for many styles but not everything.

See https://apr.apache.org/docs/apr/1.7/modules.html



What's wrong with glib, other than not being lightweight? I love glib.


libbsd has strlcat and strlcpy.

or just use snprintf everywhere.


Rust. Half-joking.


That was my first thought. No strncpy? Ok,

  while (*s1 && n--)  {
    *s2++ = *s1++;
  }
  *s2 = '\0';
it is, then.


I don't think strncpy is the function you are looking for, either. Its for zeroing fixed-size character buffers. And it doesn't indicate problems.

https://blog.liw.fi/posts/strncpy/


Nice. Is there a similar standalone C library for safe/sane string handling? (Please don't tell me to use C++.)




Windows programmers are familiar with StrSafe.h, https://en.wikipedia.org/wiki/Strsafe.h and https://github.com/dotnet/coreclr/blob/master/src/pal/inc/st... which go a lot further.

Also found https://github.com/mubix/netview/blob/master/banned.h on Github with a better list.


IIRC, StrSafe.h - and much more - came from the development of the Microsoft Security Development Lifecycle (SDL).

Code Red was the wake-up call for Microsoft and in February 2002, based on a memo from Bill Gates that first coined the phrase "Trustworthy Computing," Microsoft shutdown Windows development for the first time ever to get a handle on the security issues the products were facing.

https://www.itprotoday.com/strategy/story-behind-microsoft-s...


Additionally we get to enjoy bounds checked arrays(std::...), and iterators on debug builds, with possibility to selectively enable them in release mode.

While Windows by all means still has its security issues, the toolchain is much more security oriented than most FOSS alternatives thanks to the Windows XP wake up call.

Android and ChromeOS are probably the mostly locked down alternatives on the FOSS space.


Tangential but the fact that I had to open up the machine and remove a screw to completely replace ChromeOS with linux bothers the fuck out of me.


That's how secure boot should work. Replacing the root of trust should require serious physical access that can be tamper-evident. And yeah, out of the box, the trust is with the vendor — who else would be trusted in a device that doesn't have an owner yet?


I never thought of that. And it makes a lot more sense in that context.


Well, you can see how they are locking it down with containers sandboxing from an IO talk, by making use of gVisor and a Rust based containers.

"Linux for Chromebooks: Secure Development"

https://www.youtube.com/watch?v=pRlh8LX4kQI


My Chromebook just made me enable developer mode.


Developer mode lets you boot another OS, but leaves you open to accidentally wiping your drive if you hit the wrong button on boot. Replacing the firmware is a good idea, you do need to remove the write protect screw, but that really makes sense for some note of physical security.


Multiple wrong buttons in a specific sequence, at least on my Chromebook.


I don't know, you can use valgrind and fsanitize and ibstdc++ __gnu_debug:: containers or _LIBCPP_DEBUG under libc++.

Just know your tools...


The big difference is that on MSVC++ those debugging features are enabled by default on debug builds.

File->New C++ Project, already there.

Available to everyone, regardless of their compiler switch mastery level.

Defaults matter.


You can find an up-to-date, authoritative version of banned.h here: https://github.com/x509cert/banned


That's a surprisingly small list, missing e.g. sscanf / gets / strtok / all the other "usual suspects" at least


I don't see a huge problem with strtok as long as it isn't a multi threaded program. It is not in gets territory which is literally impossible to use correctly.


Synchronous reentry is a problem. Say some loop using strtok calls a function... how do you know that that function doesn't use strtok (or call another function that does...).

Any use of non-const static variables in general has this problem, and strtok is just one example.


> Say some loop using strtok calls a function... how do you know that that function doesn't use strtok (or call another function that does...).

I’m constantly surprised that C isn’t specified in such a way that lack-of-reentrancy can be determined at compile time. You’d “just” need a symbol table, sort of like the debug symbol table, in each compiled object, holding for each visible symbol the set of function calls marked `__no_reenter` that are predominated by that symbol in control flow. (Yes, some functions do computed jumps, like with longjmp. Just implicitly label those functions __no_reenter unless the programmer explicitly labels them __reenter!)


The problem isn't exactly reentry. It's that the function, having already exited, has modified global state, and it needs the same state to be there on the next call. If another call is interleaved (another callee with its own strtok loop) then the state is overwritten with the inner loop, and your own next call won't have its expected contents.

Whereas with a term like reentry I think of two strtok frames on the stack at the same time, which is not possible.

I think it's a bit easier to conceptualize with a function that produces a simple return value in a static buffer. Take getpwnam(). It fills a global structure with info about a given user. So you may keep a pointer to that on your stack. Then you call some other function foo(). foo looks up another user. Suddenly you can't rely on that other call to getpwnam() not having overwritten the data in the result you got back the first time.


Ah, yes. What I described wasn't reentry. A reentrant function using strtok could exhibit this problem, but it applies more widely than that.


how do you know that that function doesn't use strtok (or call another function that does...).

The same way you know a lot of other things about the codebase: By reading and understanding it. Given what strtok() is used for, it's almost always going to be working on a single context at any one time.


First, many C projects are libraries that are written for reuse in other programs. So "the codebase" is an arbitrary, unknowable thing. We don't have the ability to look at all the other code that will eventually coexist in the same process.

Second, code changes over time. Assuming you inspect 100 lines of code to ensure that they don't call strtok() ... do you sprinkle comments all over as a warning to future programmers not to add any calls to strtok()?

Finally, assuming we could evaluate all of the code in a program at once, it is impractical to build non-trivial programs that way. Instead, we rely on contracts. The idea is that if a caller conforms to its end of the contract, the callee will conform to its end. This allows us to reason about correctness of some code without reading all the other code in the universe and thinking through all the potential code paths. Any function using strtok() would be presenting a rather ugly contract clause "this function destroys strtok's static state". Even worse, such a clause would be contagious, infecting any client of that function, and so on.


Might as well just use strtok_r everywhere and not worry about if you're using strok safely or not.


"I don't see a huge problem with insert thing here as long as it isn't a multi threaded program."

Usually a bad omen. :)


I can't remember what's wrong with it, but it's responsible for about eleventy billion CVEs over the years


You can safely use gets on trusted input. For example, your program might create a pipe or socketpair to communicate between the processes that result from a call to fork. The stdout of one process feeds into the stdout of the other process.

Unless you have an insane gets implementation that writes the bytes backwards, you can also safely use it on untrusted input. Simply place a guard page at the end of the buffer. You can do the call from a forked child that you let die, taking notice of the problem in the parent. On systems without threading (because of locks in libc) you can do a longjmp to recover.


git does use multiple threads for some longer operations, like gc and pack, so using strtok would be a disaster.


Maybe gets doesn't need to be banned because it got removed in C11 and at least gcc issues a warning on use.


I hope it grows through community effort. This is a good sanity check sort of thing.


I'm curious how sscanf is considered bad?


scanf(“%s”, …) can cause a buffer overflow.


It sucks that there's no scanf("%.*s"), where you give the buffer length as a separate argument. The fact that it's in a string literal also makes using a compile-time constant really ugly.


You can overflow buffers all sorts of ways in C.


Looks like the second vsprintf definition should be "BANNED(vsprintf)", not "BANNED(sprintf)". Not that it matters much, as it only affects the error message.


Yes, you're right. Patches welcome. We do our development on a mailing list; see https://git-scm.com/docs/SubmittingPatches for details.

However, if you're more comfortable using GitHub PRs, there's a gateway interface at https://gitgitgadget.github.io/.


I submitted a patch to fix this typo from my fork (which you can see at https://public-inbox.org/git/cab687db8315dd4245e1703402a8c76...).


Nice catch! You can send a "PR", but note that they have their own custom "PR" procedure, not the standard "PR" procedure in GitHub.

> Git Source Code Mirror - This is a publish-only repository and all pull requests are ignored. Please follow https://github.com/git/git/blob/master/Documentation/Submitt... procedure for any of your improvements.


Their "custom" procedure is the `git send-email` command, which has a lot more of a claim to be "standard" than GitHub's custom procedure.


Aside: GitHub’s refusal to provide a feature to close pull requests for mirrors is utterly ridiculous since it requires that people (or bots) go around and disable pull requests and point people to the patch submission process.


What's weirdest to me is that it differs from the policy on Issues, which can be turned off in settings.

I sort of get not allowing it, except that it's inconsistent.


> not the standard "PR" procedure in GitHub

If it is specific to one implementation, then it is not standard, pretty much by definition.


the word "standard" has more than one meaning. it's not standard by the definition of standard that you chose, but you can use your context cues to realize that it's not the definition of standard that OP meant.


I’m sure they’d welcome a patch.


These shouldn't be used either (_s are available in C11, _l tend to be locale-parameter thread-safety, and _r have sizes/state for bounds/thread-safety):

- strtok -> strtok_r / strtok_s

- asctime -> strtok_r / strtok_s

- gmtime -> gmtime_r / gmtime_s

- localtime -> localtime_r / localtime_s

- ctime -> ctime_r / ctime_s

- dirname -> dirname_r

- basename -> basename_r

- devname -> devname_r

- readdir -> readdir_r

- ttyname -> ttyname_r

- gamma -> gamma_r

- lgamma -> lgamma_r

- lgammaf -> lgammaf_r

- lgammal -> lgammal_r

- atoi -> atoi_l

- atof -> atof_l

- atol -> atol_l

- atoll -> atoll_l

- gets -> gets_s

- scanf -> scanf_l / scanf_s

- fscanf -> fscanf_l / fscanf_s

- sscanf -> sscanf_l / sscanf_s

- tmpfile -> tmpfile_s

- fopen -> fopen_s

- getenv -> getenv_s

- strdup -> strndup

- strcmp -> strncmp

- strlen -> strnlen

- (Multibyte/wide conversion functions without mbstate_t parameter)

- wcslen / wscnlen -> wcsnlen_s

- wcsncasecmp / wcscasecmp_l / wcsncasecmp -> wcsncasecmp_l

- strcasecmp / strcasecmp_l / strncasecmp -> strncasecmp_l

- bzero (use explicit_bzero, in some cases)

- calloc, realloc -> reallocarray (for arrays of non-byte items)

- memmove -> memmove_s

- strncat -> strncat_s

- strncpy -> strncpy_s

- srand / rand -> rand_r

There are others that are platform-specific. Thread-safety, internal mutable state (not thread-safe), and buffer-overflows are the primary concerns that aren't necessarily applicable in all situations.


Annex K was never very popular, is still very easy to misuse[0] and has pretty much been deprecated and slated for removal.

[0] http://www.open-std.org/jtc1/sc22/wg14/www/docs/n1967.htm#mi...


Back in my individual contributor days I got into serious trouble because of this. Where can I find it has officially been deprecated? I tried and I can still use it without getting any warnings?


Interesting to see the starting commit for that file, there's quite a bit of discussion and rationale too: https://github.com/git/git/commit/c8af66ab8ad7cd78557f0f9f5e...


What surprises me in C developers is that C exists for probably 40 years but they still don't have proper strings (not just pointers). In many cases there is no large performance penalty for storing string length, and checking it, but they still use pointers or a separate pair of variables for pointer and buffer size instead of single object.


Because they have a security culture that you only get errors due to "holding it wrong" or "every good programmer does it right", despite evidence on the contrary.

Plus added the fact that early C compilers generated quite lousy code on 8 and 16 bit computers, there is this idea to micro-optimize each line of code as it is being written, without any profiling feedback of it actually matters, rather cargo cult how writting code like X is faster than Y.

For example, outside 3D rendering and audio software processing, I never saw a visible impact (to the end user) of bounds checking.


> For example, outside 3D rendering and audio software processing, I never saw a visible impact (to the end user) of bounds checking.

Indeed the computation overhead on bounds checking is irrelevant for "cold" code, but consider this

1) Pretty sure you can get bounds checking when compilers detect that you're accessing a static array.

2) Otherwise, it's unclear how to devise a system that integrates bounds checking with C semantics. (Yes, that's unfortunate!)

3) Bounds checking does at least increase code size.


1) Totally optional from ISO C point of view, not all C compilers do do it, usually requires static analysis or specific compiler warnings to be enabled. Which not everyone does.

2) Solaris does it perfectly fine on SPARC thanks to tagged memory (ADI). Which Google in collaboration with ARM will make mandatory on future Android releases as well. [0]

3) It hardly mattered in MS-DOS and Amiga LOB applications developed in across Turbo Basic, Quick Basic, GFA Basic, Turbo Pascal, Clipper, so it matters even less nowadays unless we are speaking about PIC like hardware.

Regarding bounds checking I usually refer to Hoare's turing award speech, back in 1981:

"Many years later we asked our customers whether they wished us to provide an option to switch off these checks in the interests of efficiency on production runs. Unanimously, they urged us not to--they already knew how frequently subscript errors occur on production runs where failure to detect them could be disastrous. I note with fear and horror that even in 1980, language designers and users have not learned this lesson. In any respectable branch of engineering, failure to observe such elementary precautions would have long been against the law."

Or for that matter, the DoD Multics's B2 security evaluation[1], with several remarks how PL/I made the system safer with its string handling, pointer integrity validation and bounds checking.

As noted, yes there are niche cases where bounds checking does have an impact, but for a large spectrum of code that gets daily written it isn't the case.

[0] - https://security.googleblog.com/2019/08/adopting-arm-memory-...

[1] - https://multicians.org/multics-fer.pdf


1) It's not standardized but it's very simple to implement nevertheless.

It is my understanding that 2) doesn't have anything to do with C (so given hardware support, you can have it for free whether working in C or not. I think this kind of invalidates your point).

And also, 2) is not perfect, only probabilistic (a source I found says 94% likelyhood to detect OOB).

And it works only for the most basic situations where you use the system allocator and never subpartition these allocations. So, beyond these simple cases that we can get for free without any involvement from C semantics, I still maintain that it is unclear how to devise a reasonable a useful system to do bounds checking that can be added to the C memory model. We can make up an annotation syntax to cover many of the simpler cases, but these are hardly better than plain assertions (which I regularly use).

I doubt Hoare had a good idea to add general bounds checking on a low-level language like C, otherwise that would be standardized by now.


Pointer validation via hardware memory tagging has everything to do with C, because it is only due to C's shortcommings that millions of research dollars keep being spent to try to make it work.

In what concerns Solaris, and the requirements for future Android with ARM memory tagging, the system allocator is all there is, at least from official support point of view.

Hoare hardly needed to pursue such endevour, because all systems programming languages derived from Algol, like ESPOL, NEWP, PL/I, PL/S, PL/8, BLISS,.... were sane regarding bounds checking.

Hardware validation of memory is the only way to tame C, the alternative is to just dump the language, because as proven by ISO C11 dropping Annex K, very few actually care about making the language safe.

However given UNIX's dependency on C, it is also quite clear to me that in the next couple of decades C will be around, long after I am gone, and business opportunities to create companies on top of CVE exploits due to memory corruption bugs.


> Pointer validation via hardware memory tagging has everything to do with C, because it is only due to C's shortcommings that millions of research dollars keep being spent to try to make it work.

So, to restate, Solaris does not do it "perfectly fine". Thanks for making my point.

> In what concerns Solaris, and the requirements for future Android with ARM memory tagging, the system allocator is all there is, at least from official support point of view.

That's a pity, because if you're not doing your own allocators then you'll have to accept lock contention, extreme memory overhead (for smaller allocations, say <= 64 bytes), and you'll need to match every little allocation with a deallocation, instead of making e.g. custom pool allocators.

You're just not going to write a large infrastructure (i.e. performance-oriented) system in this way.


There are countless libraries that add higher level string functions and no end to higher level languages. C fills the niche where you want something higher level then assembly but lower level then Perl, ruby, python, etc. Sometimes you want or need to manage your own memory. Arduino is a good contemporary example.


Exactly. C philosophy is to use libraries and not put things like a better string library in the core functions of the language. It keeps the language relatively clean and easy to understand, unlike c++


So the C philosophy is to provide a bad standard string library that is not thread-safe and makes it impossible to write programs without remote code execution vulnerabilities, rather than have a better library with exactly the same functionality but with more security and safety "bloat"?

/snark

Yes, C used have a cavalier approach towards security in the past. But if you're asking why the standard library is not fixed yet, I think that instead of pinning it to some lofty philosophy, it's safer to say that good C developers realized long ago that the original C strings are a mistake. Most big C projects define their own string functions and often their own length-prefixed string types. The C standard committee just gave up on fixing this issue in the standard library, but this is not due to philosophy, but because of the impracticality to force a standard solution on this stage.


Bonus points when the libraries are so incompatible among themselves that require extra conversation steps, and then just get dropped 'cause "mind the performance".


When compatibility is needed, APIs should consume pointer+length pairs. It doesn't get better than that, in ANY language, in terms of simplicity and modularity.

Libraries like Qt with extreme lock-in are at the other end of the spectrum. If it works for them, that's nice. Doesn't work for me. I don't think using a string library is in the spirit of C programming.


Which means either pointer+length get packed into their own structure to avoid mix up errors, thus requiring conversion function calls across libraries, or they are given separately manually, thus opening the door to the copy-paste mistakes from pointer+bad length that they are supposed to protect against.

Qt is as locked-in as any LPGL 3 FOSS project is.


> Which means either pointer+length get packed into their own structure to avoid mix up errors, thus requiring conversion function calls across libraries, or they are given separately manually, thus opening the door to the copy-paste mistakes from pointer+bad length that they are supposed to protect against.

No, we were talking about compatibility/modularity. You're shifting the topic.

> Qt is as locked-in as any LPGL 3 FOSS project is.

I'm not speaking about licenses lock-in, but about lock-in from an engineering point of view. Are you aware of any significant Qt projects that don't have "Q" all over their codebase?

(And yes, by contrast to GPL, I believe you can use LGPL libraries without suffering a terrible amount of (license) lock-in)


How am I shifting the topic?

Compatibility/modularity doesn't happen in the air, rather in written code.

So either one passes structs around, and somehow they need to be compatible.

Or one passes pointer + lenght as two separated variables, with the consequences to keep in sync two unrelated variables, from the compiler point of view.


Arduino uses C++.


I've used talloc [1] where appropriate to greatly simplify memory allocation and string handling.

[1] https://talloc.samba.org/talloc/doc/html/index.html


C has _only_ pointers for variable size things, not just strings.

Roughly speaking C vars are either known fixed length, or accessed via pointer. There is nothing else.

(except arrays -- which are mostly pointers)


The thing you are pointing to could have its length prefixed. I've always assumed that this isn't the case because nobody could commit to the size of the length prefix. Is it string8, string16, string32, or string64? Using a sentinel value to denote length is less opinionated and more portable.


And also unbeatable in memory efficiency.


Unfortunately, it's often algorithmically less efficient and more error prone, so everybody ends up replicating what git is doing with 'strbuf' but in slightly incompatible ways. The effect makes dealing with strings in C unnecessarily unpleasant.


For anybody else:

Why is strncpy insecure?

https://stackoverflow.com/questions/869883/why-is-strncpy-in...

> strncpy() doesn't require NUL termination, and is therefore susceptible to a variety of exploits.


It also fills all remaining bytes (if any) with nuls. ‘N’ in its name is not the same ‘n’ as in snprintf etc. It could be named fldcpy, as it works with 0-padded fixed-width fields rather than 0-terminated strings.


Yes. Also as mentioned in the post, strlcpy (when it's available) is safer.


But still not safe, thanks to the return value. Imagine the case where you are using strlcpy because you can't be sure if the source buffer is properly null terminated.


If your buffer isn't a NUL-terminated, then don't call a function that is only defined for NUL-terminated buffers. It's as simple as that.

I'm baffled by how some people claim strlcpy() is 'broken' or 'not safe' because it doesn't handle non-NUL-terminated inputs; the exact same thing applies to just about any function in the C standard library that takes strings as input. Are functions like strchr(), fopen(), printf(), strstr(), setenv() 'broken' as well?


> If your buffer isn't a NUL-terminated, then don't call a function that is only defined for NUL-terminated buffers.

It's not that people want to pass strncpy source buffers that lack NUL termination, it's that strncpy in certain situations will not NUL terminate its results.

https://begriffs.com/posts/2019-01-19-inside-c-standard-lib....

> some people claim strlcpy() is 'broken'

Speaking of strlcpy, it thankfully doesn't have the problem that strncpy does. However strlcpy is not in the C standard or in POSIX, so can't be used portably. In C99 snprintf is a better choice.


Yes, strncpy() not NUL-terminating it's output is nasty and well-known. But the comment I was responding to was claiming strlcpy() being unsafe.


Yes, because cutting off the trailing NULL is a threat vector every C programmer needs to be conscious of whenever they're dealing with strings.

But it's also a problem because maybe the programmer is using strlcpy to grab the first five lines of a 10TB memory mapped file. If you're not thinking about the implications of that return value it can be a real surprise.


This file could be made much better by having comments pointing to reasonable alternatives to the banned functions.


Why is there no brief explanations in this code why each function is banned?


A C programmer reading this list knows exactly why they are there. It is not at all a controversial list.

Edit: I guess some down voter doesn't believe me but it continues to be true. They are all string functions. Most of them do not take an output buffer size, so a source string exceeding the destination buffer will overflow. Others, like strncpy, take an output buffer size but will not null terminate when exceeded, so the buffer size must be reduced by 1 by the caller and potentially manually terminates, which is too easy to not consider. But any C programmer with significant experience already knows these pitfalls. So it's not controversial.


You explained it in 2 lines that would make for great comments.

It seems useful to educate C programmers who don't have "significant experience" because, well, by definition, not every C programmer has "significant experience" and they are the ones who would benefit most from learning this.

It's not about being controversial: nobody's going to disagree with your explanation (it is what it is - it's not an opinion).


I don't remember how I learned this. But it happens pretty naturally working with C a lot and doing things like reading documentation and code. To be clear this is not elitism or something of the sort, I am not at all pessimistic that new people won't learn it. It just needs to happen in its due time.

So I would say... What good is a verbose comment to explain? Somebody who doesn't instantly understand why it's banned can read the big warnings on the manpage, or they can google it and land on good explanations on sites like Stack Overflow, or they can look at the function signatures and come up with a correct guess. Then boom. They know. And from there they can assess other interfaces and see if they suffer the same weakness.

This ability tends to come from simple exposure.


If every C programmer knew, then there would be no need to ban them


That's not true. People don't staunchly avoid situations that have pointless sharp edges. Putting up caution tape, despite everyone being aware of the danger, is very often a useful activity.


I actually wonder why they do it.

Linkers already warn for this stuff. Doing it this way requires every source file to include banned.h, which I would guess is done by including it from some other common header, but that's not fool proof either.


[flagged]



Does not apply. Nobody is born with C skills.

Competence can be measured objectively. Failure to understand array bounds is prima facie evidence of lack.


FYI that response insinuates that noname120 is not a C programmer...


[flagged]


Just open the docs. Seriously, first link in Google/DDG/Bing on topic of "<any c string routine> safety" will explain it to you. I am self-taught and I can see it.

Seriously, just looking at function arguments shows you, there is a problem under the hood. Reading docs also will point out.

It is open-source, read and learn from Git, Linux code and so on. If you do not have experience is sysdev or high security, just don't contribute there.

This is like saying, construction engineering is gatekeeping beginners, because you have to work for few years before you can start your own project.


I think you can. The whole point of being open source and using a project like cough git (since we're here), is that if you don't like it you can fork it. If you don't allow forks then I agree you aren't open source


Looks like they did.


That's gatekeeping. If a junior developer starts working on this project they might bother a more senior developer who has better things to do with their time, or inline the definition of one of the functions out of frustration.


Git, like the Linux kernel, is essential, basic infrastructure that everybody uses and that needs to remain reliable. It's not a volunteer training project for dilettantes and neophytes. Gatekeeping is a deliberate part of how these projects are run and the results speak for themselves.


Makes sense, thanks for taking the time to answer.


I'd be interested in knowing which part of the comment you don't agree with and why.



Thanks for posting these. My company has a similar list of banned C calls so I was already familiar with the rationale. The initial patch[0] in particular is eye opening though. 157 lines of rationale and exploration of alternatives for 22 lines of uncontroversial code.

[0]: https://github.com/git/git/commit/c8af66ab8ad7cd78557f0f9f5e...


There are. It was mentioned that they are available in the 'commit comments' on 'banned.h', but you can find them by blaming down to the original commit which introduced each line.

Here are some of them:

  - strcpy()'s rationale: https://github.com/git/git/commit/c8af66ab8ad7cd78557f0f9f5ef6a52fd46ee6dd

  - strcat()'s rationale: https://github.com/git/git/commit/1b11b64b815db62f93a04242e4aed5687a448748

  - strncat()'s rationale: https://github.com/git/git/commit/ace5707a803eda0f1dde3d776dc3729d3bc7759a

  - sprintf()'s rationale: https://github.com/git/git/commit/cc8fdaee1eeaf05d8dd55ff11f111b815f673c58


I believe they're all vulnerable to overflows and easy to exploit.

https://pointerless.wordpress.com/2012/02/26/strcpy-security...


> As many already know, strcpy(3) is a very unsafe function call in the C library (you should always use strncpy(3) unless you can be exactly certain about the number of bytes you’re going to copy beforehand).

strncpy pretends to be safe but fails to NULL-terminate and has other performance issues that generally lead me to believe it’s a security placebo. I would recommend using snprintf instead.


Whenever you write, you have to consider your audience.

I'd assume there are no comments explaining this because for the intended audience, it is self-evident. "People browsing Hacker News on a Sunday" was probably not the intended audience :-).


Just to make it clear. I'm aware of the fact that these functions have security implications. Nonetheless when such decisions are made, I believe it's important to explain them—at least briefly—directly in the code itself.


With the way they are banned, you see it only as a compiler warning on implicit declaration of function (or at linking stage, if warnings are suppressed; idk what exact build process is). Afair, there is no evidence of banned.h in an error trace. So you have to look it up by yourself anyway.

They could rename them as “sorry X is unsafe and creates too much trouble” maybe, but in today’s internet “X considered harmful” is a common search query that everyone in the field is expected to know.


Fair point. It seems that more and more projects use "-Werror" which causes the compilation to stop and show you exactly where the error comes from[1]. Second point is that searching for "is_a_banned_function" in the code directly points to this file.

[1] https://ideone.com/h0g8F1


Because Git's own developers know to use git-blame, git-show, and write good commit messages. Comments are not the only place to store meta-info about why code is the way it is.


Git is open source, it doesn't have its own developers, every developer is "gits own developers". They really should make this more explicit.


Of course Git has its own developers, just as the Linux kernel does. There's no moral, technical, legal, or organizational obligation for the maintainers of any open source project to accept or solicit changes from randos.


They probably don't want the source code to start resembling a Literate Programming example.


I don't think gatekeeping contributors and gatekeeping contributions are the same thing.


None are properly length checked. strncpy() is for "binary" strings and can omit the NUL in addition to excessive zero padding large buffers.


Collectively "because they're too easy to misuse (and even if used correctly, complicate audits)."


Read the manual!


Because it's git. It's obscure by design.


For context, this header file was introduced during the period of Microsoft's acquisition of GitHub.

Git's banned.h roughly approximates the banned functions according to Microsoft's Security Development Lifecycle: https://docs.microsoft.com/en-us/previous-versions/bb288454(...

It seems Microsoft once published their own banned.h, but this file is not readily available from the MSDN anymore.


What are you saying is the connection between GitHub's acquisition and this header? This code is from git, not GitHub.


I'm also confused reading that comment, but I suppose the intended link is that GitHub employees are significant git contributors?


Yes, I've seen other instances of Microsoft/GitHub making substantial Git contributions: https://devblogs.microsoft.com/devops/supercharging-the-git-...

Microsoft seemed to have been adopting Git already, so it may be that acquiring Github was merely part of a broader Git strategy that Microsoft has adopted.


FWIW, here's awesome-static-analysis > Programming Languages > C/C++: https://github.com/mre/awesome-static-analysis/blob/master/R...

These tools have lists of functions not to use. Most of them — at least the security-focused ones — likely also include: strcpy, strcat, strncpy, strncat, sprints, and vsprintf just like banned.h


Neat macro. I like the idea.

Suggestion: the macro should recommend a replacement.

“Sorry_sprintf_is_banned__use_FOO_instead”.


Delighted to see strncat there, it's a buffer overflow waiting to happen, especially as the name sounds like it's the 'safer version' where it takes a size argument, except it's not the size of the destination buffer, it's the remaining size, so unless you know this, it's wrong for every non empty buffer. strlcat (not standard) operates how you expect.


Whenever I review code and see strncat there, it's almost guaranteed to be a bug. This is for the simple reason that nobody remembers exactly what strncat does, and so misuse it.


What does Git use instead for copying strings? snprintf?

Edit: also interesting is a search for alloca: https://github.com/git/git/search?utf8=&q=alloca&type=


Git has an internal "strbuf" library https://github.com/git/git/blob/master/strbuf.h


Alternatively it also uses snprintf, though as a small internal variant (xsnprintf) which literally kills the program if the destination buffer is too small: https://github.com/git/git/blob/master/wrapper.c#L636-L650


If that header is ever accidentally included before any standard header, it's undefined behavior. :)


How? The preprocessor is ran before every thing else? It will overwrite the defintions in the header just like in the source file? You still would get a linking error. The only reason I couldn think is if the the stdlib headers also called undef for those functions. Moreover this is all during build not runtime.


According to C99, section 7.1.3:

    If the program declares or defines an identifier in a context in
    which it is reserved (other than as allowed by 7.1.4), or defines
    a reserved identifier as a macro name, the behavior is undefined.
It doesn't matter what the macro would expand to; simply defining the reserved identifier as a macro triggers undefined behavior. That doesn't mean it won't work, for some specific combination of standard headers, compiler, and program source code. It just isn't a strictly conforming program. A conforming implementation is allowed to flag this as an error, ignore the definition, or simply generate nonsense output.


Good to see the list is short and relatively sane compared to other "banned function" lists. Unfortunately, "too easy to misuse" is a slippery slope, and gets(), which is probably the best example of a function which is really broken by design, isn't on that list.

I'm surprised that "complicate audits" is given as a reason, because isn't this something static analysers (and I mean ones that actually analyse data/code flow, not dumb pattern-matchers) can easily detect? It's really just asking the question "how long is this/can this be" and following the data back to its origin(s).


As per ISO C11 standard, compliant compilers no longer need to support gets().


C aliasing rules can very quickly make "following the data back to its origin" very difficult.


Theoretically it's equivalent to the Halting Problem, but in practice I've not seen such difficulty; if aliasing does become a problem to the extent that following dataflow is difficult, I suspect there are already far deeper design flaws in the codebase.


If I had time, I'd make a library (perhaps musl-based?) that implements "libc but without all the parts you're not actually supposed to use."



That make me wonder how hard it would be to just strip the bad functions out of the library.


Well your std c library is generally just a collection of .o, .a, .lib files. So you could use gnu bin tools to remove symbols of your choosing. However good luck getting your system to build software because that would modify the stdlib for all things compiled on your system. You could also tell gcc or clang not to link to the std library and providr your own with just a few switches on invocation of the compiler


Other thought would be to put the dlls with unsafe functions in different location than the safer ones and throw up a warning when an application loads them.


Because of possible string buffer overflow?


Yes, and interestingly, because even when used correctly they "complicate audits". This is an interesting use of preprocessor macros, I'm strongly debating introducing something like this at work.


I'm not an expert in C, but then what's the issue with strncpy() or any "n" functions? It prevents overflow AFAIK. Also what is the alternative (memcpy?) and why?


strncpy() does not guarantee that the copied string would be terminated with a null byte ('\0'). For a call that looks like strncpy(dst, src, n), if there is no null byte in the first n bytes of src, the string copied to dst would also not contain a null byte.

Here is an example code to demonstrate the problem:

  #include <stdio.h>
  #include <string.h>

  int main()
  {
      char a[] = "01234567";
      strncpy(a, "foobar", 4);
      printf("%.8s\n", a);
      return 0;
  }
Here is the output:

  $ cc -std=c89 -Wall -Wextra -pedantic foo.c && ./a.out
  foob4567
A C89-conforming alternative I use is a macro like this that guarantees '\0'-termination as the first thing:

  #define strcp(a, b, c) (a[0] = '\0', strncat(a, b, c - 1))
An example from my code: https://github.com/susam/uncap/blob/master/uncap.c#L78-L87

Here is how to use it:

  #include <stdio.h>
  #include <string.h>

  #define strcp(a, b, c) (a[0] = '\0', strncat(a, b, c - 1))

  int main()
  {
      char a[] = "01234567";
      strcp(a, "foobar", 4);
      printf("%.8s\n", a);
      return 0;
  }
Here is the output:

  $ cc -std=c89 -Wall -Wextra -pedantic foo.c && ./a.out
  foo


Your macro refers to a twice; it might be better as a function.


What has referring to a variable twice got to do with whether it should be a macro or function?


If the strcp() macro is used with a function as the first argument (or an expression that has side-effects), it's not obvious at the call site (or, probably, intended by the programmer) that the argument will be evaluated twice.

E.g. suppose we write something like

    char* buf = strcp(((char*)malloc(4)), "foobar", 4);
expecting this to copy the beginning of the string "foobar" into a newly-allocated 4-byte buffer. Oops... this will actually call malloc twice (leaking the first buffer), and it won't have written the intended '\0' into the buffer that actually ends up getting used, so all bets are off...

If strcp were a function, its first argument would be evaluated just once, and it would work as intended.



strncat is also banned, though.


strncpy have the danger of not putting a null character at the end of the destination , if source is longer than num.

> Copies the first num characters of source to destination. If the end of the source C string (which is signaled by a null-character) is found before num characters have been copied, destination is padded with zeros until a total of num characters have been written to it.

No null-character is implicitly appended at the end of destination if source is longer than num. Thus, in this case, destination shall not be considered a null terminated C string (reading it as such would overflow).


The issue is that strncpy doesn't guarantee that the destination string is null-terminated after a copy, particularly in the case where n is shorter than the original string, so you could run into overflows if you then ran strlen() or similar against it.

Here's a good explanation from Raymond Chen: https://devblogs.microsoft.com/oldnewthing/?p=36773


strlcpy() is a nice alternative, it's from OpenBSD from 1996. Was later added to the other BSDs, Solaris and macOS. But, last I checked, Linux never added them. :(

https://www.sudo.ws/todd/papers/strlcpy.html


strlcpy is a terrible alternative. I have never seen it used correctly. It takes more and uglier code to use it correctly than to use strlen and strcpy. People who use it don't bother.

Finding use of strlcpy in a program is a red warning of sloppy code.

There are good reasons it was kept out of glibc for so long.


I agree that strlcpy is braindamaged, but why in God's Green Earth isn't there a sensible replacement in stdlib in 2019? This shouldn't be that hard, yet the best alternative is I think snprintf, which is just so ugly. For a long time it wasn't used because compiler support was spotty, but it's been 20 years now so that shouldn't be a major issue anymore.


There is, it's strncpy_s and it was added in C11. And I know, nobody uses C11: that's a security failure.


strncpy_s performs a lot of checks that might not be useful, though.


The (more-or-less) sensible replacement is std::string. Adoption has been spotty, for practical, ideological, and fetishistic reasons.

The first step is to compile the C program with a C++ compiler. Next, start making improvements. If you skip the first step, the ceiling on improvements is limited.


C++≠C, and making the jump isn't always feasible.


As I said.

And when it's not, the ceiling on improvements is limited.


It doesn’t ensure that there’s a null terminator in the resultant string. This may potentially lead to a later overrun when the copied string is read back.

(See also https://linux.die.net/man/3/strncpy )


It is also important to teach newbies on the pitfalls of certain libc functions. Too much spoon feeding creates a bigger mess in the long run.


Huh, I literally just read through this file a few minutes before this was posted because I was randomly browsing the git source code.


I guess they care too much about portability to use "pragma GCC poison"?


To rephrase adrianN's comment:

It wouldn't mean breaking things on compilers other than GCC. It would simply add compile-time checks when building with GCC, and would do nothing on other compilers.


Wanting code to compile on something other than gcc is not "caring too much", it's being responsible


For this particular case it wouldn't be a big deal. The code is still valid C, it's just that GCC makes sure that you don't use the poisoned values.


Responsible? What an hyperbolic statement. Could you enlighten me on what would be the significant negative consequences? I hope that your mental equation take into account the missed usefulness of GCC only attributes, flags and extensions by being clang compatible.


I don't know what to tell you. Do you think all compilers should implement all GCC proprietary extensions, basically creating a parallel C standard?

And that's not saying that GCC extensions are not useful, but they should not be considered the norm


Agreed. But I wonder if we already have a sort of parallel C standard already with POSIX (of course the parts about C, not about OS facilities).


> strcpy ... BANNED

But they cannot ban while (* p++ = * q++); now can they.


It's not meant to be a substitute for humans reviewing patches. It's just meant to catch things new contributors are likely to try to use, and give them feedback on it before they send the patch to the mailing list.


Truly curious as I don't code in C, or any other low level language. What is the problem with it? That there might not be enough memory allocated at *q?


Not enough memory at *p. The loop performs the exact same function that strcpy does, and some people just write it out by hand.


It doesn't stop until *p is \0, so yes


Puts(),gets() would make a good addition?


I think gets() especially should never be used (but puts() is OK), but I think strcpy() and all of the other stuff they mention there is actually OK to use if you are careful.


What's wrong with puts? GCC for example will even change a naive printf to puts when applicable.


You're right about puts. I guess I am just used to avoiding both.


K&R is turning in its grave.


You made me check that K is still alive!


Koshkin wrote K&R and not K|R.


There are persons working on git who abuse strcpy()? This sounds a bit like activism and not based on reality.


musl libc (an alternative libc) provides implementations of these functions that are memory-safe. (It only works on Linux though.)


Using musl as your standard library causes other problems, namely horrible Python performance and incompatibility with Valgrind.


musl is standards-compliant, so it sounds more like "using python or valgrind causes other problems,"


What's slow in particular and on what arch?


Alpine. Slow in particular? I don't have the resources to profile everything in our codebase and find the root issue, but others have already benchmarked it: https://superuser.com/questions/1219609/why-is-the-alpine-do...


Is is not possible to provide implementations of these functions which are both safe (meaning, within human ability to use correctly) and standards-compliant, as they either don't take any size or require leaving off the null terminator when the destination buffer is full.


If you have a critical need for memory safety then I think it's time to move on from C.

I also have a (I guess?) Slightly unpopular opinion, that C is actually really ugly. Subjective, but C code is usually prone to repeating itself a lot or abstracting in an unsafe manner


What functions does musl provide?


It provides everything that libc does.


So how does it solve the issues with these functions?


It doesn’t — I was wrong.

I was digging through its source code, and it turns out it isn’t memory-safe after all. Here’s the source code for stpcpy: https://github.com/ifduyue/musl/blob/master/src/string/stpcp...

If dest is too small in the function linked to above, musl’s stpcpy will happy cause a buffer overflow.

Don’t know how or from where I got the impression that musl was a “safer” alternative to libc.


The API of strcpy() is unsafe by design. The only way to make it safe is to not use it, and it's not musl's (nor glibc's) fault that this is the case.

musl is interesting for a variety of other reasons, but being more memory safe isn't really one of them.


strpy() is safe when used safely. That means that you've confirmed that the target is big enough.

For example, there's no way this can fail:

    char s[10];
    strcpy(s, "hello");
The problem is that it can also be used unsafely, and except in the simplest cases it's difficult or impossible to tell whether a give usage is safe.

By contrast, gets() (which isn't even in the language anymore) is inherently unsafe, because you can't control what will appear on standard input.




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: