I'm glad to see that setjmp() and longjmp() are still allowed.
I'm just kidding, by the way. For those C programmers who haven't encountered these before, they are a powerful way to do a "goto" in C. Powerful in the sense that you can jump anywhere, not limited to the same function. If it's used at all these days, it's used for exception handling. More info: https://en.wikipedia.org/wiki/Setjmp.h
The way I see the list is that it isn't meant to be exhaustive, it's meant to be functions that contributors are likely to try to use. A new contributor might not realize that they should be using Git's internal strbuf.h instead of the libc string functions. It's a way to give them feedback on that from their compiler, before they spend more time on the patch and send it to the mailing list.
It also seems to be mostly for functions which look innocent and are likely to be used, but are very easy to misuse.
setjmp/longjmp is not in this category: it might be argued to be easy to misuse, but it doesn't look innocent and it isn't likely to be used in normal codebases.
I had quite some fun playing around with it to obfuscate code. I did find that jumping to a different position in the same expression broke on a lot of compilers as soon as optimization was turned on.
Here's a simple program to output a string with the characters in an unusual order.
#include <stdio.h>
#include <setjmp.h>
#define J(x,y) (longjmp(x,y),0)
int main(int argc, char **argv)
{
jmp_buf j[011];
int x,X=0;
signed char* i = " eehcef lo u ef'oYurls " +021;
(x=setjmp(j[X++]))?J(j[x],*++i):((x=setjmp(j[X++]))?J(x[j],*--i):((x=
setjmp(j[X++]))?J(x[j],(putchar(*i),*i+=128)):(x=setjmp(j[++X]))?((x<
0)?J(1[j],5):J(2[j],3)):(x=setjmp(j[++X]))?((x>>X)?J(1[j],6):J(1[j],3
)):(x=setjmp(j[++X]))?((x&0x80)?J(1[j],7):J(1[j],5)):(x=setjmp(j[++X]
))?(((X<<5)&x)?J(1[j],4):J(2[j],X)):(x=setjmp(j[++X]))?((x&128)?J(0[j
],X):J(2[j],4)):(setjmp(j[3])>=0)?J(j[2],6):0));
return 0;
}
> the same expression broke on a lot of compilers as soon as optimization was turned on
C is not a "try it and see" language. If you stray from the spec, things will act differently and break between compilers. Section 7.13.2 specifies that only local variables declared volatile are guaranteed to have a determinate value after a longjmp.
ISO C says that any local variables that are not declared volatile whose values are mutated after the context is saved with setjmp then have indeterminate values when the context is restored with longjmp. So yes, that interacts with optimization. Any variable that is cached in a register may revert to its old value when that register is restored blindly from jmp_buf.
1. The first time through this giant ternary defines 9 different small subroutines. Each subroutine can consider the value of "x" to be their argument. X is used as an index into the jmp_buf (subroutine) array during setup, but after that it's always just equal to 8
2. Subroutines 0, 1, and 2 are special. Their argument is the index of another subroutine to jump to. They mutate the character buffer state and then jump to the specified subroutine
3. Before jumping, subroutine 0 moves back a character, subroutine 1 moves forward a character, and subroutine 2 prints the current character and marks it by setting its high bit. Once a character is marked, it will never be printed again
4. All other subroutines get the current character value as their argument
5. Subroutines 3, 4, 5, 6, 7, and 8 all check, in different ways, whether the current character has been marked: x < 0, x >> 8, x & 0x80, (8 << 5) & x, x & 128, x > 0
6. Execution starts at subroutine 3
So you can annotate the code as:
// Subroutine 0:
// move to next char;
// goto x;
(x=setjmp(j[X++])) ?
J(j[x], *++i) :
// Subroutine 1:
// move to prev char;
// goto x;
((x=setjmp(j[X++])) ?
J(x[j], *--i) :
// Subroutine 2:
// print and mark (set high bit on) curr char;
// goto x;
((x=setjmp(j[X++])) ?
J(x[j], (putchar(*i), *i+=128)) :
// Subroutine 4:
// if (curr char is marked) {
// // goto 1(5);
// move to prev char;
// goto 5;
// } else {
// print and mark curr char;
// goto 3;
// }
(x=setjmp(j[++X])) ?
((x<0) ? J(1[j],5) : J(2[j],3)) :
// Subroutine 5:
// if (curr char is marked) {
// // goto 1(6);
// move to prev char;
// goto 6;
// } else {
// // goto 1(3);
// move to prev char;
// goto 3;
// }
(x=setjmp(j[++X])) ?
((x >> X) ? J(1[j], 6) : J(1[j], 3)) :
// Subroutine 6:
// if (curr char is marked) {
// // goto 1(7);
// move to prev char;
// goto 7;
// } else {
// // goto 1(5);
// move to prev char;
// goto 5;
// }
(x=setjmp(j[++X])) ?
((x & 0x80) ? J(1[j], 7) : J(1[j], 5)) :
// Subroutine 7:
// if (curr char is marked) {
// // goto 1(4);
// move to prev char;
// goto 4;
// } else {
// print and mark curr char;
// goto 8;
// }
(x=setjmp(j[++X])) ?
(((X << 5) & x) ? J(1[j], 4) : J(2[j], X)) :
// Subroutine 8:
// if (curr char is marked) {
// // goto 0(8);
// move to next char;
// goto 8;
// } else {
// print and mark curr char;
// goto 4;
// }
(x=setjmp(j[++X])) ?
((x & 128) ? J(0[j], X) : J(2[j], 4)) :
// Subroutine 3:
// if (curr char is not marked) {
// print and mark curr char;
// goto 6;
// } else {
// exit;
// }
(setjmp(j[3])>=0) ?
J(j[2],6) : 0));
// X = 8
// goto 3(0);
A more straightforward way of writing this would be:
#define IS_CURR_CHAR_MARKED() (*i < 0)
#define PRINT_AND_MARK_CURR_CHAR() do { putchar(*i); *i += 128; } while (0)
int curr_sub = 3;
for (;;) {
    switch (curr_sub) {
    case 3:
        if (!IS_CURR_CHAR_MARKED()) {
            PRINT_AND_MARK_CURR_CHAR();
            curr_sub = 6;
        } else {
            return 0;
        }
        break;
    case 4:
        if (IS_CURR_CHAR_MARKED()) {
            i--;
            curr_sub = 5;
        } else {
            PRINT_AND_MARK_CURR_CHAR();
            curr_sub = 3;
        }
        break;
    case 5:
        if (IS_CURR_CHAR_MARKED()) {
            i--;
            curr_sub = 6;
        } else {
            i--;
            curr_sub = 3;
        }
        break;
    case 6:
        if (IS_CURR_CHAR_MARKED()) {
            i--;
            curr_sub = 7;
        } else {
            i--;
            curr_sub = 5;
        }
        break;
    case 7:
        if (IS_CURR_CHAR_MARKED()) {
            i--;
            curr_sub = 4;
        } else {
            PRINT_AND_MARK_CURR_CHAR();
            curr_sub = 8;
        }
        break;
    case 8:
        if (IS_CURR_CHAR_MARKED()) {
            i++;
            curr_sub = 8;
        } else {
            PRINT_AND_MARK_CURR_CHAR();
            curr_sub = 4;
        }
        break;
    }
}
That's quite impressive work. It becomes a bit clearer if you think of the string as a tape. Then a big case statement with Left,Right,Mark seems rather Turing complete :-)
> you can jump anywhere, not limited to the same function
Outside of the current function, yes. Anywhere, no.
Longjmp allows you to return to places you’ve already been and have saved with setjmp. Also attempting to longjmp to a function that has since returned is undefined behavior. Thus longjmp’s usual use case is like exceptions in other languages, going back up the call chain, skipping intermediate functions.
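For illustration, a minimal sketch of that pattern (all names here are made up, not from any real library):

#include <setjmp.h>
#include <stdio.h>

static jmp_buf error_env;

static void parse_field(const char *s)
{
    if (*s == '\0')
        longjmp(error_env, 1);   /* "throw": unwind straight back to the setjmp */
    /* ... normal work ... */
}

static void parse_record(const char *s)
{
    /* No error checks needed here; a longjmp skips this frame entirely. */
    parse_field(s);
}

int main(void)
{
    if (setjmp(error_env) != 0) {   /* "catch" */
        fprintf(stderr, "parse error\n");
        return 1;
    }
    parse_record("");               /* "try" */
    return 0;
}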
I think reasonable people can debate this one. setjmp()/longjmp() are very useful when working with libpng [1] and jpeglib [2], as you mentioned, as a crude exception handling mechanism. I have seen these functions used in production safely for that purpose. I can imagine other horrible uses for them though. Unlike things like strcpy() which are security-holes-by-design, setjmp()/longjmp() should be in a "carefully code review the exact way these are used" list rather than a banned list.
The trouble is how they compose with the expectations of code you don't write. It's just not normal in C to expect execution of your function to abort halfway through, and so memory leaks and broken state are almost guaranteed when they're mixed with the wrong third party library.
It might work great when you wrote it, but it might not even survive the next change to the codebase. And 10 years out, who is to know what programmers will come along and plug new stuff in without noticing the stack magic, etc.
edit: another aspect of the same issue is that it can be difficult or impossible to write generic longjmp-safe code for many kinds of tasks. In C++ you have stack unwinding to e.g. delete large temporary heap allocations during a complex operation, but no such facility exists in C.
longjmp will not interrupt third party code (well maybe if it uses callbacks and you use it from a callback).
setjmp/longjmp will almost certainly require tracking any resource use carefully so that resources can be released at the setjmp point, when execution gets back there.
I mostly use it for terminating recursive-descent parsing on error. Otherwise it gets tedious to check return values everywhere in hand-written code.
There is no reason libpng needed to use setjmp as part of its API, and that it does so is (I hope) widely regarded as a bad decision. It's "useful" only in so much as that is the API it defines, but there is no reason that this had to be the case, and simply returning an error code would have been very very much preferable.
These are used /everywhere/ in PostgreSQL, exactly for exception handling. The result isn't bad at all, but indeed, too powerful a tool for 99% of developers
I think it would have been even funnier to say that 99% of developers and 99% of drivers are better than the median, but I only figured that out thanks to your parent comment.
Depends which average it refers to: any of the means, or the median, or the mode can be "average" (most typically, though, it means the arithmetic mean), and it's only impossible in terms of the median.
Not a native speaker here, but I think he just meant that most programmers and drivers think they are better at their respective things than they really are.
It is just like letting a programmer build a house: it is just a really bad idea.
Goto requires the fail block to be in the same frame. Jumping out of a failing module this way can be a working error strategy for critical errors. You have to be aware that lots of stuff is prone to leak (which can be avoided by, e.g., using some memory arena) and that locks/mutexes/... might be in an unclear state, so code has to be written with that in mind.
Let's take the example of PostgreSQL, which was mentioned in this thread: a potential implementation (I have no knowledge of pgsql internals) might be that all memory for a single user session comes from a single arena and all locks and other handles are tied to the session handle; then it can be an efficient strategy to longjmp out on a critical error (e.g. an IO error) and clean up the arena as well as everything tied to the session, instead of bubbling the error up through all the frames.
It is also the only way for an xlib program to survive losing its connection to the X server. xlib will let you register a callback to handle loss of connection with XSetIOErrorHandler, but unconditionally calls exit() after your callback returns. Which is extremely anti-social behavior for a library, but that is what it does.
So the solution is to setjmp before each call to an X function that might notice the connection is closed, and longjmp out of the error callback. Ugly, but it works.
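Roughly this shape, if memory serves (an untested sketch; the real thing wants a setjmp around every batch of Xlib calls and careful cleanup of per-connection state):

#include <setjmp.h>
#include <stdio.h>
#include <X11/Xlib.h>

static jmp_buf connection_lost;

static int io_error_handler(Display *dpy)
{
    (void)dpy;
    /* Returning would let Xlib call exit(); jump out instead. */
    longjmp(connection_lost, 1);
}

int main(void)
{
    Display *dpy = XOpenDisplay(NULL);
    if (!dpy)
        return 1;
    XSetIOErrorHandler(io_error_handler);

    if (setjmp(connection_lost)) {
        fprintf(stderr, "lost the X connection, carrying on without it\n");
        return 0;
    }

    /* ... Xlib calls that might notice the dead connection ... */
    XFlush(dpy);
    XCloseDisplay(dpy);
    return 0;
}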
These days, you should really just use XCB, or even Wayland.
Well, I would not go that far. My coroutine hack relies on “unwarranted chumminess with the implementation” (http://c-faq.com/struct/structhack.html) and there are lots of incidental things that can break it. It is best treated as an example of the principles of thread switching and stack allocation.
I don't use setjmp() and longjmp() often, but sometimes I do; I use goto more often than those (but still not all the time). (I want goto in JavaScript too. I figured out an algorithm to do so, but have not implemented it.)
One use of setjmp/longjmp I have used is in ZORKMID, to deal with the debugger. At the beginning of the execute() function I have:
while(setjmp(exception_buffer));
(The semicolon is correct; the loop body is supposed to be empty.) Normally, if you enter the debugger and then exit the debugger to continue the execution, it will continue from where it left off, which is likely in the middle of the execution of some instruction, and may result in the same error again. But if you change the program counter before continuing execution, then the debugger will keep track of that and will use longjmp instead when continuing execution, therefore skipping the rest of the instruction that stopped.
Goto sounds like a generally bad idea in a more expressive language like Javascript. Can you give an example of some Javascript-with-goto code that would be better than vanilla Javascript?
I have sometimes encountered such a thing (but I do not remember it right now), but what I do know is that goto is more helpful in C than in JavaScript. That doesn't mean goto isn't worth anything in JavaScript, only that there are fewer circumstances where it is useful compared to the other stuff. (The same is true of macros.)
I remember setjmp() and longjmp() being the only way to do a STREAMS device driver I wrote 30 years ago. Don't know if things have changed since then. Ah, what memories.
I had real fun implementing coroutines, (almost zero-cost) exceptions, iterators, etc. with setjmp and longjmp. I think they're too useful to be banned outright (especially since C has no view of the stack), since there are cases where you need to either use them or inline assembly, but I agree that they're the evilest of the evilest std C functions.
Read David Tribble's "Go To Statement Considered Harmful:
A Retrospective" instead: http://david.tribble.com/text/goto.html . It contains the entire text of Dijkstra's paper and goes over its meaning in a modern context on a paragraph-by-paragraph basis. It's very well written and makes clear that the "encourages spaghetti code" argument, while true at the time, applied in an era where basic control flow constructs that we now take for granted did not exist in commonly used languages.
Thank you for the recommendation; that was a very interesting read. I find that I can't quite agree with the last "nontrivial goto" example, though. This version without explicit gotos is actually shorter, and IMHO not really any more difficult to follow:
int parse(void)
{
    Token tok;
    while ((tok = gettoken()) != END)
    {
        while (!shift(tok))
            if (!reduce(tok))
                return ERROR;
    }
    return ACCEPT;
}
I will say that any language without goto, or where the use of goto is discouraged, should at least provide guaranteed tail-call elimination as an alternative. In this case the gotos could easily be converted into loops, but not every algorithm is so accommodating.
Not using it can also create spaghetti code. For example, the "alternatives" often suggested for exiting a multi-tiered loop (which often add additional boilerplate, additional variables, etc. that all need to be maintained/bug free).
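The classic case is the multi-level loop exit, where a single goto is arguably the cleanest option (a contrived sketch):

#include <stdio.h>

int main(void)
{
    int grid[3][4] = { {0, 0, 0, 0}, {0, 7, 0, 0}, {0, 0, 0, 0} };

    for (int i = 0; i < 3; i++)
        for (int j = 0; j < 4; j++)
            if (grid[i][j] == 7)
                goto found;   /* exits both loops at once */

    puts("not found");
    return 1;

found:
    puts("found");
    return 0;
}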
Indeed. Despite not having actual goto, Java went out of their way to have labeled loops and the "break label_name" syntax in order to avoid this awkwardness.
I see zero use of those in the codebase, so perhaps they should be banned too. Or maybe they’re not on this list because nobody was using them already?
I have worked in large code bases (RSA NetWitness, Oracle Database, etc.) where setjmp/longjmp were used to create macros for exception handling using TRY-CATCH-ENDTRY blocks in C.
Personally, I feel like wrapping exception handlers in macros makes it easy for people to do unsafe things with it like dynamically allocating memory or calling system functions that can’t handle it correctly.
I see a lot of comments to the effect of "shouldn't XYZ also be banned". The answer is that we're not necessarily trying to be exhaustive. The point is to flag common errors before we even hit review, so we add new functions mostly when somebody tries to misuse them. I don't recall anybody trying to abuse longjmp() in Git's codebase yet (and no, that's not a challenge).
Peff, for the people asking in the thread, is there a place where correct alternatives are suggested or demonstrated?
I know there are a few different places that talk about how to use git's internal machinery, but not sure if any are specific to these banned functions.
The original commit mentions git's strbuf API[0] and its xsnprintf, a variant of snprintf which asserts that the destination buffer was big enough[1] (rather than just returning truncation information).
For other codebases, snprintf is the usual recommendation, and careful straight buffer manipulation (mem*) iff performance is a concern.
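For example, the usual snprintf idiom looks something like this (the helper name is made up): it always NUL-terminates, and the return value tells you whether the output was truncated.

#include <stdio.h>

/* Illustrative helper: 0 on success, -1 on error or truncation. */
static int join_path(char *dst, size_t dstsize, const char *dir, const char *file)
{
    int n = snprintf(dst, dstsize, "%s/%s", dir, file);
    if (n < 0 || (size_t)n >= dstsize)
        return -1;
    return 0;
}

int main(void)
{
    char buf[16];
    if (join_path(buf, sizeof(buf), "/tmp", "example.txt") < 0)
        fprintf(stderr, "path too long\n");
    else
        printf("%s\n", buf);
    return 0;
}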
The commit messages that add them to banned.h discuss alternatives, though most of the explanations are Git-specific and assume you'll look elsewhere to figure out how to actually use those alternatives.
There is talk about npm/rust having massive dependency trees (and they do) because it makes taking on dependencies too easy. But I feel like C is on the opposite side of the spectrum, where managing dependencies is difficult, so every C code base is rolling its own version of everything.
C also has an expansive standard library but it hasn’t offset re-invention of that stdlib in the ecosystem. I just see those implementations trapped inside code bases that don’t export them in a consumable way for other projects...
Personally I love writing C but rarely do because dependency management makes innovation in those code bases so time consuming for me.
It's bad that C devs have rolled their own awful, buggy custom solutions.
However, here's the upshot-- that awful, buggy, custom solution is guaranteed to remain compatible with the codebase that contains it. You won't find a single instance where such a dev "improved", "refactored", "optimized", or "modernized" that awful, buggy code in a way that stopped the rest of the code from working and then shipped that broken blob of junk. And if you did find that, there's no way they'd convince the rest of the devs that this is a useful thing to do in the interest of "moving forward" or whatever.
On the other hand you'll find zillions of dollars spent on systems to keep dependency management tools from causing exactly that problem.
Just to give a real-world example-- a user reported that an abandoned C++ plugin wasn't working. We did a race:
1. Three devs try to get a single C++ dependency of that plugin to work cross platform (OSX, Windows, Linux).
2. I tried to port the C++ plugin to a C plugin with no deps.
By the time I ported, tested, and shipped, those three devs were tracking down a bug with the build script of the C++ dependency on Windows.
Edit: granted the plugin itself is only about 2000 lines of code.
Regarding remaining compatible, I would argue that this is why modern languages implement lock files and version pinning. If you don't like a particular change but need some extra functionality or a security fix you can fork the relevant libraries or extend the functionality with an extra self written library.
It's not a guarantee that it will work; it's not that rare to encounter a minor/patch version of a library that introduced some subtle incompatible change (usually unbeknownst to the author).
That's literally the purpose of the lock file. The lock file locks the entire dependency tree. So unless you're bumping versions or you fail to save the lock file, the entire dependency tree's versions will remain the same.
>some subtle incompatible change
In statically typed languages this normally isn't an issue. Of course I'm aware that logic can also be changed, but in that case it's up to you to write appropriate tests (or just don't bump the versions of your libraries without a good reason).
It still sucks. I have a few Haskell projects from a few years ago that I wanted to compile on another system. So I froze the dependencies, moved the project, and tried to build. Solver failed. I gave up.
As far as I'm aware, Haskell's dependency system doesn't have a lock file. That's the key part to keeping things stable. Version pinning alone isn't enough.
That was very true for a long time. But now with cmake and a package manager like hunter, it became quite easy and pleasant to work with dependencies. ExternalProject_Add() may also work.
Creating a lightweight library would make C programming much more productive, and I'm sure many have tried, but I wonder why none of them sees widespread application, and everyone instead writes some half-baked, ad-hoc helper functions.
One reason, I guess, is the diverse range of applications of C; another reason is the lack of advanced features like generics and templates.
But it would still be useful to create a simple library, intended for Unix-like platforms (beyond the BSD extension of stdlib). So it's more of a cultural problem?
If I had to guess why there is no such lightweight library, it is because when the library API somehow doesn't provide what you want, you cannot hack the library itself easily. If you import the source code you may have trouble building it, and maintaining such a "patch" is not a lot of fun.
Yes. I guess the fact that C doesn't have advanced features like generics and templates has probably made the problem worse, making it difficult to create a "drop-in" API.
There is a powerful C library that covers nearly all use cases, you merely need to write a configuration file for it. Perhaps you’ve already heard of it? Tcl.
Thanks for linking this, it's always good to see how very strong engineers solve problems. Never in a million years would this approach have occurred to me. For the interested, the README in the linked repo is done well and clearly explains the approach itself as well as pros and cons. This also reminded me that I've been meaning to watch antirez' "Writing System Software"[0] videos on YouTube for quite a while now, so thanks for that too.
I'd rather make it explicit that the returned/passed value is something other than a regular zero-terminated string. Otherwise you can easily make a mess, if you pass a zero-terminated string to a function that expects some special bytes in the memory before the pointer.
I do this, but I like to just return the actual struct and also avoid typedefing pointer semantics. Can double as a dynamic array too. It's true. We all have string libraries...
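Something like this minimal sketch, I'd guess; the names are made up, and real versions (git's strbuf, sds, etc.) add growth, append and formatting helpers on top:

#include <stdlib.h>
#include <string.h>

/* Explicitly not a plain NUL-terminated char *: length and capacity
 * travel with the pointer, and the struct is passed around by value. */
struct str {
    char  *buf;
    size_t len;
    size_t cap;
};

static struct str str_from_cstr(const char *s)
{
    struct str r = {0};
    r.len = strlen(s);
    r.cap = r.len + 1;
    r.buf = malloc(r.cap);
    if (r.buf)
        memcpy(r.buf, s, r.len + 1);   /* keep a terminator for interop */
    else
        r.len = r.cap = 0;
    return r;
}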
There are far too many. APR is a solid choice -- very portable, well-tested, and fast. It uses memory pools a lot, which work well for many styles but not everything.
IIRC, StrSafe.h - and much more - came from the development of the Microsoft Security Development Lifecycle (SDL).
Code Red was the wake-up call for Microsoft, and in February 2002, based on a memo from Bill Gates that first coined the phrase "Trustworthy Computing," Microsoft shut down Windows development for the first time ever to get a handle on the security issues the products were facing.
Additionally we get to enjoy bounds-checked arrays (std::...) and iterators on debug builds, with the possibility to selectively enable them in release mode.
While Windows by all means still has its security issues, the toolchain is much more security oriented than most FOSS alternatives thanks to the Windows XP wake up call.
Android and ChromeOS are probably the most locked-down alternatives in the FOSS space.
That's how secure boot should work. Replacing the root of trust should require serious physical access that can be tamper-evident. And yeah, out of the box, the trust is with the vendor — who else would be trusted in a device that doesn't have an owner yet?
Developer mode lets you boot another OS, but leaves you open to accidentally wiping your drive if you hit the wrong button on boot. Replacing the firmware is a good idea; you do need to remove the write-protect screw, but that really makes sense as some measure of physical security.
I don't see a huge problem with strtok as long as it isn't a multi threaded program. It is not in gets territory which is literally impossible to use correctly.
Synchronous reentry is a problem. Say some loop using strtok calls a function... how do you know that that function doesn't use strtok (or call another function that does...).
Any use of non-const static variables in general has this problem, and strtok is just one example.
> Say some loop using strtok calls a function... how do you know that that function doesn't use strtok (or call another function that does...).
I’m constantly surprised that C isn’t specified in such a way that lack-of-reentrancy can be determined at compile time. You’d “just” need a symbol table, sort of like the debug symbol table, in each compiled object, holding for each visible symbol the set of function calls marked `__no_reenter` that are predominated by that symbol in control flow. (Yes, some functions do computed jumps, like with longjmp. Just implicitly label those functions __no_reenter unless the programmer explicitly labels them __reenter!)
The problem isn't exactly reentry. It's that the function, having already exited, has modified global state, and it needs the same state to be there on the next call. If another call is interleaved (another callee with its own strtok loop) then the state is overwritten with the inner loop, and your own next call won't have its expected contents.
Whereas with a term like reentry I think of two strtok frames on the stack at the same time, which is not possible.
I think it's a bit easier to conceptualize with a function that produces a simple return value in a static buffer. Take getpwnam(). It fills a global structure with info about a given user. So you may keep a pointer to that on your stack. Then you call some other function foo(). foo looks up another user. Suddenly you can't rely on that other call to getpwnam() not having overwritten the data in the result you got back the first time.
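The same shape of bug with strtok itself, as a small sketch; the inner loop silently resets the one hidden save pointer that the outer loop depends on (strtok_r/strtok_s avoid this by making the state explicit):

#include <stdio.h>
#include <string.h>

static void print_fields(char *csv)
{
    /* This resets strtok's single hidden save pointer... */
    for (char *f = strtok(csv, ","); f; f = strtok(NULL, ","))
        printf("  field: %s\n", f);
}

int main(void)
{
    char input[] = "a,b;c,d";

    /* ...so this outer loop loses its place after the first record
     * and "c,d" is never processed. */
    for (char *rec = strtok(input, ";"); rec; rec = strtok(NULL, ";")) {
        printf("record: %s\n", rec);
        print_fields(rec);
    }
    return 0;
}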
> how do you know that that function doesn't use strtok (or call another function that does...).
The same way you know a lot of other things about the codebase: By reading and understanding it. Given what strtok() is used for, it's almost always going to be working on a single context at any one time.
First, many C projects are libraries that are written for reuse in other programs. So "the codebase" is an arbitrary, unknowable thing. We don't have the ability to look at all the other code that will eventually coexist in the same process.
Second, code changes over time. Assuming you inspect 100 lines of code to ensure that they don't call strtok() ... do you sprinkle comments all over as a warning to future programmers not to add any calls to strtok()?
Finally, assuming we could evaluate all of the code in a program at once, it is impractical to build non-trivial programs that way. Instead, we rely on contracts. The idea is that if a caller conforms to its end of the contract, the callee will conform to its end. This allows us to reason about correctness of some code without reading all the other code in the universe and thinking through all the potential code paths. Any function using strtok() would be presenting a rather ugly contract clause "this function destroys strtok's static state". Even worse, such a clause would be contagious, infecting any client of that function, and so on.
You can safely use gets on trusted input. For example, your program might create a pipe or socketpair to communicate between the processes that result from a call to fork. The stdout of one process feeds into the stdin of the other process.
Unless you have an insane gets implementation that writes the bytes backwards, you can also safely use it on untrusted input. Simply place a guard page at the end of the buffer. You can do the call from a forked child that you let die, taking notice of the problem in the parent. On systems without threading (because of locks in libc) you can do a longjmp to recover.
It sucks that there's no scanf("%.*s"), where you give the buffer length as a separate argument. The fact that it's in a string literal also makes using a compile-time constant really ugly.
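Right, the usual workaround is to paste the width into the format string with the preprocessor, which is about as ugly as it sounds (a sketch; the names are made up):

#include <stdio.h>

#define NAME_LEN 63
#define STR_(x) #x
#define STR(x)  STR_(x)

int main(void)
{
    char name[NAME_LEN + 1];   /* +1 for the terminator scanf writes */

    /* Expands to scanf("%63s", name); the width can't be passed as an argument. */
    if (scanf("%" STR(NAME_LEN) "s", name) == 1)
        printf("hello, %s\n", name);
    return 0;
}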
Looks like the second vsprintf definition should be "BANNED(vsprintf)", not "BANNED(sprintf)". Not that it matters much, as it only affects the error message.
Aside: GitHub’s refusal to provide a feature to close pull requests for mirrors is utterly ridiculous since it requires that people (or bots) go around and disable pull requests and point people to the patch submission process.
The word "standard" has more than one meaning. It's not standard by the definition of standard that you chose, but you can use your context cues to realize that it's not the definition of standard that the OP meant.
These shouldn't be used either (_s are available in C11, _l tend to be locale-parameter thread-safety, and _r have sizes/state for bounds/thread-safety):
- strtok -> strtok_r / strtok_s
- asctime -> asctime_r / asctime_s
- gmtime -> gmtime_r / gmtime_s
- localtime -> localtime_r / localtime_s
- ctime -> ctime_r / ctime_s
- dirname -> dirname_r
- basename -> basename_r
- devname -> devname_r
- readdir -> readdir_r
- ttyname -> ttyname_r
- gamma -> gamma_r
- lgamma -> lgamma_r
- lgammaf -> lgammaf_r
- lgammal -> lgammal_r
- atoi -> atoi_l
- atof -> atof_l
- atol -> atol_l
- atoll -> atoll_l
- gets -> gets_s
- scanf -> scanf_l / scanf_s
- fscanf -> fscanf_l / fscanf_s
- sscanf -> sscanf_l / sscanf_s
- tmpfile -> tmpfile_s
- fopen -> fopen_s
- getenv -> getenv_s
- strdup -> strndup
- strcmp -> strncmp
- strlen -> strnlen
- (Multibyte/wide conversion functions without mbstate_t parameter)
- calloc, realloc -> reallocarray (for arrays of non-byte items)
- memmove -> memmove_s
- strncat -> strncat_s
- strncpy -> strncpy_s
- srand / rand -> rand_r
There are others that are platform-specific. Thread-safety, internal mutable state (not thread-safe), and buffer-overflows are the primary concerns that aren't necessarily applicable in all situations.
Back in my individual contributor days I got into serious trouble because of this. Where can I find that it has officially been deprecated? I tried it and I can still use it without getting any warnings.
What surprises me about C developers is that C has existed for probably 40 years but they still don't have proper strings (not just pointers). In many cases there is no large performance penalty for storing the string length, and checking it, but they still use pointers, or a separate pair of variables for pointer and buffer size, instead of a single object.
Because they have a security culture in which you only get errors due to "holding it wrong", or "every good programmer does it right", despite evidence to the contrary.
Add to that the fact that early C compilers generated quite lousy code on 8- and 16-bit computers, and there is this idea of micro-optimizing each line of code as it is being written, without any profiling feedback on whether it actually matters; rather, a cargo cult of how writing code like X is faster than Y.
For example, outside 3D rendering and audio software processing, I never saw a visible impact (to the end user) of bounds checking.
1) Totally optional from the ISO C point of view; not all C compilers do it, and it usually requires static analysis or specific compiler warnings to be enabled, which not everyone does.
2) Solaris does it perfectly fine on SPARC thanks to tagged memory (ADI). Which Google in collaboration with ARM will make mandatory on future Android releases as well. [0]
3) It hardly mattered in MS-DOS and Amiga LOB applications developed across Turbo Basic, Quick Basic, GFA Basic, Turbo Pascal, Clipper, so it matters even less nowadays unless we are speaking about PIC-like hardware.
Regarding bounds checking I usually refer to Hoare's turing award speech, back in 1981:
"Many years later we asked our customers whether they wished us to provide an option to switch off these checks in the interests of efficiency on production runs. Unanimously, they urged us not to--they already knew how frequently subscript errors occur on production runs where failure to detect them could be disastrous. I note with fear and horror that even in 1980, language designers and users have not learned this lesson. In any respectable branch of engineering, failure to observe such elementary precautions would have long been against the law."
Or for that matter, the DoD Multics's B2 security evaluation[1], with several remarks how PL/I made the system safer with its string handling, pointer integrity validation and bounds checking.
As noted, yes there are niche cases where bounds checking does have an impact, but for a large spectrum of code that gets daily written it isn't the case.
1) It's not standardized but it's very simple to implement nevertheless.
It is my understanding that 2) doesn't have anything to do with C (so given hardware support, you can have it for free whether working in C or not. I think this kind of invalidates your point).
And also, 2) is not perfect, only probabilistic (a source I found says 94% likelihood of detecting OOB).
And it works only for the most basic situations where you use the system allocator and never subpartition these allocations. So, beyond these simple cases that we can get for free without any involvement from C semantics, I still maintain that it is unclear how to devise a reasonable and useful system for bounds checking that can be added to the C memory model. We can make up an annotation syntax to cover many of the simpler cases, but these are hardly better than plain assertions (which I regularly use).
I doubt Hoare had a good idea to add general bounds checking on a low-level language like C, otherwise that would be standardized by now.
Pointer validation via hardware memory tagging has everything to do with C, because it is only due to C's shortcomings that millions of research dollars keep being spent to try to make it work.
In what concerns Solaris, and the requirements for future Android with ARM memory tagging, the system allocator is all there is, at least from official support point of view.
Hoare hardly needed to pursue such an endeavour, because all systems programming languages derived from Algol, like ESPOL, NEWP, PL/I, PL/S, PL/8, BLISS, ..., were sane regarding bounds checking.
Hardware validation of memory is the only way to tame C, the alternative is to just dump the language, because as proven by ISO C11 dropping Annex K, very few actually care about making the language safe.
However, given UNIX's dependency on C, it is also quite clear to me that C will be around in the next couple of decades, long after I am gone, and so will the business opportunities to create companies on top of CVE exploits due to memory corruption bugs.
> Pointer validation via hardware memory tagging has everything to do with C, because it is only due to C's shortcomings that millions of research dollars keep being spent to try to make it work.
So, to restate, Solaris does not do it "perfectly fine". Thanks for making my point.
> In what concerns Solaris, and the requirements for future Android with ARM memory tagging, the system allocator is all there is, at least from official support point of view.
That's a pity, because if you're not doing your own allocators then you'll have to accept lock contention, extreme memory overhead (for smaller allocations, say <= 64 bytes), and you'll need to match every little allocation with a deallocation, instead of making e.g. custom pool allocators.
You're just not going to write a large infrastructure (i.e. performance-oriented) system in this way.
There are countless libraries that add higher-level string functions, and no end of higher-level languages. C fills the niche where you want something higher level than assembly but lower level than Perl, Ruby, Python, etc. Sometimes you want or need to manage your own memory. Arduino is a good contemporary example.
Exactly. The C philosophy is to use libraries and not put things like a better string library in the core functions of the language. It keeps the language relatively clean and easy to understand, unlike C++.
So the C philosophy is to provide a bad standard string library that is not thread-safe and makes it impossible to write programs without remote code execution vulnerabilities, rather than have a better library with exactly the same functionality but with more security and safety "bloat"?
/snark
Yes, C used to have a cavalier approach towards security in the past. But if you're asking why the standard library is not fixed yet, I think that instead of pinning it on some lofty philosophy, it's safer to say that good C developers realized long ago that the original C strings were a mistake. Most big C projects define their own string functions and often their own length-prefixed string types. The C standard committee just gave up on fixing this issue in the standard library, but this is not due to philosophy; it is because of the impracticality of forcing a standard solution at this stage.
Bonus points when the libraries are so incompatible among themselves that they require extra conversion steps, and then just get dropped 'cause "mind the performance".
When compatibility is needed, APIs should consume pointer+length pairs. It doesn't get better than that, in ANY language, in terms of simplicity and modularity.
Libraries like Qt with extreme lock-in are at the other end of the spectrum. If it works for them, that's nice. Doesn't work for me. I don't think using a string library is in the spirit of C programming.
Which means either pointer+length get packed into their own structure to avoid mix up errors, thus requiring conversion function calls across libraries, or they are given separately manually, thus opening the door to the copy-paste mistakes from pointer+bad length that they are supposed to protect against.
> Which means either pointer+length get packed into their own structure to avoid mix up errors, thus requiring conversion function calls across libraries, or they are given separately manually, thus opening the door to the copy-paste mistakes from pointer+bad length that they are supposed to protect against.
No, we were talking about compatibility/modularity. You're shifting the topic.
> Qt is as locked-in as any LPGL 3 FOSS project is.
I'm not speaking about licenses lock-in, but about lock-in from an engineering point of view. Are you aware of any significant Qt projects that don't have "Q" all over their codebase?
(And yes, by contrast to GPL, I believe you can use LGPL libraries without suffering a terrible amount of (license) lock-in)
Compatibility/modularity doesn't happen in the air, rather in written code.
So either one passes structs around, and somehow they need to be compatible.
Or one passes pointer + length as two separate variables, with the consequence of having to keep in sync two variables that, from the compiler's point of view, are unrelated.
The thing you are pointing to could have its length prefixed. I've always assumed that this isn't the case because nobody could commit to the size of the length prefix. Is it string8, string16, string32, or string64? Using a sentinel value to denote length is less opinionated and more portable.
Unfortunately, it's often algorithmically less efficient and more error prone, so everybody ends up replicating what git is doing with 'strbuf' but in slightly incompatible ways. The effect makes dealing with strings in C unnecessarily unpleasant.
It also fills all remaining bytes (if any) with nuls. ‘N’ in its name is not the same ‘n’ as in snprintf etc. It could be named fldcpy, as it works with 0-padded fixed-width fields rather than 0-terminated strings.
But still not safe, thanks to the return value. Imagine the case where you are using strlcpy because you can't be sure if the source buffer is properly null terminated.
If your buffer isn't a NUL-terminated, then don't call a function that is only defined for NUL-terminated buffers. It's as simple as that.
I'm baffled by how some people claim strlcpy() is 'broken' or 'not safe' because it doesn't handle non-NUL-terminated inputs; the exact same thing applies to just about any function in the C standard library that takes strings as input. Are functions like strchr(), fopen(), printf(), strstr(), setenv() 'broken' as well?
> If your buffer isn't a NUL-terminated, then don't call a function that is only defined for NUL-terminated buffers.
It's not that people want to pass strncpy source buffers that lack NUL termination, it's that strncpy in certain situations will not NUL terminate its results.
Speaking of strlcpy, it thankfully doesn't have the problem that strncpy does. However strlcpy is not in the C standard or in POSIX, so can't be used portably. In C99 snprintf is a better choice.
Yes, because cutting off the trailing NULL is a threat vector every C programmer needs to be conscious of whenever they're dealing with strings.
But it's also a problem because maybe the programmer is using strlcpy to grab the first five lines of a 10TB memory mapped file. If you're not thinking about the implications of that return value it can be a real surprise.
A C programmer reading this list knows exactly why they are there. It is not at all a controversial list.
Edit: I guess some downvoter doesn't believe me, but it continues to be true. They are all string functions. Most of them do not take an output buffer size, so a source string exceeding the destination buffer will overflow. Others, like strncpy, take an output buffer size but will not null-terminate when it is exceeded, so the buffer size must be reduced by 1 by the caller and the string potentially terminated manually, which is too easy to not consider. But any C programmer with significant experience already knows these pitfalls. So it's not controversial.
You explained it in 2 lines that would make for great comments.
It seems useful to educate C programmers who don't have "significant experience" because, well, by definition, not every C programmer has "significant experience" and they are the ones who would benefit most from learning this.
It's not about being controversial: nobody's going to disagree with your explanation (it is what it is - it's not an opinion).
I don't remember how I learned this. But it happens pretty naturally when working with C a lot and doing things like reading documentation and code. To be clear, this is not elitism or something of the sort; I am not at all pessimistic about new people learning it. It just needs to happen in its due time.
So I would say... What good is a verbose comment to explain? Somebody who doesn't instantly understand why it's banned can read the big warnings on the manpage, or they can google it and land on good explanations on sites like Stack Overflow, or they can look at the function signatures and come up with a correct guess. Then boom. They know. And from there they can assess other interfaces and see if they suffer the same weakness.
That's not true. People don't staunchly avoid situations that have pointless sharp edges. Putting up caution tape, despite everyone being aware of the danger, is very often a useful activity.
Linkers already warn for this stuff. Doing it this way requires every source file to include banned.h, which I would guess is done by including it from some other common header, but that's not foolproof either.
Just open the docs. Seriously, the first link in Google/DDG/Bing on the topic of "<any C string routine> safety" will explain it to you. I am self-taught and I can see it.
Seriously, just looking at the function arguments shows you there is a problem under the hood. Reading the docs will also point it out.
It is open source; read and learn from Git, Linux code and so on. If you do not have experience in sysdev or high security, just don't contribute there.
This is like saying construction engineering is gatekeeping beginners because you have to work for a few years before you can start your own project.
I think you can. The whole point of being open source and using a project like cough git (since we're here), is that if you don't like it you can fork it. If you don't allow forks then I agree you aren't open source
That's gatekeeping. If a junior developer starts working on this project they might bother a more senior developer who has better things to do with their time, or inline the definition of one of the functions out of frustration.
Git, like the Linux kernel, is essential, basic infrastructure that everybody uses and that needs to remain reliable. It's not a volunteer training project for dilettantes and neophytes. Gatekeeping is a deliberate part of how these projects are run and the results speak for themselves.
Thanks for posting these. My company has a similar list of banned C calls so I was already familiar with the rationale. The initial patch[0] in particular is eye opening though. 157 lines of rationale and exploration of alternatives for 22 lines of uncontroversial code.
There are. It was mentioned that they are available in the 'commit comments' on 'banned.h', but you can find them by blaming down to the original commit which introduced each line.
> As many already know, strcpy(3) is a very unsafe function call in the C library (you should always use strncpy(3) unless you can be exactly certain about the number of bytes you’re going to copy beforehand).
strncpy pretends to be safe but fails to NULL-terminate and has other performance issues that generally lead me to believe it’s a security placebo. I would recommend using snprintf instead.
Whenever you write, you have to consider your audience.
I'd assume there are no comments explaining this because for the intended audience, it is self-evident. "People browsing Hacker News on a Sunday" was probably not the intended audience :-).
Just to make it clear. I'm aware of the fact that these functions have security implications. Nonetheless when such decisions are made, I believe it's important to explain them—at least briefly—directly in the code itself.
With the way they are banned, you see it only as a compiler warning about an implicit declaration of a function (or at the linking stage, if warnings are suppressed; idk what the exact build process is). AFAIR, there is no evidence of banned.h in an error trace. So you have to look it up by yourself anyway.
They could rename them as “sorry X is unsafe and creates too much trouble” maybe, but in today’s internet “X considered harmful” is a common search query that everyone in the field is expected to know.
Fair point. It seems that more and more projects use "-Werror" which causes the compilation to stop and show you exactly where the error comes from[1]. Second point is that searching for "is_a_banned_function" in the code directly points to this file.
Because Git's own developers know to use git-blame, git-show, and write good commit messages. Comments are not the only place to store meta-info about why code is the way it is.
Of course Git has its own developers, just as the Linux kernel does. There's no moral, technical, legal, or organizational obligation for the maintainers of any open source project to accept or solicit changes from randos.
Microsoft seemed to have been adopting Git already, so it may be that acquiring Github was merely part of a broader Git strategy that Microsoft has adopted.
These tools have lists of functions not to use. Most of them — at least the security-focused ones — likely also include: strcpy, strcat, strncpy, strncat, sprintf, and vsprintf, just like banned.h
Delighted to see strncat there; it's a buffer overflow waiting to happen, especially as the name sounds like it's the 'safer version' where it takes a size argument, except it's not the size of the destination buffer, it's the remaining size, so unless you know this, it's wrong for every non-empty buffer. strlcat (not standard) operates how you expect.
Whenever I review code and see strncat there, it's almost guaranteed to be a bug. This is for the simple reason that nobody remembers exactly what strncat does, and so misuse it.
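To spell it out with a sketch: the third argument is the space remaining in the destination, not the destination's size, and strncat always writes one extra byte for the terminator.

#include <stdio.h>
#include <string.h>

int main(void)
{
    char buf[16] = "Hello, ";
    const char *name = "somebody with a long name";

    /* The common mistake: strncat(buf, name, sizeof(buf)); -- that's up to
     * 16 appended bytes plus a '\0' on top of what's already in buf. */

    /* What strncat actually wants: remaining space, minus 1 for the '\0'. */
    strncat(buf, name, sizeof(buf) - strlen(buf) - 1);

    printf("%s\n", buf);   /* truncated, but in bounds and terminated */
    return 0;
}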
How? The preprocessor is run before everything else? It will overwrite the definitions in the header just like in the source file? You would still get a linking error. The only reason I could think of is if the stdlib headers also called undef for those functions. Moreover, this is all during build, not runtime.
If the program declares or defines an identifier in a context in
which it is reserved (other than as allowed by 7.1.4), or defines
a reserved identifier as a macro name, the behavior is undefined.
It doesn't matter what the macro would expand to; simply defining the reserved identifier as a macro triggers undefined behavior. That doesn't mean it won't work, for some specific combination of standard headers, compiler, and program source code. It just isn't a strictly conforming program. A conforming implementation is allowed to flag this as an error, ignore the definition, or simply generate nonsense output.
Good to see the list is short and relatively sane compared to other "banned function" lists. Unfortunately, "too easy to misuse" is a slippery slope, and gets(), which is probably the best example of a function which is really broken by design, isn't on that list.
I'm surprised that "complicate audits" is given as a reason, because isn't this something static analysers (and I mean ones that actually analyse data/code flow, not dumb pattern-matchers) can easily detect? It's really just asking the question "how long is this/can this be" and following the data back to its origin(s).
Theoretically it's equivalent to the Halting Problem, but in practice I've not seen such difficulty; if aliasing does become a problem to the extent that following dataflow is difficult, I suspect there are already far deeper design flaws in the codebase.
Well, your std C library is generally just a collection of .o, .a, .lib files. So you could use GNU binutils to remove symbols of your choosing. However, good luck getting your system to build software, because that would modify the stdlib for all things compiled on your system. You could also tell gcc or clang not to link to the std library and provide your own, with just a few switches on invocation of the compiler.
Another thought would be to put the DLLs with unsafe functions in a different location than the safer ones and throw up a warning when an application loads them.
Yes, and interestingly, because even when used correctly they "complicate audits". This is an interesting use of preprocessor macros, I'm strongly debating introducing something like this at work.
I'm not an expert in C, but then what's the issue with strncpy() or any "n" functions? It prevents overflow AFAIK. Also what is the alternative (memcpy?) and why?
strncpy() does not guarantee that the copied string would be terminated with a null byte ('\0'). For a call that looks like strncpy(dst, src, n), if there is no null byte in the first n bytes of src, the string copied to dst would also not contain a null byte.
Here is an example code to demonstrate the problem:
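Something along these lines (a minimal sketch of the idea):

#include <stdio.h>
#include <string.h>

int main(void)
{
    char dst[4];

    /* "foobar" has no '\0' within its first 4 bytes, so strncpy copies
     * exactly 4 bytes and never terminates dst. */
    strncpy(dst, "foobar", sizeof(dst));

    /* Undefined behavior: strlen reads past the end of dst, hunting for
     * a terminator that was never written. */
    printf("length: %zu\n", strlen(dst));
    return 0;
}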
If the strcp() macro is used with a function as the first argument (or an expression that has side-effects), it's not obvious at the call site (or, probably, intended by the programmer) that the argument will be evaluated twice.
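The macro under discussion isn't shown here, but a hypothetical one with the same flaw makes the point; strcp below is made up for illustration. Suppose someone writes:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical macro: copy into a 4-byte buffer and terminate it.
 * Note that dst appears twice in the expansion. */
#define strcp(dst, src) ((dst)[3] = '\0', strncpy((dst), (src), 3))

int main(void)
{
    /* The first argument has a side effect, so it runs twice. */
    char *p = strcp((char *)malloc(4), "foobar");

    printf("%.3s\n", p);   /* only safe because the read is limited here */
    free(p);
    return 0;
}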
The intent is for this to copy the beginning of the string "foobar" into a newly-allocated 4-byte buffer. Oops... this will actually call malloc twice (leaking the first buffer), and it won't have written the intended '\0' into the buffer that actually ends up getting used, so all bets are off...
If strcp were a function, its first argument would be evaluated just once, and it would work as intended.
strncpy has the danger of not putting a null character at the end of the destination if the source is longer than num.
> Copies the first num characters of source to destination. If the end of the source C string (which is signaled by a null-character) is found before num characters have been copied, destination is padded with zeros until a total of num characters have been written to it.
> No null-character is implicitly appended at the end of destination if source is longer than num. Thus, in this case, destination shall not be considered a null terminated C string (reading it as such would overflow).
The issue is that strncpy doesn't guarantee that the destination string is null-terminated after a copy, particularly in the case where n is shorter than the original string, so you could run into overflows if you then ran strlen() or similar against it.
strlcpy() is a nice alternative; it's from OpenBSD, from 1996. It was later added to the other BSDs, Solaris and macOS. But, last I checked, Linux never added it. :(
strlcpy is a terrible alternative. I have never seen it used correctly. It takes more and uglier code to use it correctly than to use strlen and strcpy. People who use it don't bother.
Finding use of strlcpy in a program is a red flag warning of sloppy code.
There are good reasons it was kept out of glibc for so long.
I agree that strlcpy is braindamaged, but why on God's green Earth isn't there a sensible replacement in the stdlib in 2019? This shouldn't be that hard, yet the best alternative is, I think, snprintf, which is just so ugly. For a long time it wasn't used because compiler support was spotty, but it's been 20 years now, so that shouldn't be a major issue anymore.
The (more-or-less) sensible replacement is std::string. Adoption has been spotty, for practical, ideological, and fetishistic reasons.
The first step is to compile the C program with a C++ compiler. Next, start making improvements. If you skip the first step, the ceiling on improvements is limited.
It doesn’t ensure that there’s a null terminator in the resultant string. This may potentially lead to a later overrun when the copied string is read back.
It wouldn't mean breaking things on compilers other than GCC. It would simply add compile-time checks when building with GCC, and would do nothing on other compilers.
Responsible? What a hyperbolic statement.
Could you enlighten me on what would be the significant negative consequences?
I hope that your mental equation takes into account the usefulness of GCC-only attributes, flags and extensions that is missed by staying clang compatible.
It's not meant to be a substitute for humans reviewing patches. It's just meant to catch things new contributors are likely to try to use, and give them feedback on it before they send the patch to the mailing list.
Truly curious as I don't code in C, or any other low level language. What is the problem with it? That there might not be enough memory allocated at *q?
I think gets() especially should never be used (but puts() is OK), but I think strcpy() and all of the other stuff they mention there is actually OK to use if you are careful.
It is not possible to provide implementations of these functions which are both safe (meaning, within human ability to use correctly) and standards-compliant, as they either don't take any size or require leaving off the null terminator when the destination buffer is full.
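For what it's worth, the standard-library replacement for gets() specifically is fgets(), which takes the buffer size and always terminates (it keeps the newline, which you usually strip); a minimal sketch:

#include <stdio.h>
#include <string.h>

int main(void)
{
    char line[256];

    if (fgets(line, sizeof(line), stdin)) {
        line[strcspn(line, "\n")] = '\0';   /* drop the trailing newline, if any */
        printf("read: %s\n", line);
    }
    return 0;
}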
If you have a critical need for memory safety then I think it's time to move on from C.
I also have a (I guess?) slightly unpopular opinion that C is actually really ugly. Subjective, but C code is usually prone to repeating itself a lot or abstracting in an unsafe manner.