In principle, I feel like "Ignore the language-provided standard libraries and use this little macro I wrote in all your code" is bad advice.
If the standard libs are that bad, the community should really lobby for the inclusion of a "safe" string copy into the C standard. Failing that, use a battle-tested third-party library like bstrlib[0].
The C standards committee is incredibly conservative, only adding new features if absolutely necessary. Unless the composition and goals of the committee drastically change, a new standard string API will never happen.
If you must write C, you should use a real string library, and not string.h.
strcpy_safe is a template function, not a macro. That immediately makes it orders of magnitude better than you were claiming.
I agree that a standard fix for this would be great. It's been five years since I wrote the post, the solution has been available for many years longer than that, and the standards committee has shown no interest in fixing this issue. In the interim I think it would be irresponsible for developers to continue using strncpy or any of the other unsafe options when strcpy_safe is so simple and effective.
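For context, the post's strcpy_safe can be sketched roughly like this (an illustration of the technique, not the post's exact code): the template parameter is deduced from the destination array, so the caller never passes a size at all.

```cpp
#include <cstddef>
#include <cstring>

// Sketch of a strcpy_safe-style template (assumed implementation).
// N is deduced from the destination array type, so the size argument
// can never be wrong, and a plain char* destination fails to compile.
template <size_t N>
void strcpy_safe(char (&dest)[N], const char* src) {
    size_t i = 0;
    for (; i < N - 1 && src[i] != '\0'; ++i)
        dest[i] = src[i];   // stop at the source's terminator
    dest[i] = '\0';         // always NUL-terminate the destination
}
```

Usage: `char buf[4]; strcpy_safe(buf, "hello");` leaves buf holding "hel", while calling it through a `char*` is a compile error.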
What choice do you have? “Fix the standard” is ideal, but few programmers have the luxury of waiting a decade for the fix to happen and become available in all their target compilers.
Strings are so tiny an abstraction over plain memory that every C programmer ends up doing them their own way. And that can be a good thing. I already proposed a couple of ways to represent or muck with strings in my other comment.
How is that a good thing? There are ways of abstracting/handling strings that are strictly safer than other ways. In an unsafe language like C, you're inviting problems by trying to do things in your own idiosyncratic way.
Sometimes performance or flexibility are important concerns. But feel free to manage millions of small strings with std::string or even as GC'ed objects :). And incur the overhead of indirection when attaching bounded-size strings to other data.
Defaults are important. C++ provides a much safer string type as default and let's programmers do their own thing in case it doesn't work for their use case (I'd try providing a different allocator before writing a different string type tbh). C on the other hand doesn't really provide a string type at all and forces everyone to start by writing a string library if they don't want to hunt buffer overflows until hell freezes over.
In what way could ever that be a "good thing" over having 2-3 implementations of strings in the standard library that cover 99% of the different ways and trade-offs, and which each has a method to convert a string to any of the others?
Rather use safeclib. It is C11.
It has a safe version of the wrongly spec'ed truncating functions, in this case strncpy_s. This version guarantees NULL termination.
strcpy_s is a good option for those cases where the appropriate response to having too long an input string is terminating the program. However it is a poor choice when continuing execution is desired.
Two years after this post, the Linux kernel added strscpy, the api of which is equivalent to the safe strncpy in this post. Internally, it stops copying once it reaches the null terminator.
The article's proposed implementation uses a template to guarantee that egregious mistakes will fail to compile. This is clever, but it's C++. The kernel's version isn't quite as safe because it can overrun the destination. In principle it's impossible for it to be safe, because this is C, and neither the compiler nor the programmer can tell how big the destination is. We have to rely, as the article complains, on the programmer calling the function to know how big the buffer is.
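For readers who want the shape of that API, here is a hedged approximation of strscpy's interface. The kernel's real implementation is optimized (word-at-a-time reads) and returns -E2BIG on truncation; this sketch returns -1 instead.

```cpp
#include <cstddef>
#include <cstring>

// Approximation of strscpy's semantics (illustration, not kernel source):
// copies at most count-1 bytes, stops at src's NUL terminator, always
// NUL-terminates dest, and reports truncation via the return value.
long strscpy_sketch(char* dest, const char* src, size_t count) {
    if (count == 0)
        return -1;                      // no room even for the terminator
    size_t i = 0;
    for (; i < count - 1 && src[i] != '\0'; ++i)
        dest[i] = src[i];
    dest[i] = '\0';
    return (src[i] == '\0') ? (long)i : -1;  // bytes copied, or -1 if truncated
}
```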
I've had patches rejected from the Linux kernel that tried to switch away from strncpy in favor of strlcpy. The behavior that strncpy continues to zero out the rest of the destination in the case where size of source is less than destination was being relied upon to not leak uninitialized memory to userspace. Seems strscpy alone suffers the same.
strscpy seems like a nice interface. Is an implementation outside of the kernel available? Preferably one that is permissively-licensed?
Also, it doesn't appear to be specified what happens if the count argument is too large to be represented as a ssize_t. The destination buffer would have to be extremely large, so it probably doesn't happen in practice, but it'd be good to specify it, or at least explicitly state it's unspecified / undefined.
It's too bad strncpy_s is so unwieldy to use. Having to pass the TRUNCATE flag is frustrating. I really want a two argument (dst, src) function that guarantees null termination.
> strlcpy is designed to solve the null-termination problems – it always null-terminates. It’s certainly an improvement over strncpy, however it isn’t natively available in VC++.
Let's not forget that it's also not natively available in glibc. Though you'll find it in every BSD you can think of.
> Like snprintf(3), the strlcpy() and strlcat() functions return the total length of the string they tried to create. For strlcpy() that means the length of src. For strlcat() that means the initial length of dst plus the length of src.
So if you've mmapped in something like the OpenStreetMap XML dump file and try to strlcpy out the first 100 bytes (because you didn't want to have to null terminate it yourself after a memcpy), you'll notice that your program is running rather slowly.
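That slowness follows directly from strlcpy's contract. A minimal sketch (OpenBSD-style semantics, not the actual BSD source) makes the cost visible: the return value must be the full length of src, so the whole source is scanned no matter how few bytes are copied.

```cpp
#include <cstddef>
#include <cstring>

// Sketch of strlcpy's contract (illustration only): the return value is
// strlen(src), so the entire source string is walked even when only
// size-1 bytes are copied. That is why copying 100 bytes out of a
// gigabyte-sized mapping takes time ~len(src), not time ~100.
size_t strlcpy_sketch(char* dst, const char* src, size_t size) {
    const size_t srclen = strlen(src);   // full scan of src, always
    if (size > 0) {
        const size_t n = srclen < size - 1 ? srclen : size - 1;
        memcpy(dst, src, n);
        dst[n] = '\0';                   // guaranteed NUL termination
    }
    return srclen;                       // callers check (result >= size) for truncation
}
```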
It's not about trusting user input. It's asking for a function to create a 100 byte substring of a very long string and expecting it to take time~100 not time~len(src).
In fairness, that is probably just not the target "market" for strlcpy. Presumably, it is meant for "please copy this whole string which I expect to fit in the target (but catch me in the rare case that it does not)".
It does trust the user input. If you are being attacked and the input has lost its null terminator then your strlcpy might core dump even if it doesn't leak data.
They made it this way to make it more of a drop-in replacement for strncpy, but IMHO they should have changed the return value to be the number of characters copied. If your return value is less than n, then your string was copied completely; if it is equal to n, then your string was truncated.
There are good reasons to use strlcpy over memcpy. For example you have a text parser where the most common case is each string is only a few bytes long, but you need to be able to handle odd cases where they are much longer. So you have a large buffer that you only use a tiny chunk of most times. With strlcpy it will be quick, but memcpy will be chugging the whole buffer each time.
I feel like there are a lot of good alternatives to C. What we did on our team was get our C compiling under g++, and then refactor as C++14. Now string copies are trivial and safe. I'm sure it would have been even safer to switch to Rust or something, but no one in my company has any experience with that, while there were a few C++ gurus.
What are the good alternatives? You can either copy memory manually, or have a system like C++'s that causes lots of tiny allocations, which is very inefficient. Both in space and time. You can also opt for immutable strings to be able to share strings more conveniently and efficiently, at the cost of O(n^2) build time (where automatic optimizations fail) or decreased convenience (explicit string-stream aka string builder).
If efficiency is not a concern for you and you rather like the convenience, good for you.
Personally I feel plain const char pointers (plus length, depending on expected size) are good enough to emulate the immutable string type. For modifiable strings, I still prefer viewing them as chunks of memory, using char pointer + length + capacity. Nice thing is, you can easily go from the former to the latter in the lifetime of any one string. You can also pool many strings in large chunks if you build only one string at any one time.
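A tiny sketch of the two representations described above; the struct and function names here are invented for illustration.

```cpp
#include <cstddef>
#include <cstring>

// Invented names, illustrating the comment's two representations:
struct StrView { const char* data; size_t len; };          // immutable: pointer + length
struct StrBuf  { char* data; size_t len; size_t cap; };    // modifiable: + capacity

// Going from a buffer to a view over its current contents is free,
// which is the "easily go from the former to the latter" point above.
StrView as_view(const StrBuf& b) { return StrView{ b.data, b.len }; }
```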
This is a horrible stance, and it gets repeated like a broken record. Please stop saying this. A good software engineer should know (1) a suitable data structure to solve the problem at hand and (2) algorithms whose asymptotic characteristics make the most sense for the problem (given the assumptions you can make). "Premature optimization is the root of all evil" applies to optimization and only to optimization. Choosing the right data structure is never an optimization; it's a design choice. You can't just start coding everything with a linked list and then optimize the bottlenecks; that's just not how it works.
As you say "A good software engineer should know a suitable data structure to solve the problem in hand".
So I expect a good software engineer on my team to be able to design a data structure where the fact that a linked list is being used doesn't hinder to change it afterwards to an array, a B-Tree or what have you.
Without changing a single line of client code I might add.
Such a good software engineer will certainly have attended an algorithms and data structures class during their CS degree and be aware of architecture designs based on Abstract Data Types.
Yes, even in C with its flaky translation units instead of proper module systems, it is possible to implement Abstract Data Types application architectures, which any good software engineer can easily tackle.
And for those that aspire to reach that level, here are a few examples:
- "Algorithms and Data Structures" by Niklaus Wirth (the Modula-2 edition; a later edition uses Oberon)
- "Introduction to Algorithms" by Thomas Cormen, Charles Leiserson, Ronald Rivest, Clifford Stein
- "Abstract Data Types and Modula-2: A Worked Example of Design Using Data Abstraction" by Richard Mitchell
- "Abstract Data Types in Modula-2" by Rachel Harrison
- "Abstract Data Types and Algorithms" by Manoochehr Azmoodeh
Your comment still doesn't make sense. Granted, choosing an ADS first and deciding on the implementation later based on empirical performance data is not unusual. But (1) this is not how people write C. There are ADS libraries in C that would let you swap implementations, but that is not the usual way to write C code, and it comes with the overhead of pointer indirection and worse cache performance, since all your objects will be void* stored in the DS (unless you m4- or macro-generate your DS for every data type; Linux does this, and I sometimes use the technique too, but it is hardly the usual way to write C). And (2) complex problems usually require special enough data structures that it doesn't make much sense to talk about ADSs. Everyone can code basic foobar against the List interface and then decide between ArrayList and LinkedList based on performance. It's more interesting when the programmer decides to use a set, a hash map, a bloom filter, etc. to solve the problem because no other type would make sense.
It's not about the code, but about choice of data structure. Unfortunately that has to be planned ahead to some degree. You can't really optimize it after the fact. Typically a choice affects more than a single line of code, but that doesn't mean "microoptimizing every line of code", which is obviously a bad idea.
That's a pipe dream. There are a handful of successful abstractions, like streams or associative containers, or strings for that matter. Even those quickly get too abstract for serious use, and you need to actually expose what's really happening to make sensible choices, because there are vastly different kinds of streams or containers (or even strings), because, well, those are abstractions.
Other more far reaching design choices could be: when and how much precomputation is done, data loading strategies, data-flow oriented vs control flow oriented, threading strategies, etc. You can't hide these choices behind a name. They deeply affect the structure of the program.
I don’t understand why silently truncating strings is considered to be acceptable practice.
If you have the length of both, simply assert that the length of the source is less than the size of the destination. Problem solved (bugs become obvious in testing).
The more I think about this, the more sense it makes to say you should check the length of the buffer before they copy. I'm not sure an assert is always right, exactly what should be done if the string doesn't fit may depend on what the program is doing. But silently truncating is generally not the right thing to do.
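One hedged way to implement the check-before-copy idea from these comments; the function name is invented, and the release-build fallback (clamping) is just one possible policy.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Check the length before copying so over-long strings are loud, not
// silent. The assert catches bugs in testing; the clamp below it is one
// (debatable) choice of release-build behavior.
void copy_checked(char* dst, size_t dst_size, const char* src) {
    size_t n = strlen(src);
    assert(n < dst_size);          // bugs become obvious in testing
    if (n >= dst_size)             // release builds: clamp instead of overrun
        n = dst_size - 1;
    memcpy(dst, src, n);
    dst[n] = '\0';
}
```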
What about user input, strings coming from files, and in general the diversity of things that can go wrong when your product ships? This is particularly important if your product is attacked by those who would like to exploit it.
strcpy_s does this. It has a template variant to infer the length of the destination buffer and it will terminate the program if the source string is too large.
If the destination buffer doesn't have a fixed length then it will fail to compile.
This is similar to your suggestion with these improvements:
- Template deduction of the destination size is done - this is essential to avoid bugs
- Detection of overwrites is done in release builds as well as during testing
- Detection of overwrites is automatic - it doesn't require manually adding asserts to call sites
- strcpy_s exists in some standard libraries
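A hedged sketch of what such a template overload can look like (MSVC's secure CRT ships similar C++ overloads; this sketch's name and abort-on-overflow policy are illustrative):

```cpp
#include <cstddef>
#include <cstdlib>
#include <cstring>

// Sketch of a strcpy_s-style template overload (invented name). The
// destination size is deduced, and overflow terminates the program,
// mirroring strcpy_s's fail-fast behavior described above.
template <size_t N>
void strcpy_s_sketch(char (&dest)[N], const char* src) {
    const size_t len = strlen(src);
    if (len >= N)
        abort();                 // source too long: fail fast rather than truncate
    memcpy(dest, src, len + 1);  // safe: we just verified it fits, terminator included
}
```

A `char*` destination has no deducible size, so calling this through a pointer simply fails to compile.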
Because the API is stupid and not intuitive. It's for fixed-size memory buffers where the string is either NUL-terminated or exactly fills the buffer. This saves an extra byte. Yay?
I'd say "Stop using C and C++ already" or "Stop using null terminated strings already", but I recognize that those are not always practical solutions.
I do have some maintainability concerns though about the promulgation of numerous competing home-brew solutions to a near trivial problem throughout a code base.
Well, again, you have to assume that the caller called you right, i.e. that the pointer you were passed points to a buffer with exactly one zero in it, at the end. Furthermore, unless there's also a (potentially wrong) length argument, you have to read every byte of the destination buffer while you're copying and stop when you see a zero. There's a lot to be said for a reimplementation like bstrlib that wraps char arrays in structs and gives you replacements for most of the string functions, but then you still have all these C APIs to deal with that use char*. And you have the overhead of trying to do things right at runtime. Sibling comment mocks languages that do compile-time safety checking, but I totally see the appeal.
With `strncpy`, the caller assumes responsibility for the buffer size (by specifying that size explicitly every single time). I thought that responsibility was not being debated within the current context.
That responsibility - for passing the correct buffer size - is one of the issues. I have seen many bugs where the caller simply passed the wrong size to strncpy. If you have a fixed-size buffer then the compiler can infer the size for you, thus making this class of bugs go away entirely.
TL;DR - I've reviewed enough code to know that software developers pass the wrong size to strncpy and friends.
So you need a data structure (or language) that has the size embedded in it, and the knowledge of (and restrictions on) that data structure needs to spread to every library or function that one uses (we are talking about string handling here). It appears to me that you have cornered yourself into the position that the only solution is to modify the C/C++ standards to redefine string handling.
I am not saying that is bad. But solutions already exist. One is using the string class; another is transitioning to a dynamic language (having the size as part of the data structure and checking it at every usage is one step toward dynamic types).
Alternatively, the compiler can take over this responsibility by statically checking every call (the Rust solution). That is a different language. On that, I have a question: how can we pass the syntactic buffer size into a static library without embedding the size into the data? If we do embed the size, then it is still dynamic within the context of the library.
Either way, one should not still be discussing the safety of strncpy here.
Now, as long as we are discussing strncpy, I assume it is a given that we do not want the buffer size to be a permanent part of the simple string type and that we are not looking for solutions that demand an overhaul of the language or libraries. I assume it is a given that the caller takes responsibility for the buffer size, trading a bit of responsibility for a simpler system and greater control.
You won't have a good solution (or a good discussion) without confining the context first.
You're quite right, it's important to set the context. But no matter what context you pick, your choice is costly:
* We're going to stick with C and its standard library: lots of room to continue to make errors.
* We're going to stick with C, but it's OK to consider an alternate implementation of strings: the C standard, the standard library, and lots of other C libraries are full of functions that deal in char* strings, so you're stuck with them. Also, runtime overhead increases.
* We're willing to consider other languages: other languages have compiler overhead, semantic overhead, runtime overhead, or all three. Also, you still have to interoperate with C.
Barely related, but I had a good laugh when Ori Bernstein (creator of Myrddin, look it up) was asked, "So, is your language memory safe?" "No, it's memory dangerous."
I can't quite tell why, but I found it kind of insightful and it made me feel a little better about my own inclination to work on lower levels.
Of course, but everyone wants to overdo the complexity of the time-worn solution. OMG, you need to null-terminate the string after all the _other_ gymnastics... gee, C sure does suck! Why don't we use Rust|Go|C++, ad nauseam.
  bzero(buf, sz);            /* memset nazis here */
  strncpy(buf, src, sz - 1);
It doesn't make sense to have to null-terminate the string by hand (maybe) after doing a "safe" STRING copy. Which means it's easy to forget to do and that's a dangerous wart in the design.
Really, stop using C. To quote Hayao Miyazaki, C was a mistake.
See the bug? Of course not, because it's not visible, but I've seen this pattern used where dst is a pointer. This means that four or eight bytes (three or seven if you don't count the null terminator) gets copied, regardless of how many bytes dst points to. With strcpy_safe an attempt to copy to dst would not work and the developer would have to fix their code.
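A hedged reconstruction of the invisible bug being described (the function and variable names are invented):

```cpp
#include <cstddef>
#include <cstring>

// When dst is a pointer, sizeof(dst) is the size of the POINTER (4 or 8
// bytes), not the size of the buffer it points to. This compiles cleanly
// and silently copies at most 3 or 7 characters plus the terminator.
void copy_field(char* dst, const char* src) {
    const size_t n = sizeof(dst);    // BUG: size of a char*, not the buffer
    strncpy(dst, src, n - 1);        // copies at most 3 or 7 bytes
    dst[n - 1] = '\0';               // terminator at index 3 or 7
}
```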
strncpy is fine, however it was never meant for null-terminated strings! A safe alternative is using snprintf and checking the return value for truncation.
snprintf is hardly safe. It does guarantee null termination, so that is progress, but it requires the caller to pass the buffer size. As I mention in the "More typing means more errors" section of the article software developers are endlessly creative about ways to pass the wrong buffer size. They usually pass the right buffer size, but 99.9% correct still means a lot of bugs. These bugs can be avoided by using a template function to infer the destination buffer size. A template wrapper around snprintf would work very nicely.
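That wrapper is only a few lines. A hedged sketch, with the name snprintf_safe invented for illustration:

```cpp
#include <cstddef>
#include <cstdio>
#include <cstring>

// Template wrapper around snprintf (invented name). N is deduced from
// the destination array, so the buffer size can never be passed wrong,
// and snprintf guarantees NUL termination within N bytes.
template <size_t N, typename... Args>
int snprintf_safe(char (&dest)[N], const char* fmt, Args... args) {
    return snprintf(dest, N, fmt, args...);
}
```

As with snprintf, a return value >= N signals that the output was truncated.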
> These bugs can be avoided by using a template function to infer the destination buffer size
I'm a big fan of using strategies to avoid bugs. Another strategy than can be a big help is to spend a little time thinking about what kinds of strings you are going to be using and how big they might be. And then sanitizing your program input based on those sizes. Find the too-long strings before they get into your program and before you start copying them around.
The system I'm responsible for came into being as a set of programs on a Univac II and was a bunch of C and FORTRAN when I joined. Because of the need to pass data between C and FORTRAN programs/code, we had a set of 10-20 standard string sizes, based on what kind of text you had. We no longer write new code in either C or FORTRAN, but maintain the concept of standard sizes for our text. By specifying the kind of text (we also do this with numbers) in the design steps, we save far more time and effort than we spend over the entire SDLC.
[0] http://bstring.sourceforge.net/