Hacker News
Stop using strncpy already (2013) (randomascii.wordpress.com)
53 points by ohjeez 5 months ago | 78 comments



In principle, I feel like "Ignore the language-provided standard libraries and use this little macro I wrote in all your code" is bad advice.

If the standard libs are that bad, the community should really lobby for the inclusion of a "safe" string copy into the C standard. Failing that, use a battle-tested third-party library like bstrlib[0].

[0] http://bstring.sourceforge.net/


The C standards committee is incredibly conservative, only adding new features if absolutely necessary. Unless the composition and goals of the committee drastically change, a new standard string API will never happen.

If you must write C, you should use a real string library, and not string.h.


Nonsense, strncpy and friends are just fi2•∟1F↔Ñ


Looks like a Perl script I read once.


They should have a superset of the C standard for writing "secure C", as is important in some kernel or application code.


strcpy_safe is a template function, not a macro. That immediately makes it orders of magnitude better than you were claiming.

I agree that a standard fix for this would be great. It's been five years since I wrote the post, the solution has been available for many years longer than that, and the standards committee has shown no interest in fixing this issue. In the interim I think it would be irresponsible for developers to continue using strncpy or any of the other unsafe options when strcpy_safe is so simple and effective.
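For readers who haven't seen the post, a strcpy_safe-style template can be sketched roughly like this (an illustration based on the description above, not the article's exact code):

```cpp
#include <cstddef>
#include <cstring>

// Sketch of a strcpy_safe-style function: the destination must be a
// fixed-size char array, so the compiler deduces N and a plain char*
// destination fails to compile.
template <std::size_t N>
void strcpy_safe(char (&dst)[N], const char* src) {
    std::strncpy(dst, src, N - 1);  // copies at most N-1 characters
    dst[N - 1] = '\0';              // always null-terminate
}
```

Calling it with `char buf[16]` truncates safely with guaranteed termination; calling it with a `char*` is a compile error, which is the point.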


What choice do you have? “Fix the standard” is ideal, but few programmers have the luxury of waiting a decade for the fix to happen and become available in all their target compilers.


Strings are so tiny an abstraction over plain memory that every C programmer ends up doing them their own way. And that can be a good thing. I already proposed a couple of ways to represent or muck with strings in my other comment.


How is that a good thing? There are ways of abstracting/handling strings that are strictly safer than other ways. In an unsafe language like C, you're inviting problems by trying to do things in your own idiosyncratic way.


Sometimes performance or flexibility are important concerns. But feel free to manage millions of small strings with std::string or even as GC'ed objects :). And incur the overhead of indirection when attaching bounded-size strings to other data.


Defaults are important. C++ provides a much safer string type as the default and lets programmers do their own thing in case it doesn't work for their use case (I'd try providing a different allocator before writing a different string type tbh). C, on the other hand, doesn't really provide a string type at all and forces everyone to start by writing a string library if they don't want to hunt buffer overflows until hell freezes over.


Rust manages to do it just fine...


How?


Apparently IBM and Unisys mainframe systems programming languages did just fine.


In what way could that ever be a "good thing" compared to having 2-3 implementations of strings in the standard library that cover 99% of the different ways and trade-offs, each of which has a method to convert a string to any of the others?


I'd rather use safeclib. It is C11. It has a safe version of the wrongly spec'ed truncating functions, in this case strncpy_s. This version guarantees null termination.

https://rurban.github.io/safeclib/


Unfortunately C doesn't have a widely accepted dependency/build manager either.


strcpy_s, anyone?



strcpy_s is a good option for those cases where the appropriate response to having too long an input string is terminating the program. However it is a poor choice when continuing execution is desired.


Two years after this post, the Linux kernel added strscpy, whose API is equivalent to the safe strncpy in this post. Internally, it stops copying once it reaches the null terminator.

https://lwn.net/Articles/659214/
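For those who haven't seen it, the interface can be sketched like this (a simplification: the real kernel function returns -E2BIG on truncation and uses word-at-a-time copies; the name here is made up to avoid clashing):

```cpp
#include <cstddef>

// Simplified sketch of strscpy semantics: stop at the source's null
// terminator, always null-terminate the destination, and report
// truncation via the return value (the kernel returns -E2BIG; -1 here).
long strscpy_sketch(char* dst, const char* src, std::size_t count) {
    if (count == 0) return -1;
    std::size_t i = 0;
    for (; i < count - 1 && src[i] != '\0'; ++i)
        dst[i] = src[i];            // stops early at the terminator
    dst[i] = '\0';                  // guaranteed termination
    return src[i] == '\0' ? (long)i : -1;  // chars copied, or truncation
}
```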


The article's proposed implementation uses a template to guarantee that egregious mistakes will fail to compile. This is clever, but it's C++. The kernel's version isn't quite as safe because it can overrun the destination. In principle it's impossible for it to be safe, because this is C, and neither the compiler nor the programmer can tell how big the destination is. We have to rely, as the article complains, on the programmer calling the function to know how big the buffer is.


I've had patches rejected from the Linux kernel that tried to switch away from strncpy in favor of strlcpy. The fact that strncpy zero-fills the rest of the destination when the source is shorter than the destination was being relied upon to avoid leaking uninitialized memory to userspace. It seems strscpy alone suffers from the same problem.


strscpy seems like a nice interface. Is an implementation outside of the kernel available? Preferably one that is permissively-licensed?

Also, it doesn't appear to be specified what happens if the count argument is too large to be represented as a ssize_t. The destination buffer would have to be extremely large, so it probably doesn't happen in practice, but it'd be good to specify it, or at least explicitly state it's unspecified / undefined.

https://www.kernel.org/doc/htmldocs/kernel-api/API-strscpy.h...


Also Microsoft's strncpy_s has been around forever.


But Microsoft's strncpy_s is unsafe. My safeclib also had the unsafe versions behind the non-default ./configure --enable-unsafe switch.

But this year I decided to make them safe instead. In this case the spec is broken.


It's too bad strncpy_s is so unwieldy to use. Having to pass the TRUNCATE flag is frustrating. I really want a two argument (dst, src) function that guarantees null termination.


I feel like all of these articles can be summed up with:

"C strings and string functions aren't designed to be safe. If you want safe strings use a library."


> strlcpy is designed to solve the null-termination problems – it always null-terminates. It’s certainly an improvement over strncpy, however it isn’t natively available in VC++.

Let's not forget that it's also not natively available in glibc. Though you'll find it in every BSD you can think of.


Sadly, it still has this braindamage:

  Like snprintf(3), the strlcpy() and strlcat() functions return the total
  length of the string they tried to create.  For strlcpy() that means the
  length of src.  For strlcat() that means the initial length of dst plus
  the length of src.
So if you've mmapped in something like the OpenStreetMap XML dump file and try to strlcpy out the first 100 bytes (because you didn't want to have to null-terminate it yourself after a memcpy), you'll notice that your program is running rather slowly.
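A minimal strlcpy that follows the quoted man-page semantics shows where the time goes (an illustrative sketch, not OpenBSD's actual code):

```cpp
#include <cstddef>

// Illustrative strlcpy: the loop must walk the ENTIRE source string to
// compute the return value, even though only size-1 bytes are copied.
// That is why copying a 100-byte prefix of a huge mapping is slow.
std::size_t strlcpy_sketch(char* dst, const char* src, std::size_t size) {
    std::size_t i = 0;
    for (; src[i] != '\0'; ++i)     // keeps scanning past the copy limit
        if (i + 1 < size)
            dst[i] = src[i];        // only the first size-1 bytes copied
    if (size > 0)
        dst[i < size ? i : size - 1] = '\0';
    return i;                        // total length of src, per the spec
}
```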


"Let's read all the data you don't want because!" nice


That is trusting user input. You don't do that, right?


It's not about trusting user input. It's asking for a function to create a 100 byte substring of a very long string and expecting it to take time~100 not time~len(src).

In fairness, that is probably just not the target "market" for strlcpy. Presumably, it is meant for "please copy this whole string which I expect to fit in the target (but catch me in the rare case that it does not)".


It does trust the user input. If you are being attacked and the input has lost its null terminator then your strlcpy might core dump even if it doesn't leak data.

They made it this way to make it more of a drop-in replacement for strncpy, but IMHO they should have changed the return value to be the number of characters copied. If your return value is less than n, then your string was copied completely, if it is equal to n then your string was truncated.
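That proposed return-value change could look like this (a hypothetical variant, not an existing function):

```cpp
#include <cstddef>

// Hypothetical strlcpy variant with the return value proposed above:
// returns the number of characters copied; a return value equal to n
// means the string was truncated, anything less means it fit.
std::size_t strlcpy_alt(char* dst, const char* src, std::size_t n) {
    if (n == 0) return 0;
    std::size_t i = 0;
    for (; i < n - 1 && src[i] != '\0'; ++i)
        dst[i] = src[i];            // never reads past the copy limit
    dst[i] = '\0';
    return src[i] == '\0' ? i : n;  // == n signals truncation
}
```

Unlike the real strlcpy, this never touches source bytes past the limit, so the mmap case above stays fast.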

There are good reasons to use strlcpy over memcpy. For example you have a text parser where the most common case is each string is only a few bytes long, but you need to be able to handle odd cases where they are much longer. So you have a large buffer that you only use a tiny chunk of most times. With strlcpy it will be quick, but memcpy will be chugging the whole buffer each time.


As you note: your example is specious.


What example? Are you confusing me with GP commenter perhaps?


Don't treat mmapped files as strings.


I feel like there are a lot of good alternatives to C. What we did on our team was get our C compiling under g++, and then refactor as C++14. Now string copies are trivial and safe. I'm sure it would have been even safer to switch to Rust or something, but no one in my company has any experience with that, while there were a few C++ gurus.


What are the good alternatives? You can either copy memory manually, or have a system like C++'s that causes lots of tiny allocations, which is very inefficient in both space and time. You can also opt for immutable strings to be able to share strings more conveniently and efficiently, at the cost of O(n^2) build time (where automatic optimizations fail) or decreased convenience (an explicit string stream, aka string builder).

If efficiency is not a concern for you and you rather like the convenience, good for you.

Personally I feel plain const char pointers (plus length, depending on expected size) are good enough to emulate the immutable string type. For modifiable strings, I still prefer viewing them as chunks of memory, using char pointer + length + capacity. Nice thing is, you can easily go from the former to the latter in the lifetime of any one string. You can also pool many strings in large chunks if you build only one string at any one time.
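The two representations described here can be sketched as a pair of structs (names invented for illustration):

```cpp
#include <cstddef>

// Sketch of the two string shapes described above (names made up):
// an immutable view (pointer + length) and a mutable builder
// (pointer + length + capacity).
struct StrView {
    const char* data;
    std::size_t len;
};

struct StrBuf {
    char*       data;
    std::size_t len;
    std::size_t cap;
};

// Moving between the two shapes over a string's lifetime is free:
// a builder trivially yields an immutable view of what's built so far.
StrView as_view(const StrBuf& b) { return StrView{b.data, b.len}; }
```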


You care about that after a profiler proves that it is actually a problem for the application's expected delivery SLAs.

Micro-optimizing each line of C code based on gut feeling is not an option if quality matters.


This is a horrible stance, and it gets repeated like a broken record. Please stop saying this. A good software engineer should know (1) a suitable data structure to solve the problem at hand and (2) algorithms whose asymptotic characteristics make the most sense for the problem (what assumptions can you make). "Premature optimization is the root of all evil" can be used for, and only for, "optimization". Choosing the right data structure is never an optimization; it's a design choice. You can't just start coding everything with a linked list and then optimize bottlenecks; that's just not how it works.


As you say "A good software engineer should know a suitable data structure to solve the problem in hand".

So I expect a good software engineer on my team to be able to design a data structure where the fact that a linked list is being used doesn't hinder changing it afterwards to an array, a B-tree, or what have you.

Without changing a single line of client code I might add.

Such a good software engineer will certainly have attended an algorithms and data structures class in their CS degree and be aware of architecture designs based on Abstract Data Types.

Yes, even in C with its flaky translation units instead of proper module systems, it is possible to implement Abstract Data Types application architectures, which any good software engineer can easily tackle.

And for those that aspire to reach that level, here are a few examples.

"Algorithms + Data Structures = Programs" by Niklaus Wirth; the 1st edition uses Modula-2, the 2nd Oberon

"Introduction to Algorithms" by Thomas Cormen, Clifford Stein, Ronald Rivest, Charles Leiserson

"Abstract data types and Modula-2: a worked example of design using data abstraction" by Richard Mitchell

"Abstract Data Types in Modula-2" by Rachel Harrison

"Abstract Data Types and Algorithms" by Manoochehr Azmoodeh


Your comment still doesn't make sense. Deciding on a certain ADS and then later deciding on the implementation based on empirical performance data is not unusual. But (1) this is not how people write C. There are ADS libraries in C with which you wouldn't need to use any DS that you can't change, but this is not a usual way to write C code. It also comes with the overhead of pointer indirection and worse cache performance, since all your objects will be void* stored in the DS (unless you m4- or macro-generate your DS for all datatypes; this is what Linux does, for example, and I sometimes use this technique too, but it is hardly the usual way to write C code). And (2) complex problems usually require special enough data structures that it doesn't make much sense to talk about ADSs. Everyone can code basic foobar with a List interface and then decide to use ArrayList or LinkedList based on performance. It's more interesting when the programmer decides to use a set, a hash map, a bloom filter, etc. to solve the problem because any other type wouldn't make sense.


We will have to agree to disagree, then.

Software quality and meeting delivery SLAs should trump micro-optimizing cache access per code line.

So if SLAs define 100ms per access, no need to go crazy making everything under 5ms.

Thankfully, the Linux foundation has another view on the matter.

https://www.youtube.com/watch?v=XfNt6MsLj0E


It's not about the code, but about choice of data structure. Unfortunately that has to be planned ahead to some degree. You can't really optimize it after the fact. Typically a choice affects more than a single line of code, but that doesn't mean "microoptimizing every line of code", which is obviously a bad idea.


Sure you can, that is what Abstract Data Types architecture design is all about.

Naturally, when one creates structs and accesses the fields directly there is no way out of a bad design.


That's a pipe dream. There are a handful of successful abstractions, like streams or associative containers, or strings for that matter. Even those quickly get too abstract for serious use, and you need to actually expose what's really happening to make sensible choices, because there are vastly different kinds of streams or containers (or even strings), because, well, those are abstractions.

Other more far reaching design choices could be: when and how much precomputation is done, data loading strategies, data-flow oriented vs control flow oriented, threading strategies, etc. You can't hide these choices behind a name. They deeply affect the structure of the program.


A pipe dream delivered into production around 2000, for a distributed CRM platform running across Aix, Solaris, HP-UX, Red-Hat Linux, Windows NT/2000.


Whatever alternative reliably prevents buffer overflows, so pretty much anything but char*.


I don’t understand why silently truncating strings is considered to be acceptable practice.

If you have the length of both, simply assert that the length of the source is <= the length of the destination. Problem solved (bugs become obvious in testing).


The more I think about this, the more sense it makes to say you should check the length of the buffer before the copy. I'm not sure an assert is always right; exactly what should be done if the string doesn't fit may depend on what the program is doing. But silently truncating is generally not the right thing to do.


What about user input, strings coming from files, and in general the diversity of things that can go wrong when your product ships? This is particularly important if your product is attacked by those who would like to exploit it.

Asserts are necessary, but not sufficient.


strcpy_s does this. It has a template variant to infer the length of the destination buffer and it will terminate the program if the source string is too large.

If the destination buffer doesn't have a fixed length then it will fail to compile.

This is similar to your suggestion with these improvements:

- Template deduction of the destination size is done - this is essential to avoid bugs

- Detection of overwrites is done in release builds as well as during testing

- Detection of overwrites is automatic - it doesn't require manually adding asserts to call sites

- strcpy_s exists in some standard libraries


Because the API is stupid and not intuitive. It's for memory buffers where the string is either NULL terminated or it is the length of the memory buffer. This saves an extra byte. yay?


I'll add that it's really handy for that specific situation as it also avoids leaking uninitialized bytes.



I'd say "Stop using C and C++ already" or "Stop using null terminated strings already", but I recognize that those are not always practical solutions.

I do have some maintainability concerns though about the promulgation of numerous competing home-brew solutions to a near trivial problem throughout a code base.


If we preset the end of the buffer with '\0' and then assume the buffer has one less byte of capacity, wouldn't that address the strncpy issue?


Well, again, you have to assume that the caller called you right, i.e. that the pointer you were passed points to a buffer with exactly one zero in it, at the end. Furthermore, unless there's also a (potentially wrong) length argument, you have to read every byte of the destination buffer while you're copying and stop when you see a zero. There's a lot to be said for a reimplementation like bstrlib that wraps char arrays in structs and gives you replacements for most of the string functions, but then you still have all these C APIs to deal with that use char*. And you have the overhead of trying to do things right at runtime. Sibling comment mocks languages that do compile-time safety checking, but I totally see the appeal.


I am not sure I follow the extent of the problem.

With `strncpy`, the caller assumes the responsibility for the buffer size (by specifying that size explicitly every single time). I thought that responsibility was not being debated within the current context.


> I thought that responsibility is not being debated within the current context

I bring it up because the article makes programming errors on the part of the caller part of the case for its "safe" solution.


That responsibility - for passing the correct buffer size - is one of the issues. I have seen many bugs where the caller simply passed the wrong size to strncpy. If you have a fixed-size buffer then the compiler can infer the size for you, thus making this class of bugs go away entirely.

TL;DR - I've reviewed enough code to know that software developers pass the wrong size to strncpy and friends.


So you need a data structure (or language) that has the size embedded in it, and the knowledge (and restrictions) of that data structure needs to spread to every library or function that one uses (we are talking about string handling here). It appears to me that you have cornered yourself into concluding that the only solution is to modify the C/C++ standards to redefine string handling.

I am not saying that is bad. But solutions already exist. One is using the string class; another is transitioning to a dynamic language (having the size as part of the data structure and checking it at every usage is one step toward dynamic types).

Alternatively, the compiler can take over this responsibility by statically checking every call (the Rust solution). That is a different language. On that, I have a question: how can we pass the syntactic buffer size into a static library without embedding the size into the data? If we do embed the size, then it is still dynamic within the context of the library.

Either way, one should not still be discussing the safety of strncpy here.

Now, as long as we are discussing strncpy, I assume it is given that we do not want the buffer size as a permanent part of the simple string type, and that we are not looking for any solutions that demand an overhaul of the language/libraries. I assume it is given that the caller takes responsibility for the buffer size -- exchanging a bit of responsibility for a simpler system and greater control.

You won't have a good solution (or good discussions) without confining our context first.


You're quite right, it's important to set the context. But no matter what context you pick, your choice is costly:

* We're going to stick with C and its standard library: lots of room to continue to make errors.

* We're going to stick with C, but it's OK to consider an alternate implementation of strings: the C standard, the standard library, and lots of other C libraries are full of functions that deal in char* strings, so you're stuck with them. Also, runtime overhead increases.

* We're willing to consider other languages: other languages have compiler overhead, semantic overhead, runtime overhead, or all three. Also, you still have to interoperate with C.


Even as advocate of mostly safe systems languages I don't see a way of replacing C in the context of UNIX based platforms.

So the main issue is how to make C devs actually adopt some kind of safer C dialect and migrate towards it.


Barely related, but I had a good laugh when Ori Bernstein (creator of Myrddin, look it up) was asked, "So, is your language memory safe?" - "No, it's memory dangerous."

I can't quite tell why, but I found it kind of insightful and it made me feel a little better about my own inclination to work on lower levels.


Myrddin looks interesting, thanks for sharing.

And I also find the remark funny. :)


Of course, but everyone wants to overdo the complexity of the time-worn solution. OMG, you need to null-terminate the string after all the _other_ gymnastics... gee, C sure does suck! Why don't we use rust|go|c++, ad nauseam.

    bzero(buf, sz);  /* memset nazis here */
    strncpy(buf, src, sz - 1);


It doesn't make sense to have to null-terminate the string by hand (maybe) after doing a "safe" STRING copy. Which means it's easy to forget to do and that's a dangerous wart in the design.

Really, stop using C. To quote Hayao Miyazaki, C was a mistake.


[flagged]


Posting like this will get you banned here. Please review and follow the rules: https://news.ycombinator.com/newsguidelines.html.


What I meant is more like:

    #define BUFSIZE 100
    char buf[BUFSIZE + 1];
    buf[BUFSIZE] = 0;  // for a global var, it is already zeroed, right?
Then one can call strncpy(buf, src, BUFSIZE) without worrying about null termination.


Been a while since I used C++...

How does this work if the size of the destination isn't known at compile time? I.e., it's not "buffer[5]"?


You get a compile error.

And this is progress. I have fixed code that looked something like this:

    strncpy(dst, src, sizeof(dst));
    dst[sizeof(dst)-1] = 0;
See the bug? Of course not, because it's not visible, but I've seen this pattern used where dst is a pointer. This means that four or eight bytes (three or seven if you don't count the null terminator) gets copied, regardless of how many bytes dst points to. With strcpy_safe an attempt to copy to dst would not work and the developer would have to fix their code.

Every mistake that can be made, will be made.


The author states

> The focus of this post is on fixed-size string buffers, but the technique applies to any type of fixed-length buffer.


strncpy is fine; however, it was never meant for null-terminated strings! A safe alternative is using snprintf and checking the return value for truncation.


snprintf is hardly safe. It does guarantee null termination, so that is progress, but it requires the caller to pass the buffer size. As I mention in the "More typing means more errors" section of the article software developers are endlessly creative about ways to pass the wrong buffer size. They usually pass the right buffer size, but 99.9% correct still means a lot of bugs. These bugs can be avoided by using a template function to infer the destination buffer size. A template wrapper around snprintf would work very nicely.
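Such a template wrapper is only a few lines (an illustrative sketch with a made-up name, not code from the article):

```cpp
#include <cstddef>
#include <cstdio>

// Sketch of a template wrapper around snprintf: the destination size is
// deduced from the array type, so the caller cannot pass the wrong one.
// Returns true if src fit, false if it was truncated.
template <std::size_t N>
bool copy_via_snprintf(char (&dst)[N], const char* src) {
    int needed = std::snprintf(dst, N, "%s", src);  // always terminates
    return needed >= 0 && static_cast<std::size_t>(needed) < N;
}
```

As with strcpy_safe, passing a plain `char*` destination fails to compile, which removes the wrong-size class of bugs entirely.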


> These bugs can be avoided by using a template function to infer the destination buffer size

I'm a big fan of using strategies to avoid bugs. Another strategy than can be a big help is to spend a little time thinking about what kinds of strings you are going to be using and how big they might be. And then sanitizing your program input based on those sizes. Find the too-long strings before they get into your program and before you start copying them around.

The system I'm responsible for came into being as a set of programs on a Univac II and was a bunch of C and FORTRAN when I joined. Because of the need to pass data between C and FORTRAN programs/code, we had a set of 10-20 standard string sizes, based on what kind of text you had. We no longer write new code in either C or FORTRAN, but maintain the concept of standard sizes for our text. By specifying the kind of text (we also do this with numbers) in the design steps, we save far more time and effort than we spend over the entire SDLC.




