Strlcpy and strlcat added to glibc 2.38 (sourceware.org)
178 points by synergy20 on July 17, 2023 | 243 comments



As a bit of historical context:

"This is horribly inefficient BSD crap. Using these function only leads to other errors. Correct string handling means that you always know how long your strings are and therefore you can you memcpy (instead of strcpy).

Beside, those who are using strcat or variants deserved to be punished."

- Ulrich Drepper, around 23 years ago: https://sourceware.org/legacy-ml/libc-alpha/2000-08/msg00053...
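
To make the quoted position concrete, here is a minimal sketch of joining two strings when both lengths are already known, using memcpy instead of strcat. The helper name join_known and its error convention are assumptions made for illustration; nothing like it appears in the linked message.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: joins two byte ranges whose lengths the caller
   already knows. Returns the number of bytes written, excluding the
   terminating NUL, or (size_t)-1 if dst is too small (nothing is written). */
size_t join_known(char *dst, size_t dst_cap,
                  const char *a, size_t a_len,
                  const char *b, size_t b_len)
{
    assert(dst != NULL && a != NULL && b != NULL);

    /* Compare against the remaining space step by step so that no sum
       of lengths can overflow. */
    if (dst_cap == 0 || a_len > dst_cap - 1 || b_len > dst_cap - 1 - a_len)
        return (size_t)-1;

    memcpy(dst, a, a_len);
    memcpy(dst + a_len, b, b_len);
    dst[a_len + b_len] = '\0';   /* keep the result usable as a C string */
    return a_len + b_len;
}
```

Neither input is scanned for a terminator, and a destination that is too small is reported rather than silently truncated.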


He is not wrong, is he? If you are using null terminated strings that's the thing you need to fix.

I still support this addition. If you are doing methamphetamine with needle sharing you should stop methamphetamine, but distributing clean needles is still an improvement.


He's not wrong. The main reason to have these functions is that other implementations have them, and programs are using them, and have to define those functions themselves when ported to glibc.

One benefit of defining strlcpy yourself is that you can define it as a macro that expands to an open-coded call to snprintf, and then that is diagnosed by GCC; you may get static warnings about possible truncation. (I suspect GCC might not yet be analyzing strlcpy/strlcat calls, but that could change.)

The functions silently discard data in order to achieve memory safety. Historically, that has been viewed as acceptable in C coding culture. There are situations in which that is okay, like truncating some unimportant log message to "only" 1024 characters.

Truncating can cause an exploitable security hole; like some syntax is truncated so that its closing brace is missing, and the attacker is able to somehow complete it maliciously.

Even when arbitrary limits are acceptable, silently enforcing them in a low-level copying function may not be the best place in the program. If the truncation is caused by some excessively long input, maybe that input should be validated close to where it comes into the program, and rejected. E.g. don't let the user input some 500 character field, pretend you're saving it and then have them find out the next day that only 255 of it got saved.

Even if in my program I find it useful to have a truncating copying function, I don't necessarily want it to be silent when truncation occurs. Maybe in that particular program, I want to abort the program with a diagnostic message. I can then pass large texts in the unit and integration tests, to find the places in the program that have inflexible text handling, but are being reached by unchecked large inputs.


Example:

  #include <stdio.h>
  #include <string.h>

  #define strlcpy(dst, src, size) ((size_t) snprintf(dst, size, "%s", src))

  size_t (strlcpy)(char *dst, const char *src, size_t size)
  {
    return strlcpy(dst, src, size);
  }

  int main(void)
  {
    char littlebuf[8];
    strlcpy(littlebuf, "Supercalifragilisticexpealidocious", sizeof littlebuf);
    return 0;
  }


  strlcpy.c: In function ‘main’:
  strlcpy.c:4:63: warning: ‘%s’ directive output truncated writing 34 bytes into a region of size 8 [-Wformat-truncation=]
   #define strlcpy(dst, src, size) ((size_t) snprintf(dst, size, "%s", src))
                                                               ^
  strlcpy.c:14:22:
     strlcpy(littlebuf, "Supercalifragilisticexpealidocious", sizeof littlebuf);
                      ~
  strlcpy.c:14:3: note: in expansion of macro ‘strlcpy’
     strlcpy(littlebuf, "Supercalifragilisticexpealidocious", sizeof littlebuf);
   ^~~~~~~
  strlcpy.c:4:34: note: ‘snprintf’ output 35 bytes into a destination of size 8
   #define strlcpy(dst, src, size) ((size_t) snprintf(dst, size, "%s", src))
                                   ~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  strlcpy.c:14:3: note: in expansion of macro ‘strlcpy’
     strlcpy(littlebuf, "Supercalifragilisticexpealidocious", sizeof littlebuf);
     ^~~~~~~
If glibc doesn't do something in the header file such that we get similar diagnostics for its strlcpy, we can make the argument that this is detrimental to the program.
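
For the abort-on-truncation behavior mentioned earlier in this comment, a minimal sketch could look like the following. The names copy_if_fits and copy_or_die are made up for illustration and are not part of glibc or any BSD libc.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: copies src into dst if it fits (including the NUL)
   and returns 0; returns -1 without touching dst otherwise. */
int copy_if_fits(char *dst, size_t dst_size, const char *src)
{
    assert(dst != NULL && src != NULL);
    size_t len = strlen(src);
    if (len >= dst_size)
        return -1;              /* would truncate: report instead of hiding it */
    memcpy(dst, src, len + 1);
    return 0;
}

/* Hypothetical policy wrapper: treat truncation as a fatal error. */
void copy_or_die(char *dst, size_t dst_size, const char *src)
{
    if (copy_if_fits(dst, dst_size, src) != 0) {
        fprintf(stderr, "copy_or_die: %zu-byte string does not fit in %zu bytes\n",
                strlen(src), dst_size);
        abort();
    }
}
```

Splitting the check from the policy lets one program abort in tests while another rejects the input with an error.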


There is a hierarchy of bugs involved here. Memory safety is a much more serious class of problem. Obstinately refusing to improve the status quo because it doesn't solve all problems is just plain bad engineering. Doubly so in this case where there exist the "n" variants of string functions that are massive foot guns.


Yes, he’s wrong. To apply your metaphor: improvements to the mess that is string handling in C are still an improvement, even if they don’t solve the underlying problem.


Well, the wider problem then is using C.


Pretty much all operating system APIs use C-style zero-terminated strings. So while C may be historically responsible for the problem, not using C doesn't help much if you need to talk to OS APIs.


> not using C doesn't help much if you need to talk to OS APIs

This means cdecl, stdcall or whatever modern ABIs OSes use, not C. Many languages and runtimes can call APIs and DLLs, though you may rightfully argue that their FFI or wrappers were likely compiled from C using the same ABI flags. But ABI is no magic, just a well-defined set of conventions.

And then, nothing prohibits using length-aware strings and either keeping a safety null at the end or copying to a null-terminated buffer only before a call. Most OS calls are I/O-bound and incomparably heavier anyway.
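
A minimal sketch of the "safety null" variant, assuming a hypothetical struct lstr that stores the length and always keeps one extra NUL byte after the data:

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical length-aware string. data always holds len bytes followed
   by one extra NUL, so data can be handed to APIs that expect a C string. */
struct lstr {
    size_t len;
    char  *data;
};

/* Builds an lstr from len bytes at src. Returns 0 on success, -1 on
   allocation failure. */
int lstr_from(struct lstr *s, const char *src, size_t len)
{
    assert(s != NULL && (src != NULL || len == 0));
    if (len == (size_t)-1)
        return -1;                /* len + 1 would overflow */
    s->data = malloc(len + 1);
    if (s->data == NULL)
        return -1;
    if (len > 0)
        memcpy(s->data, src, len);
    s->data[len] = '\0';          /* the "safety null" */
    s->len = len;
    return 0;
}

void lstr_free(struct lstr *s)
{
    free(s->data);
    s->data = NULL;
    s->len = 0;
}
```

Internal code would use s.len, and only the call site of something like fopen would look at s.data as a terminated string.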


The problem is, a null-terminated string is a very simple concept for an ABI. A string with a length count seems simple, but there is a big step up in complexity, and you can't just wistfully imagine effortlessly passing String objects around to your ABI.

For a start, String objects are going to be different everywhere. Even in C++, one library's String object isn't going to be binary compatible with another. How is the data laid out, does it do small string optimisation, etc? Are there other internal fields?

So you won't be passing objects around. At the ABI, you'll have to pass a pointer and a length. Calling an ABI will involve unwrapping and wrapping objects to pretend you are dealing with 'your' strings. Simple C-style ABIs make memory management straightforward (n.b. but error-prone, and certainly not easy). If this new style ABI returns a 'string' (pointer and length) of some sort, you have to package it up in your own object format, and manage the memory. Will you need an extra object type to represent 'string I got from an ABI, whose memory is managed differently'?

None of these are insurmountable, but they are a complexity that is rarely thought of when people declare 'C style ABIs are terrible!'


> For a start, String objects are going to be different everywhere. Even in C++, one library's String object isn't going to be binary compatible with another. How is the data laid out, does it do small string optimisation, etc? Are there other internal fields?

I don't really think anyone expects a C ABI to have multiple implementation defined string types. They want there to be a pointer + length string interface removing the use of null pointer style strings altogether.

> If this new style ABI returns a 'string' (pointer and length) of some sort, you have to package it up in your own object format

A C function with proper error handling (which is something you want for all your interface functions) normally looks something like this:

int name(T1 param_1, T2 param_2, ..., TN param_n, R1* return_1, R2* return_2, ..., RN* return_n);

Here the returned int is the error code, param_1 through param_n are the input parameters, and return_1 through return_n are the results of the function.

When writing these kinds of functions having an extra parameter for the size of the strings either for input or output is not a huge complexity increase.
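
As a concrete instance of that shape, here is a sketch with one string input and one string output, each passed as pointer plus length. The function str_upper and its error codes are invented for the example.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical error codes for the sketch. */
enum { STR_OK = 0, STR_ERR_TOO_SMALL = 1 };

/* Upper-cases in_len bytes from in into out (capacity out_cap).
   On success stores the number of bytes written in *out_len.
   No NUL terminator is read or written. */
int str_upper(const char *in, size_t in_len,
              char *out, size_t out_cap, size_t *out_len)
{
    assert(in != NULL && out != NULL && out_len != NULL);
    if (in_len > out_cap)
        return STR_ERR_TOO_SMALL;
    for (size_t i = 0; i < in_len; i++)
        out[i] = (char)toupper((unsigned char)in[i]);
    *out_len = in_len;
    return STR_OK;
}
```

Compared with a NUL-terminated version, the only additions are in_len, out_cap and out_len.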

> Will you need an extra object type to represent 'string I got from an ABI, whose memory is managed differently'?

Which memory management system you use does not impact if you use null terminated strings or a pointer + length pair. Both support stack, manual, managed or gc memory. It's just about the string representation.

For example:

I use a gc language.

I call a c library which returns a string that I get ownership of.

Now I want to leverage the gc to automatically free the string at some point. What I do is tell the gc how to free it, I have to do this no matter how the string is represented.

Or take the inverse.

I send in a string to the c library, which takes ownership of it.

Now the library must know how to free the memory. Typically this is done by allocating it with a library allocator (which can be malloc) before sending it to the function. Importantly the allocator is not the same as the one we use for everything else.

What I am getting at is that if you are not using the same memory system in the caller and the callee you have to marshal between them always. No matter if you are using null terminated strings or a pointer + length pair.


> pointer + length string interface

If it's a 32 bit length, that will be limiting for some 64 bit programs.

If it's a 64 bit length, it means tiny strings take up more space.

Hey, do both! Have the length be a "size_t" and then have a "compat_32" shim around every single system call that takes at least one string argument.

Wee!

Imagine a parallel world in which mainstream OS kernel developers had seen the light 30 years ago and used len + data for system calls. You'd now have to support ancient binary programs that are passing strings where the length is uint16. Oh right, I forgot! We can just screw programs that are more than five years old. All the cool users are on the latest version of everything.

> if you are not using the same memory system in the caller and the callee you have to marshal between them always. No matter if you are using null terminated strings or a pointer + length pair.

Null-terminated byte strings are always marshaled and ready to be sent literally anywhere. They have no byte order issues. No multi-byte length field whose size and endianness we have to know. If they are UTF-8, their character encoding is already marshaled also (that's the point of using UTF-8 everywhere).


>Null-terminated byte strings are always marshaled and ready to be sent literally anywhere. They have no byte order issues.

They have https://en.cppreference.com/w/c/string/wide


Why are you citing documentation about wide strings, in response to a comment about byte strings (that even mentions UTF-8)?


> don't really think anyone expects a C ABI to have multiple implementation defined string types. They want there to be a pointer + length string interface removing the use of null pointer style strings altogether

Not so simple.

32bit or 64bit length? Signed or unsigned? It doesn't make sense to have a signed length.

Zero length strings are easy, what about null strings? Are you going to design the pointer + length struct to be opaque so that callers can only ever use pointers to the struct? If you don't, you cannot represent a null string (IE a missing value) differently to an empty string.

How do callers free this string? You have to mandate that they use a special stringFree function, or rely on callers first freeing the pointer field and then freeing the struct.

Composite data types are a lot more work and are more error prone in C.
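
To illustrate the design questions above, here is one possible convention for a non-owning pointer + length value, sketched under the assumption that a NULL pointer encodes "missing" and a non-NULL pointer with length 0 encodes "empty". The type and function names are hypothetical, and other conventions are equally defensible.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical non-owning view. Convention assumed here:
   ptr == NULL means "no value"; ptr != NULL with len == 0 means "empty". */
struct str_view {
    const char *ptr;
    size_t      len;
};

struct str_view sv_missing(void)
{
    struct str_view v = { NULL, 0 };
    return v;
}

struct str_view sv_from_cstr(const char *s)
{
    struct str_view v = { s, s != NULL ? strlen(s) : 0 };
    return v;
}

bool sv_is_missing(struct str_view v) { return v.ptr == NULL; }
bool sv_is_empty(struct str_view v)   { return v.ptr != NULL && v.len == 0; }

/* Bounds-checked access; the length travels with the pointer. */
char sv_at(struct str_view v, size_t i)
{
    assert(i < v.len);
    return v.ptr[i];
}
```

Because the view does not own its bytes and is passed by value, there is nothing to free for the view itself. Ownership of the underlying buffer is a separate question that this sketch does not answer.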


We're very much in agreement.

The whole 'null pointer style strings' makes no sense, I think they want to say 'nul terminated'. But fine.

Your examples are excellent, let me add a few more:

Big endian? Little endian? Do we count characters or bytes? Who owns the bloody thing? Can they be modified in place? Are they in ROM or RAM? Automatic? Static? Can they be transmitted over a network 'as is' or do they need to be sent via some serialization mechanism? What about storing them on disk? And can they then be retrieved on different architectures?

The problem really is that C more or less requires you to really know what you're doing with your data and that's impossible in a networked world because your toy library ends up integrated into something else and then that something else gets connected to the internet and suddenly all those negative test cases that you never thought of are potential security issues. So any simplistic view of string handling will end up with a broken implementation regardless of how well it worked in its initial target environment.

C's solution is simple: take the simplest possible representation and use that, pass responsibility back to the programmer for dealing with all of the edge cases. The problem is that nobody does and even those that try tend to get it subtly wrong several times across a codebase of any magnitude.

It's a nasty little problem and it will result in security issues for decades to come. There are plenty of managed languages. I had some hope (as a seasoned C programmer) that instead of this Cambrian explosion of programming languages we'd have some kind of convergence, so that it becomes easier, not harder, to pick a winner and establish some best practices. But it seems as though cooperation is rare; much more common is the mode where a defect in one language or ecosystem results in a completely new language that solves that one problem in some way (sometimes quite convoluted) at the expense of introducing a whole raft of new problems, besides the fragmenting of mindshare.


It's not a hypothesis, the thing was already implemented many times in C, C++ and other languages and used for ages, especially for networked code, because C's "there's no length" approach is a guaranteed vulnerability.


It's not a guaranteed vulnerability, it's a potential vulnerability.

Guaranteed doesn't mean "this will probably happen", it means "this will definitely happen".

The "no length approach" can probably result in a vulnerability. It won't definitely result in a vulnerability.

I mean, come on, if it was a guaranteed vulnerability, almost nothing on the internet would work because they all have, somewhere down the line, a dependency on a nul-terminated string.

I mean, do you think that nginx (https://github.com/nginx/nginx/blob/master/src/core/ngx_stri...) is getting exploited millions of times per hour because they have a few uses for nul-terminated strings?


nginx whacks one mole at a time https://cve.circl.lu/cve/CVE-2013-2028


That CVE has absolutely nothing to do with length up front vs nul terminated strings. It's also ten years old. The only thing it does is reference nginx, but that's disingenuous, unless the point you're trying to make is that nginx has the occasional security issue, which I think we're all very much aware of. But it doesn't answer the GP's point in any relevant way.


The problem there is in opportunistic bound checking due to loose association of an array with length, string being an example of an array. This vulnerability is a direct consequence of C's "there's no length" approach and shows why this approach is unsuitable for networked code.


In C a string is not an example of an array. If we can't agree on terminology for a discussion that requires extreme precision it becomes difficult to keep going.

Networked code does not as a rule use C style nul terminated strings though, in the case of fixed length buffers they will usually be accompanied either by a length field or by zeroing out the end of the string or even the whole buffer (the latter is much better and ensures you don't accidentally leak data from one session to another).

Networked code doesn't have to be written in C to begin with. Regardless of implementation there usually is a protocol spec and you adhere to that spec and if you don't then you'll find out the hard way why it matters.

This particular vulnerability has nothing at all to do with C strings but in fact has everything to do with a broken implementation of length based strings, which could result in the length being negative, which is at least one problem which C style strings do not have... (small comfort there, they have plenty of other problems, but that one they don't.).

This is the fix for that particular CVE:

https://github.com/nginx/nginx/commit/4997de8005630664ab35f2...

Which stems from integer overflow after doing arithmetic on the lengths.
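
The general class of bug described here, a signed length that ends up negative and is then used as an unsigned size, can be illustrated with a small sketch. This is not nginx code and not the actual fix; clamp_read_size is a hypothetical helper.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not taken from nginx: decides how many bytes to read
   into a buffer of buf_cap bytes when the peer announced `remaining` bytes
   in a signed 64-bit field. Returns 0 and stores the count in *out, or -1
   if the announced length is negative. */
int clamp_read_size(int64_t remaining, size_t buf_cap, size_t *out)
{
    assert(out != NULL);
    if (remaining < 0)
        return -1;      /* a bare (size_t) cast would turn this into a huge count */
    if ((uint64_t)remaining > (uint64_t)buf_cap)
        *out = buf_cap;
    else
        *out = (size_t)remaining;
    return 0;
}
```

Without the sign check, a value of -1 converted to size_t would pass for an enormous length, which is a failure mode specific to length-carrying representations.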

It looks to me as though you just pulled the first nginx CVE that you found and posted it without looking at what the CVE was all about, without realizing that the ancestor comment was referring to the string implementation inside nginx which lives in the referenced file, whereas you are pointing to a CVE related to the parsing of HTTP chunked data requests, which resides in an entirely different file and has nothing to do with string handling to begin with.


And what do you propose? To let only 1.5 good C programmers in the world write code like in 70s?


> And what do you propose?

That you get your terminology right, back up your claims with links that actually make sense and try to understand that the software world is complex and that incremental approaches make more sense than demanding unrealistic / uneconomical changes because they are not going to happen.

> To let only 1.5 good C programmers in the world write code like in 70s?

No, I did not propose that, you just did and clearly that's nonsense aka a strawman even if you didn't bother throwing it down.

C is here. It will be here decades from now. Rewriting everything is not going to happen, at least, not in the short term. C will likely still be here (and new C code will likely still be written) in 2100, and possibly long after that. This isn't ideal and it's not going to help that we can not make a clean break with the past even though we are trying.

The solution will come in many small pieces rather than as one silver bullet to cure it all, and TFA announces two such small pieces; as such it is a small step in a very, very long game. The adoption of Rust and other safer languages (not inherently safe but safer, there are still plenty of footguns left) may well in the longer run give us a chance to do away with the last of the heritage from the C era. But there is a fair chance that it won't happen and that Rust's rate of adoption will be too low to solve this problem in a timely manner.

The same goes for every other managed language, they are partial solutions at best. This isn't good news and it isn't optimal, but it is the reality as far as I can determine. If you're going to do a new greenfield development I hope that you will find yourself on a platform where you won't have to use C and that you have skills and resources at your disposal that will allow you to side-step those problems entirely. But that won't do anything for the untold LOC already out there in production and that utterly dwarfs any concern I have about future development, it's the mess we made in the past that we have to deal with and we have to try hard to avoid making new messes.

Think of it as fixing a large toxic waste spill.


It's not a hypothesis: the change has happened several times and is used in networking code, in putty and s2n in C, in grpc in C++, and I guess in all C++ code that uses string_view and span. It happens more easily in C++ because of its additional language features.

>Rewriting everything is not going to happen, at least, not in the short term.

If you can't do a big task in one go, split it into smaller tasks and do them in sequence.


I'm sorry, I apparently lack the vocabulary or clarity of expression to get my points across to you so I'm bowing out here.


Which C compilers are those then?

Also, you keep writing 'null pointer' and 'null', there is a pretty big difference between 'null' and 'nul' and in the context of talking about language implementation details such little things matter a lot. You say a lot of stuff with great authority that simply doesn't match my experience (as a C programmer of many decades) and while I'm all open to being convinced otherwise you will have to show some references and examples.


What doesn't match your experience?


My experience as a programmer of some 40 years in C has yet to expose me to a C compiler that has length based rather than nul terminated strings as the base string type. Please point me to one in somewhat widespread use rather than an experimental implementation that uses this concept and make sure not to confuse libraries with the implementation of the language.


Since no C/C++ compiler supports it, for them the implementation is in a library.


So that means they are not part of C/C++. Which was the point. You can write software in C/C++ but that's hardly news and you can use that to create new data types that are not in the language, which also is hardly news.


People suggesting it are concerned about security, they don't intend it to be a novel invention. Bound checking predates C.


Yes it does. But that doesn't mean that you get to state a lot of stuff with certainty that upon inspection turns out to simply not be true. C programmers are - in spite of what you appear to think - also concerned about security. And whether bounds checking predates C or not has nothing to do with how this is implemented, in a library or in the compiler itself (or even in the hardware).

If you reference C you are talking about the compiler, that, and only that is the language implementation. In C that specification is so tiny that a lot of the functionality that you might expect to be present in the language is actually library stuff. K&R does a poor job for novices to split out what is the language proper and what is the library, but a good hint is that anything that requires an include file isn't part of the language itself.

The original comment to which you responded talked about the ABI, the layer between the applications and the operating system, presumably the UNIX/POSIX ABI, which is more or less cast in concrete by now and unlikely to be replaced because if you do so you introduce a breaking change: all compiled applications using that ABI will no longer work. Some versions of UNIX will occasionally do this and this is widely regarded as a great way to limit your adoption. So the problem, in a nutshell is: how do we repair the security situation that has emerged as the result of many years of bad practices in such a way that our systems continue to work without having to re-invest the untold trillions of $ that have been spent on software that we use every day. This is a hard problem. TFA is a small, and incremental step in trying to solve that problem.

Others are more pessimistic, believe that we should just take our lumps and get on with that rewrite, usually in whatever is their favorite managed (or unmanaged, in some cases) language. Yet others pursue compiler based or hardware based solutions which all introduce different degrees of incompatibility.

I'm somewhat bearish on seeing this problem resolved in my lifetime. At the same time I applaud every little step in the right direction. And I personally do not believe that replacing C's 'string type' (which it really doesn't have other than nul terminated string literals) is the way to go due to the reasons outlined above. But an incremental approach allows for fixing some known issues and allows us to back away from historical mistakes in a way that we can afford the cost and to do so without incurring the penalty of a complete rewrite (which usually comes with a whole raft of new bugs as well). So small improvements that do not address each and every grievance should be welcomed. Even if they no doubt introduce new problems at least the scope is such that you can - hopefully - deal with those without introducing new security issues.


Putty and s2n are examples of how this problem is solved; they work on POSIX, e.g. Linux: just compile them with gcc and they work.


>32bit or 64bit length? Signed or unsigned? It doesn't make sense to have a signed length.

32 bit should be enough for everyone, it's easier to type as int, and you have fewer problems with variable-sized integers on different targets. Signed length makes sense because length is a number, and numbers are signed; also, in conjunction with arrays, a -1 sentinel value is often used.

>If you don't, you cannot represent a null string (IE a missing value) differently to an empty string.

C++ can't do it either with std::string and the sky doesn't fall, because such a distinction is rarely needed, and for business logic an empty string means absence of value. In languages with nullable strings, null string and empty string are routinely synonymous, and you often use a method like IsNullOrEmpty to check for absence of value. Anyway, you need the concept of absence for other types too, like int, so string isn't special here.

>You have to mandate that they use a special stringFree function, or rely on callers first freeing the pointer field and then freeing the struct.

A pointer+length struct is a value type, see https://en.cppreference.com/w/cpp/container/span


> C++ can't do it either with std::string and the sky doesn't fall, because such a distinction is rarely needed, and for business logic an empty string means absence of value.

Incorrect. I'm literally, today, working on a project where the business logic is different depending on whether an empty string is stored in the database, or no string.

"User didn't get to fill in a preference" is very different from "user didn't indicate a preference".

In more practical terms, a missing value could mean that we use the default while an empty value could mean that we don't use it at all.


For the user, an empty text field means absence of value. Indeed, on rare occasions a situation arises that needs optional values, but that's not only true for strings; other types like int may need it too.


The end user representation of a programming construct versus the implementation details surrounding such constructs give rise to what is called a 'leaky abstraction', in this case that 'absence of value' is something entirely different than 'empty string'.

We have a way of representing absence of value for some data types but not for others, again because of implementation details. This sort of leaky abstraction often gives options for creativity but it can also lead to trouble and bugs. Some languages offer such 'optional' behavior to more datatypes and make it a part of function calling conventions, either by supplying a default or by leaving the optional parameters set to the equivalent of 'empty' or even 'undefined' if that is possible.


Pretty much all string implementations have the ability to give you a pointer and a length which you can then pass on to the foreign interface. Essentially, the API always takes a non-owning string view. C strings on the other hand require you to store that terminating NUL next to the string. This is only bearable because most string implementations are designed to deal with it, since C APIs are so popular.

For returning strings, ownership is a bigger problem than the exact representation. OS APIs typically make you provide a buffer and then fail if it was not big enough.
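
POSIX getcwd is one example of that provide-a-buffer pattern: it fails with ERANGE when the buffer is too small, and the caller retries with a bigger one. A minimal sketch follows; the wrapper name cwd_dup and the doubling strategy are choices made for this example.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <unistd.h>

/* Returns a malloc'd copy of the current working directory, or NULL on
   error. The caller frees the result. Starts at initial_cap bytes and
   doubles the buffer each time getcwd reports ERANGE. */
char *cwd_dup(size_t initial_cap)
{
    assert(initial_cap > 0);
    size_t cap = initial_cap;
    for (;;) {
        char *buf = malloc(cap);
        if (buf == NULL)
            return NULL;
        if (getcwd(buf, cap) != NULL)
            return buf;           /* it fit */
        int err = errno;          /* save before free() can disturb it */
        free(buf);
        if (err != ERANGE)
            return NULL;          /* some other failure */
        if (cap > (size_t)-1 / 2)
            return NULL;          /* doubling would overflow */
        cap *= 2;
    }
}
```

The caller owns both the buffer and the retry loop, so the OS never has to allocate on the caller's behalf.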


>Simple C-style ABIs make memory management straightforward (n.b. but error-prone, and certainly not easy).

The idea is to use C-style memory management: you provide a buffer, and the string is copied into it. For an example of returning a string this way, see the getenv_r function: https://man.netbsd.org/getenv.3

In C++ it's more similar to std::span.


> you can't just wistfully imagine effortlessly passing String objects around

To clarify, I didn’t mean it. No new style API/ABI. Only unboxing a string into (str, len) in/out-params and boxing it back from returns.


Lots of C programs define a more substantial string type for themselves (e.g. dynamic, reference-counted strings or what have you), used only internally. Time-honored tradition.


You do as Windows does and define safe strings for the ABI, as is done for the COM API, nowadays the main kind of Windows API.


I suspect null terminated strings predate C; C is just one of many languages that can use them.


The PDP-10 and PDP-11 assemblers had direct support for nul-terminated strings (ASCIZ directives, and OUTSTR in MACRO10) which Ritchie adopted as-is, not unlike Lisp’s CAR/CDR. It’s not entirely clear that other “high-level” languages at the time also used such a type.

Later ISAs added support for it for C compatibility, whereas older ISAs tended to support only fixed-length or length-prefixed strings. For instance, the Z80 has LDIR, which is essentially a memcpy; copying a terminated string required a manual loop.


All non-dynamic string representations give rise to the situations where programmers need to combine strings that don't fit into the destination.

Whether null-terminated or not, dynamic strings solve the problem of being able to add two strings together without worrying whether the destination buffer is large enough (trading that problem for DoS concerns when a malicious agent may feed a huge input to the program).


Nothing prevents those operating systems from offering custom string types.


In reality, a ton of stuff does. As an example: What do you do if someone calls your new string+length API with an embedded \0 character? Your internal functions are all still written in C and using char* so they will silently truncate the string. So you need to check and reject that. Except you forgot there are also APIs (like the extended attrs APIs) that do accept embedded \0. The exceptions are all over the place, in ioctl calls passed to weird device drivers etc.
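
The check-and-reject step mentioned above is small in itself. A sketch with a hypothetical helper name:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical boundary check: true if the len bytes at s contain a NUL,
   meaning code that later treats s as a C string would see a shorter string. */
bool has_embedded_nul(const char *s, size_t len)
{
    assert(s != NULL || len == 0);
    return len > 0 && memchr(s, '\0', len) != NULL;
}
```

The hard part is knowing which entry points must apply it and which, like the extended attribute calls, must not.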


Windows internally uses a string+length struct; the null terminated string API is just a compatibility interface on top of it.


*new operating systems

You can't change the string type without breaking all apps and services.


Even on a new OS it's going to be a compatibility problem. Implementing even partial POSIX compatibility makes porting stuff easier, but changing how strings work is going to make it significantly harder.


As a user posting from a Linux machine, I disagree. Though it seems the "don't use C" crowd often delegate the important decisions to somewhere else.

I guess the answer is "some people's C is good enough, but not yours"


If the problem is "you're using nul-terminated strings" as the GP said, then "don't use C" is a good step towards fixing that problem, no?


Perhaps, but it is also realistic to accept that you're using code that other people write or have written in C, and that the same logic would apply to them.


You only have to care about it at boundaries though, for the most part. Like, when calling a C API. That's easy to handle. Even C++'s std::string can do that, as the c_str method always returns a null-terminated string. That inherently kills the need for things like strcat.


The return from c_str cannot be used everywhere you would normally use a null terminated string, because the return is const.

For example, you couldn't pass it to strtok, or any other function that needs to even temporarily modify the string.


strtok is an abomination. The only reason it needs to modify the input string in the first place is to support zero-terminated output strings without having to make copies.
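
For contrast, a tokenizer that returns pointer + length tokens does not need to write terminators into the input. A minimal sketch using only strspn and strcspn; the struct token and next_token names are invented here, and the input is still assumed to be NUL-terminated.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical non-owning token: len bytes starting at ptr, not NUL-terminated. */
struct token {
    const char *ptr;
    size_t      len;
};

/* Finds the next token in *cursor, skipping leading delimiters, and advances
   *cursor past it. Returns 1 and fills *tok, or 0 when no tokens remain.
   The input string is never modified. */
int next_token(const char **cursor, const char *delims, struct token *tok)
{
    assert(cursor != NULL && *cursor != NULL && delims != NULL && tok != NULL);
    const char *p = *cursor + strspn(*cursor, delims);  /* skip delimiters */
    if (*p == '\0') {
        *cursor = p;
        return 0;
    }
    tok->ptr = p;
    tok->len = strcspn(p, delims);                      /* token length */
    *cursor = p + tok->len;
    return 1;
}
```

Since nothing is written, this works on string literals and on the const pointer returned by c_str, and it keeps no hidden state between calls.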


While this is true, passing a string to a C function that is manipulating the string would defeat the point of not using C string manipulation.


You may not know the function is doing C string manipulation, since const correctness in APIs is not a 100% thing.


If it's just incidental mutation that is a concern, rather than intentionally mutating C strings, no problem: it is common-place to defensively clone strings and other memory when passing them to untrusted interfaces. In fact, if this is your fear, you have literally no alternative but to do so, even when programming directly in C.

Then again, if there's no contract for who owns or mutates a given piece of memory, there's no safe way to use said API from any language or environment and you should probably stop using it. Failing that, you'd just have to check the source code and find out what it actually does and hope that it does not change later.

(Of course, this has no bearing on whether or not you should use C strings or C string manipulation: You shouldn't, even if you're touching unsafe APIs. It's extremely error prone at best, and also pretty inefficient in a lot of cases.)
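
A rough sketch of the defensive cloning described above, in plain C. `call_with_copy` and `shout` are hypothetical names; `shout` just stands in for any API that mutates its argument without saying so.

```c
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/* Stand-in for an API that modifies its argument in place (upper-cases it). */
static void shout(char *s)
{
    for (; *s != '\0'; s++)
        *s = (char)toupper((unsigned char)*s);
}

/*
 * Hypothetical wrapper: hand the callee a private, mutable copy of `s`, so
 * whatever it does to the buffer cannot affect the caller's string.
 * Returns 0 on success, -1 if the copy could not be allocated.
 */
static int call_with_copy(const char *s, void (*callee)(char *))
{
    size_t n = strlen(s) + 1;   /* include the terminating NUL */
    char *copy = malloc(n);
    if (copy == NULL)
        return -1;
    memcpy(copy, s, n);
    callee(copy);               /* may scribble on the copy freely */
    free(copy);
    return 0;
}
```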


@jchw I don't see anything you write as disagreeable. But clearly you have a strong handle on what needs to be taken care of.


Turtles all the way down isn't it? At some point, someone has to take responsibility.


Let me reframe this. What we're saying to do is stop using C string manipulation such as strcat, strcpy, etc. Particularly, I'm saying simply don't use C-style null terminated strings until you actually go to call a C ABI interface where it is necessary.

The argument against this is that you might call something that already internally does this, to your inputs directly, without making a copy. Yes, sure, that IS true, but what this betrays is the fact that you have to deal with that regardless of whether or not you add additional error-prone C string manipulation code on top of having to worry about memory ownership, mutation, etc. when passing blobs of memory to "untrusted" APIs.

It's not about passing the buck. Passing a blob of memory to an API that might do horrible things not defined by an API contract is not safe if you do strcat to construct the string or you clone it out of an std::string or you marshal it from Go or Rust. All this is about, is simply not creating a bigger mess than you already have.

Okay fine, but what if someone hates C++ and Rust and Go and Zig? No problem. There are a slew of options for C that can all handle safer, less error-prone string manipulation, including interoperability with null-terminated C strings. Like this one used in Redis:

https://github.com/antirez/sds

And on top of everything else, it's quite ergonomic, so it seems silly to not consider it.

This entire line of thinking deeply reminds me of Technology Connections' video The LED Traffic Light and the Danger of "But Sometimes!".

https://youtube.com/watch?v=GiYO1TObNz8

I think hypothetically you can construct some scenarios where not using C strings for string manipulation requires more care, but justifying error prone C string manipulation with "well, I might call something that might do something unreasonable" as if that isn't still your problem regardless of how you get there makes zero sense to me.

And besides, these hypothetical incorrect APIs would crash horrifically on the DS9K anyways.


This thread reminds me of the essay, "Some were meant for C"

https://www.humprog.org/~stephen/research/papers/kell17some-...


C the needle contaminated now often with deadly RCE virus. Historically it was used to inject life into the first bytes of the twisted self perpetuating bootstrapping chain of an eco system dominating today the planet and the space around it.


All processors are C VMs at the end of the day. They are designed for it, and it's a great language to access raw hardware and raw hardware performance.

I still fail to label C as evil.

P.S.: Don't start with all the memory management and related stuff. We have solutions for these everywhere, incl. but not limited to GCs, Rust, etc. Their existence does not invalidate C, and we don't need to abandon it. Horses for courses.


> All processors are C VMs at the end of the day.

That would be a poor argument back in the 80s; and is increasingly wrong for modern processors. Compiler intrinsics can paper-over some of the conceptual gap, but dropping down to inline assembly can't be entirely eliminated (even if it's relegated to core libraries). Lots of C code relies on certain patterns compiling down to specific instructions, e.g. for vectorising; since C itself has no concept of such things. C is based around a 1D memory model which has no concept of cache hierarchies. C has no representation of branch prediction, out-of-order instructions, or pipelines; let alone hyperthreading or multi-core programming.

After all, if processors were "C VMs", then GCC/LLVM/etc. wouldn't be such herculean feats of engineering!


This is a subject I love to discuss.

Exactly. C is based around 1D memory, has no understanding of caches. All of your other arguments are true, too.

This is why most of these things (caches, memory hierarchies and other modern features) are hidden from C (and other languages, or software in general) itself: to trick C and make it think it's still running on a PDP-11.

All caches (L1, L2, L3, even disk caches, and various caches built in RAM) are handled by hardware or OS kernels themselves. Unless they provide an API to talk with, they are invisible, untouchable, and unmanageable, and this is by design (esp. the ones baked into hardware like Lx and other buffers).

All the compilers are the interface perpetuating this smoke and mirrors to not upset C about its assumptions about the machine underlying itself. Even then, a compiler can only command the processor up to a certain point. You can't say that I want these in caches, and evict these. These are automagic processes.

Exactly because of these reasons, CPUs are C VMs. They work completely differently from a PDP-11, but behave like one at the uppermost level, where compilers are the topmost layer in this toolchain.

Compilers are such herculean feats of engineering because we need to trick the programs we're building into thinking they're running on much simpler hardware. In turn, hardware tries hard to keep this management overhead handled by compilers at a bare minimum while allowing higher and higher performance.

More ponderings, and foundation of my assertion is here: https://dl.acm.org/doi/10.1145/3212477.3212479

Paper is titled: C Is Not a Low-level Language: Your computer is not a fast PDP-11.


Caches, memory hierarchies, out-of-order execution, etc. are hidden from assembly as well as C. One reason for this, which isn't mentioned in your comment (or the ACM article), is not that everyone loves C but that most software has to run on a variety of hardware, with differing cache sizes, power consumption targets, etc. Pushing all of that optimization and fine tuning off to the hardware means that software isn't forced to only work on the exact computer model it was designed to run on.

The author also mentions that alternative computation models would make parallel programming easier, but this neglects the numerous problems that aren't parallelizable. There's a reason why we haven't switched all of our computation to GPUs.


I don't think I agree completely to your sentiment. Because while we want to make software run everywhere (at least in the X86 family regardless of the feature sets we have), we want to make sure that our software performs well, too. This is esp. important in areas where we (ab)use the hardware to the highest level (games, science, rendering, etc.)

To enable these performance optimizations, we taught our compilers tons of tricks, like the -march and -mtune flags. Also, we allow our compilers to generate reckless code with flags like -ffast-math, or add tons of assembly or vectorization hints into libraries like Eigen.

We write benchmarks like STREAM, or other tools which measure core to core latency, or measure execution code with different data lengths to detect cache sizes, associativity, and whatnot. Then use this information to optimize our code or compiler flags to maximize the software's speed at hand.

If caches and other parts of the system were available to assembly, we would ask the processor for their properties, optimize directly according to their merits, and even do some data allocation tricks or prefetching without guesswork (which some architectures support via programmable external prefetching engines), instead of tuning in the dark via half-informative data sheets, undisclosed AVX frequency behaviors, or techniques like running perf and looking at cache thrash percentages, IPC, and other numbers to make educated guesses about how a processor behaves.

Yes, not all stuff can be run in parallel, and I don't want to move all computation to GPUs with FP16 half-precision math, but we can at least agree that these systems are designed to look like PDP-11s from a distance, and our compilers are the topmost layer of this "emulation" while doing all kinds of tricks. Trying to push this performance in an opaque way is why we have Spectre and Meltdown, for example, where these abstractions and mirrors break down.

If our hardware were more transparent to us, we could arguably optimize our code selectively a bit more easily, if it had switches labeled "Auto/I know what I'm doing" for certain features.

Intel tried to take this to max (do all optimization with the compiler) with Itanium. The architecture was so dense, it failed to float, it seems.


This is backwards. C was conceived as a way to do the things programmers were already doing in assembler, but with high(er) level language conveniences. In turn, the things they were doing in assembler were done to efficiently use the "VM" their code was executed on.


I have linked a paper published in ACM Queue in another comment of mine, which discusses this in depth.

The gist is, hardware and compilers are hiding all the complexity from C and other programming languages while trying to increase performance, IOW, emulating a PDP-11 while not being a PDP-11.

This is why C and its descendants are so long-lived and perform very well on these systems despite the traditional memory models and simple system models they employ.

IOW, modern hardware and development tooling create an environment akin to a PDP-11, not unlike how VMs emulate other hardware to make other OSes happy.

So, at the end of the day, processors are C VMs, anyway.


What a crazy metaphor! You're equating using zero terminated strings in C to doing drugs.


What's up with people seeing an analogy and going "you can't equate those two things"? Analogies aren't equating things


Analogies are great since they talk about how things are the same, and just as terrible because they talk about things that are different.

But seriously it’s sometimes hard to slice out what level of similarity is implied. Obvious things are somewhat less obvious to others sometimes


I feel like the success rate of getting someone off of null terminated strings is probably lower than most rehabilitation programs.


We can't entirely because of the C ABI but apart from that it's as simple as not using C which is not too difficult. C is not a popular language these days.


“Apart from that” does a lot of work here: FFI layers generally talk nul-terminated string unless otherwise specified, so do syscalls.


Yes that's what I said. You can generally wrap those layers so you aren't actually manipulating null terminated strings; just converting to/from them which is not too bad.


I don't know what you're relying on for the idea that C is not a popular language, but it is extremely popular.


Well, you will need to give up SQLite if you really feel this way, and reimplement it in a safe language.

It will also be some time before Rust has substantial penetration into Linux; you might need to find a kernel that implements the POSIX interfaces safely.

These will not be easy problems to solve.


Yeah, no…


I mean it’s a wash, on the one hand zero terminated strings have done untold amounts of damage[0] and are impossible to extirpate once they’re in, on the other hand the nazis were methed (and coked later on) up their eyeballs.

[0] and not just in C itself, unexpected truncation through FFI is an issue which regularly pops up


ODBC defines a multilingual interface that can accept both null-terminated and length-bounded strings by using the SQL_NTS sentinel value as the length of a null-terminated string.


-edit- I'm not a C programmer, nor do I have any opinion on whether the API is garbage or less worse or whatever.

They seemed useful enough to get added to the other BSDs, Solaris, Mac OS X, Irix(!), QNX, and Cygwin as well as used in the Linux kernel.


Distributing clean needles is useful, yes, but you should still lament why it is necessary.


The Linux kernel has better options, notably strscpy.


IMHO it's pretty simple: Strings in C are 0-terminated char arrays. If the char array is not 0-terminated, it's not a string.

strncpy() can make a string into a non-string (depending on size), which is clearly bad.


That’s because strncpy does not return a nul-terminated (“C”) string, but a fixed-size nul-padded string.

That, as it turns out, is the case of most (but not all) strn* functions.

Of course strncpy adds the injury that it’s specified to allow nul-terminated inputs (it stops at the first nul byte, before filling the rest of the target buffer with nuls).


It also, in some situations, returns a "string" that doesn't have the null terminator, which means it is giving the caller something that literally isn't a string.


It always “returns” the same thing: a fixed size nul-padded buffer. Call it a char array if you want, that’s always been its role and contract.
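
Both behaviours are easy to observe. A small sketch (the helper names are invented for the example) that shows the NUL padding for a short source and the missing terminator for a long one:

```c
#include <string.h>

/* Returns 1 if buf[0..size) contains a NUL byte, i.e. it holds a C string. */
static int is_terminated(const char *buf, size_t size)
{
    return memchr(buf, '\0', size) != NULL;
}

/* Source shorter than the field: strncpy copies it and pads the rest with NULs. */
static int short_source_is_padded(void)
{
    char field[8];
    memset(field, 'x', sizeof field);
    strncpy(field, "abc", sizeof field);
    return memcmp(field, "abc\0\0\0\0\0", sizeof field) == 0;
}

/* Source at least as long as the field: the field is filled, with no NUL at all. */
static int long_source_is_unterminated(void)
{
    char field[8];
    strncpy(field, "abcdefghij", sizeof field);
    return !is_terminated(field, sizeof field);
}
```

Compilers may warn about the second call (GCC has a string-truncation warning for this pattern), which is rather the point.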


> Strings in C are 0-terminated char arrays

To be pedantic, they're pointers to char. Nothing more. Calling them array confuses non-C coders. The length is just an unenforced contract and has to be passed.


It's a pointer to a chunk of memory which contains an array of characters. You pass around the pointer because copying an array is expensive and wasteful.

I think (or hope) the concepts are pretty clear if you understand what a pointer is.


strncpy was a bad mistake. If you know the length and there's no null termination, you use memcpy instead.
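
As a sketch of what the memcpy approach looks like when the lengths are already known (`concat_n` is a hypothetical helper, not a standard function):

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper: join a[0..alen) and b[0..blen) into a freshly
 * allocated, NUL-terminated string. Neither input needs a terminator.
 * Returns NULL on overflow or allocation failure; the caller frees the result.
 */
static char *concat_n(const char *a, size_t alen, const char *b, size_t blen)
{
    if (blen >= SIZE_MAX - alen)        /* alen + blen + 1 would overflow */
        return NULL;
    char *out = malloc(alen + blen + 1);
    if (out == NULL)
        return NULL;
    memcpy(out, a, alen);
    memcpy(out + alen, b, blen);
    out[alen + blen] = '\0';
    return out;
}
```

No byte of either input is scanned for a terminator; every length comes from the caller.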


strncpy isn't good either. But using length delimited strings is the best way to generate fixed length char strings and NUL terminated strings.


I'm surprised they didn't go with strscpy() directly

https://archive.kernel.org/oldlinux/htmldocs/kernel-api/API-...


Because strlcpy has existed in BSD since 1999: https://man.netbsd.org/strlcpy.3


HN discussion around this quote, around 12 years ago: https://news.ycombinator.com/item?id=2378013


> Correct string handling means that you always know how long your strings are

Well, I couldn't think of a stronger argument against NULL terminated strings than this. After all, NULL terminated strings make no guarantee about having a finite length. Nothing prevents you from building a memory mapped string that is being generated on demand that never ends.


Except that's a non-sequitur because you can totally keep separate string length fields.

The only NUL that C requires is the NUL following C string literals, and you can even easily define char-arrays without NUL.

    char buf[5] = "Hello";
or even

    #define DEFINE_NONZ_STRING(name, lit) char name[sizeof lit - 1] = lit "";
Can also easily build pointer + length representations, without even a runtime strlen() cost.

    typedef struct String { const char *buf; int len; } String;
    #define STRING(lit) ((String) { (lit ""), sizeof lit - 1 })
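
A compilable sketch of how these pieces might be used together. It assumes the struct members are separated by semicolons and that `String` is a typedef; `string_equals` and `greeting` are names invented for the example.

```c
#include <stddef.h>
#include <string.h>

typedef struct String { const char *buf; int len; } String;

/* Pointer + length built from a literal; sizeof gives the length at compile time. */
#define STRING(lit) ((String) { (lit ""), sizeof lit - 1 })

/* Char array sized to the literal without its terminating NUL. */
#define DEFINE_NONZ_STRING(name, lit) char name[sizeof lit - 1] = lit ""

DEFINE_NONZ_STRING(greeting, "Hello");   /* 5 bytes, no NUL stored */

/* Compare two pointer + length strings without relying on a terminator. */
static int string_equals(String a, String b)
{
    return a.len == b.len && memcmp(a.buf, b.buf, (size_t)a.len) == 0;
}
```

Note that initializing a char array from a literal that is exactly one byte too long is valid C but not valid C++.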


What do you do when the strings might have more than INT_MAX characters?


What will you do on your 200th birthday?

In case you're more interested in theory than practice, I have a different answer: I use a different API.

However, I'm aware not even that could stop you, because you could still ask "what do you do when the strings might have more than SIZE_MAX characters?", which is entirely possible (as a combination of 2 or more strings).

And to answer that, we're coming back to my original answer: It doesn't happen. I'm not calling the API with such huge strings. (And no, I usually don't keep formal proofs that it couldn't happen -- there are also an infinite number of other properties that I don't verify).


INT_MAX is often far less than SIZE_MAX (the former is usually the max of a signed 32-bit integer, the latter of an unsigned 64-bit integer), so usually nothing special.


SIZE_MAX is the largest possible value of type size_t. size_t is defined as an unsigned type that is big enough to represent the size of the largest possible object (which basically means the size of the virtual address space i.e. 2^32 on a 32-bit system and usually 2^48 on a 64-bit system, which is being addressed with an uint64_t).

None of that is relevant since you're extremely unlikely to hit either limit by accident. If you really want, you can hit 32-bit limit if you're doing things that snprintf really shouldn't be used for, and likewise you can hit size_t limit if you're on a 32-bit system and joining multiple large strings.


Yes, my point is just that since all the "strn" C string-handling functions in the standard library use a size_t for the size if you've got more than INT_MAX characters there's not necessarily any problem. INT_MAX is pretty much always going to be lower than SIZE_MAX, even on 32-bit systems since the former is signed and the latter isn't. You just call snprintf or whatnot as usual. If you manage to have more than SIZE_MAX characters, then you have a problem. Libc probably can't solve it for you though, since SIZE_MAX has to be large enough to cover any allocation so you have some sort of segmented architecture that the C standard library isn't expecting.


If that is ever a possible issue, you switch the implementation to use two pointers.

    struct String { const char *buf; const char *buf_end; };


Actually I was answering this question wrong because I somehow understood it in the context of snprintf() returning int, and I should have just replied "you can switch to size_t if you like". A start + end pointer pair is certainly not necessary; not sure why one would ever do this. It's more inviting of bugs compared to start pointer + length.


It is how many languages implement strings without being bound by numeric limits.

Naturally for this to work out without bugs, it cannot be exposed directly, only manipulated via a string library.


size_t is large enough to hold the size of the largest possible object in memory. In practice, on most architectures, that means it is the same size as pointers. I'm not sure if there is a case where start + end pointer can describe a valid string in memory that start pointer + size couldn't? If that was the case, that string wouldn't be an "object" by definition.


Yep, and what if I want to make an arbitrarily large array without much copying?


What's the point of the empty string literal "" ?


It's a poor man's assertion that the "lit" is indeed a string literal (such that we can get the string length using sizeof) and not a variable (of pointer type, where sizeof returns the size of the pointer not the string-buffer). If you pass a variable instead of a string literal, that will be a syntax error.
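
A small sketch of the difference being guarded against. `LIT_LEN` is a made-up macro name; the rest is standard C.

```c
#include <stddef.h>

/* Only compiles when `lit` is a string literal, because of the adjacent "". */
#define LIT_LEN(lit) (sizeof(lit "") - 1)

/* sizeof on a literal sees the whole array: 5 characters plus the NUL. */
static size_t literal_len(void)
{
    return LIT_LEN("Hello");
}

/* sizeof on a pointer variable yields the pointer size, not the string length. */
static size_t pointer_size(void)
{
    const char *p = "Hello";
    return sizeof p;
}

/* LIT_LEN(p) with a variable p would expand to sizeof(p "") - 1, a syntax error. */
```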


Or more likely strncpy plus forced last NUL. Return a flag on truncation unlike messing with return code or errno.

Call it safe_strncpy and be done with it. Otherwise asprintf and snprintf exist. strlcpy is a more garbage version of snprintf.
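
One possible shape for that, as a sketch only. The name safe_strncpy comes from the comment above; the signature and the bool return are assumptions.

```c
#include <stdbool.h>
#include <string.h>

/*
 * Copies at most size-1 bytes of src into dst, always terminates dst when
 * size > 0, and returns true if src did not fit (i.e. it was truncated).
 */
static bool safe_strncpy(char *dst, const char *src, size_t size)
{
    if (size == 0)
        return src[0] != '\0';       /* nothing fits; truncated unless src is empty */
    strncpy(dst, src, size - 1);     /* copies, NUL-padding up to size-1 bytes */
    dst[size - 1] = '\0';            /* force termination */
    /* Truncated if there is no NUL within the first `size` bytes of src.
       memchr stops at the first match, so it does not read past src's NUL. */
    return memchr(src, '\0', size) == NULL;
}
```

Unlike strlcpy, this never scans src beyond `size` bytes, but it still pays for strncpy's padding when dst is much larger than src.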


He was a jerk, but often he had a reason for his abusiveness. Was the reason in this case valid?


The question is: Is string truncation a good solution when the strings you have are unexpectedly long? Like, it's probably ok in a lot of cases, and once you start using these functions, it's very tempting to use them almost everywhere... but truncating "Attack at dawn on Friday" to "Attack at dawn" could be a disaster as well.

On the other hand, his recommendation to always know string lengths and use memcpy didn't really become common practice over the last 20+ years either, so I'm not sure it was worth all the arguing.

At this point, I'm kind of joining the camp of "C has proven to be too bug-prone for most organizations to use safely and therefore we should all go to Rust".


The second part "and therefore we should all go to Rust" does not follow necessarily from the first. Maybe the reason not everybody is gone to Rust is that it lacks something. Maybe we will all go somewhere else.


It lacks developer ergonomics, for me personally.

Source is for humans to read, it shouldn't look like alphabet soup for the idiomatic cases.


I suspect the eventual end result is major compilers start implementing a "fat pointer" string ABI for internal translation units (decaying to char * at the edge where necessary) and people start turning that on.


> On the other hand, his recommendation to always know string lengths and use memcpy didn't really become common practice over the last 20+ years either, so I'm not sure it was worth all the arguing.

It hasn't become common practice in C. But other languages (like JavaScript or Python) have become hugely popular, and don't use null-terminated strings.


Even languages in C's niche encode strings as pointer + length, like Rust.


> On the other hand, his recommendation to always know string lengths and use memcpy didn't really become common practice over the last 20+ years either

It was the way plenty of languages from the 70s stored their strings, including such popular ones as BASIC.


It has in the sense that people allocate strings much more than using fixed-size, stack-allocated arrays.

Modern C uses things like glib's GString, which (in addition to keeping the NUL terminator) track the length and can resize the underlying memory. And people also use a lot more asprintf instead of strcpy and strcat.
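
asprintf itself is a GNU/BSD extension, but the same idea can be sketched with only standard C: measure with vsnprintf, then allocate and format. `format_alloc` is a hypothetical name for this example.

```c
#include <stdarg.h>
#include <stdio.h>
#include <stdlib.h>

/*
 * Hypothetical helper: returns a malloc'd, NUL-terminated formatted string,
 * or NULL on failure. The caller frees the result.
 */
static char *format_alloc(const char *fmt, ...)
{
    va_list ap, ap2;
    va_start(ap, fmt);
    va_copy(ap2, ap);                        /* the list is consumed twice */
    int n = vsnprintf(NULL, 0, fmt, ap);     /* first pass: measure only */
    va_end(ap);

    char *s = NULL;
    if (n >= 0) {
        s = malloc((size_t)n + 1);
        if (s != NULL)
            vsnprintf(s, (size_t)n + 1, fmt, ap2);   /* second pass: write */
    }
    va_end(ap2);
    return s;
}
```

For example, `format_alloc("%s/%s", dir, name)` replaces a strcpy plus two strcat calls into a guessed-size buffer.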


> but often he had a reason for his abusiveness

There is never, ever, under any circumstances, a reason to be abusive.


Not really; he was frequently a jerk right out of the starting gates for no particular reason. That quote is the initial reply to the proposed patch, and the only "reason" I see for the insults is to satisfy Drepper's own emotional needs. It's petty and pathetic.

This is very different from e.g. Torvalds who sometimes rants a bit after someone who he feels ought to know better screwed up. I'm not saying that's brilliant either, but I can be a lot more understanding when people are abrasive out of passion for their project after something went wrong.


Well, he does actually have a point. strlcpy is a faster (well, safer) horse than strncpy, but it's still a horse. We should not use horses as the main mode of transport anymore.

"Doctor, it hurts when I strcpy — so don't do that".

He's being a jerk about it, but I would not say that he doesn't have a point.


Merely "having a point" is not "a reason for his abusiveness". I think I "have a point" for almost any HN comment I post (or at least, I'd like to think so) and have just as much "reason" to be a jerk as Drepper had. This applies to most posts on HN.


Ah, true. I think I cross-read comments here. Sorry.


Mostly no. True, the C NUL-terminated string definition is bad, but it's baked into the API. You need some semi-sane way to work with it that isn't 'Everyone writes their own wrappers to memccpy' (some people will get that wrong - e.g. the Linux manpage stpecpy wrapper just begs for misuse, and it's what most initiate C programmers will see if they know enough to check manpages).

strlcpy may not be the best API, but it's sane enough and by now ubiquitous enough to deserve inclusion. Had glibc and others engaged we may have had a better API. Regardless, glibc should never have had such a long veto here.


No.


Yes.


Why?


Inefficiency probably doesn't need any comment (these functions traverse the string twice instead of once). His argument that string length should always be known is correct in theory although not in practice.


Can you name a program that runs too slowly because it uses strlcpy?


You're looking at it wrong. strlcpy is defined to be slow in certain cases. The API requires it. Other interfaces may be slow today but can be improved in the future because they don't have a return value that is inconvenient. (Notably, memccpy today is typically a memchr followed by memcpy, since this is faster than a naive implementation. Obviously if it gets used more then it will get replaced with a single-pass, machine optimized implementation.)
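
For comparison, a sketch of a bounded copy built on memccpy, whose return value does not require scanning the rest of the source. `copy_bounded` is a hypothetical wrapper; memccpy is POSIX, so the feature-test macro below is only there for strict compiler modes.

```c
#define _XOPEN_SOURCE 700   /* exposes memccpy in strict ISO C modes; must precede all includes */
#include <stddef.h>
#include <string.h>

/*
 * Copy src into dst[0..size), always NUL-terminating when size > 0.
 * Returns the number of bytes copied (excluding the NUL), or -1 if src
 * had to be truncated. Never reads more than `size` bytes of src.
 */
static ptrdiff_t copy_bounded(char *dst, const char *src, size_t size)
{
    if (size == 0)
        return -1;
    char *end = memccpy(dst, src, '\0', size);
    if (end != NULL)
        return end - dst - 1;   /* end points just past the copied NUL */
    dst[size - 1] = '\0';       /* no NUL within size bytes: truncate */
    return -1;
}
```

Whether a given libc implements memccpy in one pass or as memchr plus memcpy is an implementation detail; the interface at least allows the single pass.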


As the top level comment was about knowing the length of a string: GTA Online's loading times were atrocious because of a null-terminated string.


Not really, more that the implementation of sscanf() is stupid and calls strlen() even though implementing sscanf() that doesn't require that is perfectly possible.


Instead of putting up with people constantly complaining how C is bad because of zero-terminated strings, we should better educate folks that there is absolutely zero reason why one has to rely on a NUL byte in-band signal. And APIs like sscanf() shouldn't be used beyond their historic purposes and there are easier ways to program.

C doesn't really "have" zero-terminated strings other than supporting them with string literals as well as having an atrocious "strings" library for historical reasons. C has storage and gives you the means to copy data around, that's it.

(Although I fully agree that the GTA issue can be seen as a bug in the implementation of sscanf()).
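
As a sketch of one such easier way: strtol reports where it stopped through its end pointer, so a parser can walk a large buffer once without any call needing the length of what remains. `sum_csv` is a made-up example function, and the claim about not scanning ahead describes the interface, not a guarantee about every implementation.

```c
#include <stdlib.h>

/*
 * Hypothetical helper: sum comma-separated decimal integers in `text`.
 * Parsing stops at the first thing that is not a number or a comma.
 */
static long sum_csv(const char *text)
{
    long total = 0;
    const char *p = text;
    while (*p != '\0') {
        char *end;
        long v = strtol(p, &end, 10);
        if (end == p)                 /* no digits at this position */
            break;
        total += v;
        p = (*end == ',') ? end + 1 : end;   /* resume after the separator */
    }
    return total;
}
```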


People typically do not realize that it has a return value that is expensive to compute.


It's not any slower in the typical case where destination buffer is large enough to fit the source thing. And if that's not the case then we are most likely in a error case (either caller notices the truncation and decides to abort, or ignores the truncation and things may soon go boom), and not many people care about optimizing error paths.

Furthermore, when coders don't have strlcpy() the alternatives are often even worse than strlcpy(): 1) They use strcpy() and have buffer overflows. 2) They use strncpy(), which is slower than strlcpy() in the common (non-truncating) case, and in the truncating case leaves the string unterminated (thus segfault potential). 3) They use snprintf(dst, len, "%s", src); which is strictly slower than strlcpy().


Since the error path is the largest one (the string doesn’t fit…) it makes sense to bound its execution. I would not recommend the others FWIW for exactly the reasons you mentioned.


Why would you optimize for the error case and not the common case? You've already done an unbounded amount of work copying the string in from the network or wherever. If anybody cared that much, they wouldn't let the string get that long in the first place.


It can be appropriate to bound the runtime of certain components of a system while allowing looser constraints elsewhere. For example I would perhaps not want to do an O(n) string operation on a collection of strings even though the user would be pretty upset if they can’t paste infinite input into my app.


It's only as expensive as what you pass in. Joke's on you.


qsort is also only as expensive as what I pass in. If it did a bubble sort internally I would be pretty upset though.


What? snprintf is nothing like doing an O(n^2) computation when O(n log n) was expected.


Right, it’s more like O(m) when you probably wanted O(n).


[flagged]


I don't think OP intended this quote to glorify Drepper. He is correctly regarded as a giant asshole. Very smart, but also an awful person to work with.


Back in the 00s when Ruby was hot, the Ruby community had a remarkably constructive and helpful attitude. Even when offering criticism. Many folks attributed it to its creator with the acronym, MINASWAN ("Matz is nice and so we are nice").

No community is perfect, but once you've seen how good it can be it's hard to have much patience for brilliant assholes.


I credit this more than anything for the success of Ruby. Just like I credit the 'holier than thou' attitude of the proponents of some other languages for their relative lack of success compared to where they could have been by now.

Dutch proverb, not sure if it translates or if there is a better English version: you catch more flies with sugar than with vinegar.


The English version is "you catch more flies with honey than with vinegar", which at least in English makes more sense, since in English "sugar" generally implies dry granulated sugar. You're not going to catch any flies with that. (Ironically, you'd probably catch more with the vinegar, since some would go to it for the moisture and a few would drown.)

/tangent


I'll substitute syrup then ;)


[flagged]


> Or just another do-nothing internet blowhard?

I don't know about the OP but you are crossing the line here.


Linux uses strscpy. See [1] [2] [3]. The issues of concern are to always NUL-terminate, and to effectively know if the result was truncated.

Truncation can lead to big issues, especially if the string being composed refers to paths, device names, other resources, etc. For example you may truncate a path from /foo/bar/baz to /foo/bar and inadvertently operate on other files. An API that makes this confusing is dangerous.

See the confused deputy problem description [4].

[1] https://mafford.com/text/the-many-ways-to-copy-a-string-in-c...

[2] https://lwn.net/Articles/659214/

[3] https://docs.kernel.org/core-api/kernel-api.html#c.strscpy

[4] https://en.wikipedia.org/wiki/Confused_deputy_proble
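
To make the path example concrete, here is a sketch that treats truncation as a hard error instead of silently operating on a shortened path. `build_path` is a hypothetical helper built on snprintf, not the kernel's strscpy.

```c
#include <stddef.h>
#include <stdio.h>

/*
 * Writes "dir/name" into out[0..size). Returns 0 on success, or -1 if the
 * result would not fit; callers should then refuse to use `out`.
 */
static int build_path(char *out, size_t size, const char *dir, const char *name)
{
    int n = snprintf(out, size, "%s/%s", dir, name);
    if (n < 0 || (size_t)n >= size)
        return -1;      /* would have been truncated: treat as an error */
    return 0;
}
```

With an 8-byte buffer, joining /foo/bar and baz fails instead of yielding /foo/ba, which is the kind of silent change of target the comment above warns about.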


Here's discussion on why strscpy shouldn't be included in POSIX:

https://www.austingroupbugs.net/view.php?id=986 (scroll to 0002897)


Boy do I miss MantisBT...


It's not a very good argument. Notably:

> strlcpy() fits within the existing set of functions like a glove. strlcpy(a, b, n) behaves identically to snprintf(a, n, "%s", b). The return value always corresponds to the number of non-null bytes that would have been written. If we truly think that this is bad design, should we come up with a new version of snprintf() that also doesn't do this? I don't think so.

People typically do not consider snprintf and strlcpy to be a similar family of functions. There's no need to transpose the weird behavior to a new string copying routine.
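
The quoted equivalence is easy to check with a sketch of strlcpy's semantics. It is named `my_strlcpy` here to avoid clashing with a libc that already provides the real one; this is an illustration, not glibc's implementation.

```c
#include <stdio.h>
#include <string.h>

/* Copies up to size-1 bytes, terminates when size > 0, returns strlen(src). */
static size_t my_strlcpy(char *dst, const char *src, size_t size)
{
    size_t len = strlen(src);        /* full pass over src: the O(n) return value */
    if (size > 0) {
        size_t n = len < size - 1 ? len : size - 1;
        memcpy(dst, src, n);
        dst[n] = '\0';
    }
    return len;                      /* a result >= size means truncation */
}
```

Calling `my_strlcpy(a, "hello world", 5)` and `snprintf(b, 5, "%s", "hello world")` both leave "hell" in the buffer and both return 11, the length the result would have had.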


You're latching onto one minor point and ignoring main points like 'According to POSIX we can't define strscpy's return value', and 'We are standardizing a widely used function while strscpy is used in exactly one place - and a strscpy which fixes the return value is used nowhere'.


I don’t actually care what it returns as long as it’s something that doesn’t take O(n) to compute. Feel free to rename it too if you don’t want it to be confused with strscpy.


A lot of what POSIX does is standardize current use. strlcpy was used just about everywhere in Unix - it's just that on Linux devs used a jury-rigged implementation because glibc.

There's a case for strlcpy standardization even when it's not the perfect function. A standard strlcpy will allow compilers to look at all the cases where the return value isn't checked and replace the strlcpy implementation with an as-if strlcpy implementation not requiring O(n) compute for return value - otherwise, the compiler needs to peek behind the curtain and see if the local jury-rigged strlcpy implementation (which may be called by a different name like g_strlcpy) matches the 'real' strlcpy, which it very likely does, but proving it is a different matter.*

Now, if we want to create the perfect string function, there's a case to do that separately. IMHO, strscpy isn't bad if we fix the return value issue.

* I'm assuming the build system will ensure projects use the 'standard' strlcpy where available, which I think is reasonable since they mostly already do that when it's not Linux.


I guess you are right and my actual annoyance is that people are using this function and I don’t think they should in many cases. I would’ve liked the function to mostly fade from use and not be standardized as a result…


Let this be a warning of what happens when the (g)libc folks refuse to consider programmer needs and offer no solution. Had glibc offered any semi-sane solution, they'd have won by marketshare alone and their solution would have been used everywhere instead. By the time Linus thought of strscpy, it was too late.

>'We want a semi-sane null-terminated string copy function'

>'All you need is memccpy, la-la-la'

>(Everybody runs away screaming, even the Linux kernel folks decide to create a string function)

>(OpenBSD has a ready solution and a decent enough reputation, almost nobody checks it)

>Versions of strlcpy are embedded everywhere.

>glibc is forced to implement strlcpy.



I fear all of this is dancing around the nasty core of the problem: generic writing to C style strings can't be done without extra information. You can't write stuff to memory without negotiating how much room is needed and available, and optionally moving the string. Silent truncation will cause bugs. Buffer overflow even more.

Fixing this now is hard: writing to 0-ended strings requires manually tracking lengths. Expanding a string without allowing malloc is misery.

The only way out I see is basically starting from zero: ISO C should define an API with a (pointer,current length, max length) struct at its core, pointer pointing to a 0-terminated C string. You can read it, but changing it requires using functions that can error out and/or malloc more memory. There are already multiple libs like this, but C has none. If the struct would be ABI, non-C programming languages can pass strings between them.
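A rough sketch of what such a (pointer, current length, max length) struct could look like, with an append that can fail or grow the allocation. All names here (struct cstring, cstring_append) are made up for illustration; this is not a proposal from any standard or an existing library's API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical layout: start from {NULL, 0, 0}. */
struct cstring {
    char  *ptr;   /* 0-terminated after the first successful append */
    size_t len;   /* current length, excluding the terminator */
    size_t max;   /* allocated bytes, including room for the terminator */
};

/* Appends n bytes from src, growing the allocation if needed.
   Returns 0 on success, -1 on size overflow or allocation failure
   (in which case *s is left unchanged). */
int cstring_append(struct cstring *s, const char *src, size_t n)
{
    assert(s != NULL && (src != NULL || n == 0));
    if (n >= SIZE_MAX - s->len)
        return -1;                          /* len + n + 1 would overflow */

    size_t need = s->len + n + 1;
    if (need > s->max) {
        size_t newmax = s->max > SIZE_MAX / 2 ? need : s->max * 2;
        if (newmax < need)
            newmax = need;
        char *p = realloc(s->ptr, newmax);  /* realloc(NULL, n) acts as malloc */
        if (p == NULL)
            return -1;
        s->ptr = p;
        s->max = newmax;
    }
    if (n != 0)
        memcpy(s->ptr + s->len, src, n);
    s->len += n;
    s->ptr[s->len] = '\0';                  /* stays readable as a C string */
    return 0;
}
```

Readers can pass s.ptr to anything expecting a C string; writers go through the function, which is the only place that can move or resize the memory.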


C had the opportunity to include this but they did not. It is my understanding that they wanted to design everything in C as inherent to the language, rather than magic types, especially a struct. There is an elegance in the notion that a string is just an array of characters. If I’m working with a significant amount of strings in C, I can keep track of lengths, not a huge deal.


Exactly this. There are no literals in C that create composite types. There are no composite types inherent to the language. All these types are defined in (system) includes.

And zero-terminated strings are not strictly worse than other length-prefixed string forms. They save some space -- sure, less relevant today -- as well as provide an in-band termination signal -- which is hacky, again sure, but it is convenient when looking at a hex dump for example.


There are literals that create composite types, since C99: https://en.cppreference.com/w/c/language/compound_literal
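For anyone who hasn't used them, a small sketch of C99 compound literals. The struct and function names are made up for the example.

```c
#include <assert.h>

struct point { int x, y; };

static int sum_point(struct point p) { return p.x + p.y; }

int compound_literal_demo(void)
{
    /* (struct point){ ... } creates an unnamed struct object in place. */
    int s = sum_point((struct point){ .x = 3, .y = 4 });
    assert(s == 7);

    /* The same syntax works for arrays; the object lives until the
       enclosing block ends. */
    int *arr = (int[]){ 1, 2, 3 };
    return s + arr[2];
}
```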


> There are no literals in C that create composite types

Float/double literals do


how so?


Because a float contains bit fields: a sign, an exponent, and a mantissa. It's a bit of a pedantic argument but technically it makes sense.


You could call them composite in that sense, but in C, composite types are types that are composed of other C types (structs, unions, arrays, ... functions? Not 100% sure of the specifics).

Also, the representation of floats and doubles isn't precisely specified; at least, IEEE 754 is not a strict requirement (not sure about the technical implications from what's actually specified).


I'm not really sure I get why the distinction between composite and non-composite types is important if the only difference is that you can't easily access sub-parts of the non-composite types.


I guess we can cook up some arguments about language purity and orthogonality. Introducing literals that create composite types might indeed create some difficulties for existing compilers that want to follow the latest standard.


What kind of difficulties? I don't understand why this would make any difference from a compiler's perspective.


It requires the compiler to be able to create types even before parsing, to have composite types without any source code location... you'd have to check with a specific compiler but it's not hard to imagine some potential issues.

There's also outward facing changes that introduce complexity... new literal types (+ new syntax) are required because otherwise old code would break (sizeof...) etc. etc. Existing parsers and tooling will have to be adapted, too...

Generally it's nothing that should be insurmountable, but why accept any of that if there's nothing to gain except a feature that is requested mainly by detractors of the ecosystem, a feature that goes squarely against the grain of the language...?

Any language is proud to be able to implement its features as library components around its core capabilities. Why introduce a new C type as a structure that can easily be defined as an aggregate type in source code?


I'm not the one making the argument, merely explaining why the GGP may have chosen this example.


Wasn't implying that! :)


An early version of C didn't have structs, the initial attempt to get the OS off the ground failed, and after adding structs it worked. Structs are just syntactic sugar over memory offsets relative to a base pointer, a construct for which many CPUs include primitives.


C is lots of magic and quirkiness.

This reminds me. From a spec/design perspective Ceylon was the cleanest language I know. Almost everything, including a lot of the keywords were actually defined in the standard library. The fact that Integer was actually a Java int behind the scenes was just a compiler implementation detail. There was very little "magic". If you wanted to know how something in the language worked you could just look at the standard library.


You can't even really assume that strings are writable, they might well be in ROM on an embedded device.


NULL terminated strings are a fact of life in many cases and C just needs lots of string functions to cater for different use cases. e.g. usually for me truncation is worse than crashing (corrupted data basically).

When I read arguments they are full of people thinking that their one size fits all and that somehow having too many variations would be bad.

This seems illogical to me since I've had to write my own string copying routines enough times because the one that fitted my need wasn't commonly available. Purism with C is just stu.... well anyhow.

A "good" fat pointer library for C would help a lot - something that could pop a NUL onto the end when you needed to put the string into an OS function but I would also have to groan at the idea that NUL termination should be outlawed in some way. At the C level you want options not limitations.


Yeah agree. Roll your own linked lists (or rather: avoid them but if you have to use them, roll your own, it's like 30 min to 2 hours at most), your own btrees, your own ring buffer. I guess the gnu hashmap is fine 90% of the time, but if your use case is weird, you might want to run your own too. And BTW the glibc isn't really bad, just that in C, generic solutions will miss your specific edge cases.

And this is true for strings too. If you use only static strings in your whole project, the standard is enough, but I won't write an IRC server or a posix Shell without my own strings (not anymore :p).


odbc interface can accept both null terminated and length bounded strings by using NTS sentinel value for null terminated string length.


It must be snowing in hell right now. :)

It says that they might be added to POSIX.

edit: apparently Ulrich Drepper (major glibc contributor & former glibc leader) is back at Red Hat [0]

[0] https://research.redhat.com/blog/project_member/ulrich-drepp...


They're already in draft 3 of the forthcoming 202x revision. The glibc work resulted in a request to clarify the draft 3 specification: https://www.austingroupbugs.net/view.php?id=1726


I'll bet he would be a laugh to work with.


Perhaps working at Goldman Sachs mellowed him out some.


I hope not, as the entire reason he went there was to punish the bankers for their role in the 2008 financial crisis ;)

(https://www.reddit.com/r/linux/comments/zzhd9/comment/c69cam...)


So what is RedHat being punished for now?


Depends who you ask.


I work with him and he's smarter than almost everyone.


I don't think anyone is questioning his intelligence. But it seems his manners have not always been able to keep up with his intellect.


Which is a very different thing than "I would want to work with him".


If you ever find yourself writing strcpy() followed by strcat(), consider:

  snprintf(buf, sizeof(buf), "%s%s", a, b);
This is as safe as strlcpy and strlcat, but more efficient, and has been standard for 24 years.


1. This returns an int.

2. This is inefficient in a similar way that strlcpy is: it does more work than the size of the buffer.


1. This is only slightly less irrelevant than "this returns a size_t" if that were so.

2. You have the option to provide the length, snprintf(buf, sizeof buf, "%.*s%.*s", len1, str1, len2, str2);

If you're bottlenecked by snprintf (hint: you aren't) then snprintf isn't your API anyway. Write some more custom code, probably some memcpy's etc.


For 1: returning an int means that you get some unspecified behavior on overflow. For 2 then you need to call strlen yourself which kind of defeats the point of using snprintf because you can just use memcpy instead.


For 2: No you don't? There is no reason to fault snprintf() if you didn't already know the length of the string.

And when you do know the lengths, no, there absolutely is a reason to use snprintf -- convenience. snprintf(buf, sizeof buf, "%s/%s", dirpath, filename); is much easier to write than an equivalent sequence of manual copies with temporary index variables and pointer arithmetic.

Same for snprintf(buf, sizeof buf, "%.*s/%.*s", dirlen, dirpath, filelen, filename); if you really cared about squeezing the last drop from the API.
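As a sketch of the path-joining call above with the truncation check spelled out: snprintf returns the length it wanted to write, so comparing that against the buffer size tells you whether anything was cut. The helper name join_path and its 0 / -1 convention are assumptions for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper: writes "dir/file" into buf.
   Returns 0 on success, -1 on a formatting error or if it did not fit. */
int join_path(char *buf, size_t size, const char *dir, const char *file)
{
    assert(buf != NULL && dir != NULL && file != NULL);

    int n = snprintf(buf, size, "%s/%s", dir, file);
    if (n < 0)
        return -1;          /* snprintf reported an error */
    if ((size_t)n >= size)
        return -1;          /* did not fit; buf holds a truncated prefix */
    return 0;
}
```

The caller gets to decide what truncation means (reject the input, allocate a bigger buffer, etc.) instead of having it happen silently.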

For 1: Then don't overflow. It's not practical to process strings that big to allow a 32-bit overflow (or 16-bit overflow, on a 16-bit system), so it's unlikely anybody here has ever been in that situation anyway.

Apart from that I'm not so sure that the API is specified to allow overflow to happen. It's probably an under-specified area of the contract, but I would first check that the result couldn't be -1 for example.

Apart from that, what would be the reasonable specification in case of size_t overflow that would result in a controllable situation?

Sometimes I think, C being probably better formalized than any other language is also a big reason for it being criticized so much. Language nerds just love to take the specs and try to shred it on theoretic grounds without any consideration of the practical.


I open files that are larger than 4 GB from time to time, it’s not really reasonable to say that 32-bit is large enough for these kinds of things anymore. Well, let me rephrase: sometimes it’s ok to be “ok I don’t handle more than 4 billion (say this is a field to enter your name)” but there should also be a way to do it if I care enough, like when I’m writing a text editor.

FWIW I believe most implementations will do something safe on overflow like terminate the program or return some error (can printf signal via errno?)


Honestly just use memcpy and define your own string structure.

C has void*, that allows you to implement easily modifiable data structures. There is a bit of 'NIH' syndrome in what I'm saying, I'll admit, but in the end it's better imho.


C is all about memory management and copying data around, and people can't stop whining that there is no magic string handling sauce (as if strings were special), and keep acting like we had to put up with the ridiculous strcat() etc. nonsense.


That's not what 'void*' is for.


That's how I use it though. As a data-agnostic type, so I can put whatever I like in my btrees/buffers/linked lists.

I don't use it for string management, I guess reading my post again, I expressed myself poorly, again.


Be careful, this does not work if buf is 'char*' rather than an array type.
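A short sketch of the pitfall: once the buffer is passed to a function it is a char*, so sizeof(buf) gives the pointer size rather than the array size, and the capacity has to travel alongside it. The function name concat_into is made up for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper. buf is a pointer here, so sizeof(buf) would be
   sizeof(char *); the real capacity must be passed in as bufsize. */
int concat_into(char *buf, size_t bufsize, const char *a, const char *b)
{
    assert(buf != NULL && a != NULL && b != NULL);
    return snprintf(buf, bufsize, "%s%s", a, b);
}
```

At the call site, where the array type is still visible, sizeof is fine: char arr[8]; concat_into(arr, sizeof arr, "ab", "cd");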


asprintf is even better (although different because it allocates a new string).


Interesting to see the implementation is basically a two-pass process: strlen() to count the source string, followed by a memcpy().

My intuition would be to give some importance to not looping on the source string radically beyond the length of the destination buffer.

https://sourceware.org/git/?p=glibc.git;a=blob;f=string/strl...
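For readers who don't want to click through, the two-pass shape described above looks roughly like this. This is a sketch written from that description under the usual strlcpy contract, not glibc's actual source; the name sketch_strlcpy is made up.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical re-implementation for illustration only. */
size_t sketch_strlcpy(char *dst, const char *src, size_t dstsize)
{
    assert(src != NULL);
    size_t srclen = strlen(src);        /* pass 1: walks all of src */

    if (dstsize != 0) {
        assert(dst != NULL);
        size_t n = srclen < dstsize - 1 ? srclen : dstsize - 1;
        memcpy(dst, src, n);            /* pass 2: copy what fits */
        dst[n] = '\0';
    }
    return srclen;                      /* >= dstsize means truncation */
}
```

The strlen in pass 1 is where the cost comes from: it runs over the whole source even when only a few bytes can be copied, because the return value is defined as the full source length.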


25 years on. Congrats OpenBSD


I'll never understand how Linux won over BSD.


Just make your own string functions from scratch when using C, you'll thank me later.

First, you make two structs: str_buf { capacity, len, data[] } and str_view { len, data* }.

Then write your string handling functions to write into a str_buf* and read from str_views. You have the length and capacity available so memcpy is easy and safe to use.

The str_buf treats capacity as 1 less than it really is such that it can ensure there's always a stupid 0 at the end for compatibility with APIs that expect 0 terminated strings.

There you go, no more security bugs, no more nonsense.
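A minimal sketch of the two structs described above, with one append function. The field layout follows the comment; the function names, and the choice to refuse an append that doesn't fit rather than truncate, are assumptions for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

struct str_view { size_t len; const char *data; };

struct str_buf {
    size_t capacity;  /* usable bytes; one extra byte is reserved for the 0 */
    size_t len;
    char   data[];    /* capacity + 1 bytes are actually allocated */
};

struct str_view str_view_from_cstr(const char *s)
{
    assert(s != NULL);
    struct str_view v = { strlen(s), s };
    return v;
}

struct str_buf *str_buf_new(size_t capacity)
{
    if (capacity > SIZE_MAX - sizeof(struct str_buf) - 1)
        return NULL;                       /* allocation size would overflow */
    struct str_buf *b = malloc(sizeof *b + capacity + 1);
    if (b == NULL)
        return NULL;
    b->capacity = capacity;
    b->len = 0;
    b->data[0] = '\0';
    return b;
}

/* Returns 0 on success, -1 (buffer unchanged) if v does not fit. */
int str_buf_append(struct str_buf *b, struct str_view v)
{
    assert(b != NULL);
    if (v.len > b->capacity - b->len)
        return -1;
    if (v.len != 0)
        memcpy(b->data + b->len, v.data, v.len);
    b->len += v.len;
    b->data[b->len] = '\0';                /* the reserved byte makes this safe */
    return 0;
}
```

Whether this removes bugs or just moves them depends on every caller going through these functions; the off-by-one risk lives in exactly one place, the reserved terminator byte.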


> The str_buf treats capacity as 1 less than it really is such that it can ensure there's always a stupid 0 at the end for compatibility with APIs that expect 0 terminated strings.

Off-by-one errors are a thing.

> Just make your own string functions from scratch when using C, you'll thank me later.

No, if you're going to use C and you need a string type use a well supported string library so that you don't end up reinventing the wheel (probably in a buggy way) and benefit from the battle testing that that code has gone through.

If we're looking at actual strings (as in text) then I'd use 'libunistring'.


What I'd like to have is snprintf_l(). It's not standard, but it's available in FreeBSD, macOS, and Windows (as _snprintf_l()). Just not in glibc (and probably musl).


snprintf_l() is not standard, but you can simulate it with uselocale() to save and restore the per-thread locale, which is standard (POSIX.1-2008) and supported by Glibc, FreeBSD, MacOS and other OSes, though Windows requires its own, different way of doing per-thread locale.

It seems it was once an explicit design choice in Glibc. Here (https://sourceware.org/bugzilla/show_bug.cgi?id=10891) a Glibc ticket refers to documentation "Thread-aware Local Model, A Proposal" of the plans for the _l() functions (https://akkadia.org/drepper/tllocale.ps.gz) long ago, which are nowadays implemented in Glibc.

It looks like snprintf_l() and other <stdio.h> functions were not part of that plan.

That plan or something like it also made its way into POSIX.1-2008, so it seems likely that the committee gave some thought to including strftime_l() and not snprintf_l().

The paper linked above includes a rationale that tiny, potentially performance-critical functions like isalpha() need a fast version that takes a locale parameter, thus isalpha_l(), because of the overhead of fetching a thread-local value inside the function.

Perhaps the intent is that only those tiny functions, or even macros, whose performance would be greatly affected by the cost of fetching the thread-local locale, need a _l() version. That mostly makes sense for the functions which have _l() versions in Glibc. But with that rationale, I don't see why there is strftime_l() but not snprintf_l().


I want to output a lot of numbers (double) and be sure they are in "C" locale (generating JSON output). So it's actually fprintf_l. I can set/restore LC_NUMERIC in a wrapped function, but having a _l version would be nicer.
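A sketch of the uselocale() workaround mentioned above, wrapped into one function so the save/restore can't be forgotten. The name snprintf_c_locale is made up, and this assumes a POSIX.1-2008 system such as glibc (some systems may declare these functions in a different header).

```c
#define _POSIX_C_SOURCE 200809L  /* assumption: exposes newlocale/uselocale */
#include <assert.h>
#include <locale.h>
#include <stdarg.h>
#include <stdio.h>

/* Hypothetical helper: snprintf under the "C" locale, affecting only the
   calling thread. Returns what vsnprintf returns, or -1 if the locale
   object could not be created. */
int snprintf_c_locale(char *buf, size_t size, const char *fmt, ...)
{
    assert(fmt != NULL);

    locale_t c_loc = newlocale(LC_ALL_MASK, "C", (locale_t)0);
    if (c_loc == (locale_t)0)
        return -1;

    locale_t old = uselocale(c_loc);   /* switch this thread, keep the old one */

    va_list ap;
    va_start(ap, fmt);
    int n = vsnprintf(buf, size, fmt, ap);
    va_end(ap);

    uselocale(old);                    /* restore before freeing c_loc */
    freelocale(c_loc);
    return n;
}
```

Creating and freeing the locale object on every call is wasteful for bulk output; caching one locale_t for the lifetime of the program would be the obvious refinement.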


The end of an era, a rather pedantic and overdrawn one.


So was Drepper wrong? Did he just get worn down? Or did it not involve him at all?


I understand Drepper pretty much considers string.h a lost cause, and I can’t fault him for that.

It’s rather that POSIX decided to add the strl functions, so adding them (and verifying their semantics) is a necessity.


A purist would support removing strcpy, etc... but that wouldn't go well, so adding an improvement is acceptable.


> A purist would support removing strcpy, etc...

That is very much Drepper’s position:

> Correct string handling means that you always know how long your strings are and therefore you can you memcpy (instead of strcpy).

> Beside, those who are using strcat or variants deserved to be punished.


They should have added a warning for strcpy 20 years ago and shipped the compiler with it default enabled.


I think the fact that it will be added to POSIX was decisive.


I think he is doing other things these days https://research.redhat.com/blog/project_member/ulrich-drepp...


He isn't involved anymore, is he?


There are two things I have learned from this thread:

1. glibc has been handling things with a lot more civility post-Drepper, and while it's had as hard a time as any OS infrastructure project, that's helped a lot when it came to managing more polarizing issues.

2. The greatest enemy to wider Rust adoption is not that C-educated programmers are reluctant to learn and apply more solid principles. The greatest enemy to wider Rust adoption is that, despite having ample material about the past mistakes of their forerunners, the Rust community is only learning from those that relate to language design and ignoring all the other ones.


> The greatest enemy to wider Rust adoption is that, despite having ample material about the past mistakes of their forerunners, the Rust community is only learning from those that relate to language design and ignoring all the other ones.

What? How is that your takeaway from the thread?

There's only a few mentions of Rust and none of them are abrasive.


There are quite a few abrasive replies upthread from some Rustaceans. They don't mention Rust by name, just like not every one of Drepper's mails contained the word "glibc", but they're in the same vein.

This is particularly important at a point in a language's lifetime when community support is not just the best, but usually the only kind of support you can get. I like Rust and I'm very productive with it, but if anyone thinks I'm going to ask junior devs on my team to put up with the kind of stuff I see upstream, they're wrong. Just because we developed a thick skin for it on FOSS mailing lists back in the nineties doesn't mean everyone needs to.


If you see poor behavior in the project, call it out. I would personally be interested in being made aware of it to help eliminate it, but the moderation team[1] exists specifically to deal with this.

1: https://www.rust-lang.org/governance/teams/moderation


Life's too short.


I'm not sure I understand what you mean by "upthread". Do you mean in this HN thread? Or in a mailing list associated with the commit OP points to? If so, do you have a link to said mailing list?


Yeah, I meant up in this HN thread.


Well then, again, I really want to protest that "the Rust community is ignoring all the [social] mistakes of the past" is a pretty harsh and unfair judgment of the Rust community.

There's only a few mentions of Rust in this thread, they're all pretty tentative and polite. The mistakes of the past include stuff like people hurling insults at each other and calling people idiots for not using a given technology.

What I'm seeing in this thread is, at most, strong-ish opinions that C is systematically bad and maybe the solution is switching to Rust. That's not being abrasive, that's being opinionated.

(And yeah, I know that I'm sealion-ing this a bit; but I do think when people say stuff like "community X is abrasive and didn't learn from the past", a non-null burden of evidence should be expected)


"Opinionated, not abrasive" is one of the top 5 excuses I've heard from people who just liked to be abusive. You're free to place the border between the two wherever you want. As far as I'm concerned, things like defending not a maintainer's technical choices (which would be fair) but their abusive behaviour, or likening a technical choice with substance abuse or sharing needles, are neither tentative nor polite, "opinionated" though they may be. These are things our industry should have outgrown a long time ago.


Kudos and congrats to Todd Miller and the OpenBSD folks! [1]

  [1]: https://man.openbsd.org/strlcpy.3


Urgh more str functions.

I look forwards to forgetting about this and/or discovering a new foot gun.


Just ignore them. I suspect they're adding them (to POSIX) for portability of historic software. Here is an issue tracker I've found (Haven't read through it; not interested): https://www.austingroupbugs.net/view.php?id=986


Abbreviations are just bad code. Always. Tradition is no excuse for writing bad code. Names of everything should be sufficiently descriptive that absolutely anyone will know its purpose upon first glance.


This mere suggestion will annoy many C programmers but I completely agree with you here. With all of the ioctls, the atois and the strncpy_ls I just stopped trying to understand what the names are supposed to mean and use them for their effects only. strlcpy may as well be called qpdixhwnai for all I care, I'll have to look it up when I need it anyway.

I've learned C on Windows and the Windows API is friendly in comparison to the C API. When Windows beats you in API design, you should really reconsider some of your policies.

Is mem_copy really that much worse than memcpy? Why not memcopy? What do we gain by leaving out that single o? Why is settimeofday not sttmod if munmap is how you write memoryunmap?

It feels to me like POSIX is still being optimized for people manually poking holes into punchcards. We've had autocomplete for decades now, a few extra characters won't hurt, I promise.


Look no further than /bin to see how strong such conventions can be. Mnemonics, function names (and filenames too!) were short because memory was super expensive and likely the first resource bottleneck you'd hit while building anything significant.

What you grow up with is what you consider to be normal and I totally get it why you'd balk at strstr or other cryptic names (or LDA or ls, for that matter) but to me they look perfectly normal and are part of my muscle memory. See also: QWERTY and the piano keyboard for mechanical analogues.


In C89, external symbols were only guaranteed to have 6 significant characters, so both "mem_copy" and "memcopy" get truncated.

And in modern times, I suspect it'd just be thematically weird to have strcpy, strcat, and safeCopyString in string.h, so old conventions still stick around.


Then what happened to creat?


K&R C compiler restrictions, possibly. Never bothered to read its source code.


Original C compilers only guaranteed comparing up to 6 characters in an external symbol, which I think is part of the reason why many names are so short.


I'd be okay with renaming strcpy() to string_you_big_dummy_copy()


You joke, but this is almost reasonable. When refactoring a large codebase riddled with strncpy, strcpy and strcmp, understanding unambiguously what code does shouldn't come down to my middle aged eyes being able to parse better than a compiler. I did a global search and replace with a #define, verified the object code diff'd against the original version, and never looked back.


As usual I'm joking but somewhat serious. Step one is better replacement functions. Step two actually should be to make the bad ones feel sleazy.

One thing I think is the problem with making safer string functions is it's hard to do that while staying at the same very low level of abstraction. And I think a lot of code out there sets up string functions to work off incomplete information. (here is a pointer to a string buffer, trust me it's big enough to hold what you'll stuff in it)



