Efficient string copying and concatenation in C (redhat.com)



I can't see how incrementally improving the approach of the C standard library is the right design decision here. They should just add a standard, well-designed dynamic string library to C that stores the length alongside the (binary-safe) string itself, and that's all. Every serious C program uses one; null-terminated strings are a joke. However, because of C's huge legacy of null-terminated strings, such a library should make sure to always automatically terminate strings with a null byte, so that people can trivially print them, call strlen() on them, and so forth when needed and when there is no binary data inside.


A lot of C developers I know see value in keeping the C standard library small. I implemented my own `l_string` (length-string) struct and associated libraries for a class (we weren't allowed to use external libraries) and it was not prohibitively difficult. Something like this probably covers most use cases: https://github.com/antirez/sds
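
The core of it is just a small header that travels with the data; a minimal sketch (hypothetical names, not the actual sds layout):

    struct l_string {
        size_t len;  /* bytes in use, excluding the terminator */
        size_t cap;  /* bytes allocated for data */
        char  *data; /* kept '\0'-terminated for interop with C APIs */
    };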


Sadly, real C standard libraries are anything but small. glibc is pretty huge; statically linking a simple hello world program produces something like 750KB of output (libc.so itself is nearly 2MB in size). A quick peek at glibc's documentation shows just how much stuff there is: https://www.gnu.org/software/libc/manual/html_mono/libc.html.

Unfortunately, I wouldn't call all of it critically useful stuff. For example, "strfry" and "memfrob" are useless beyond toy applications as neither are remotely secure; "l64a" and "a64l" are useless (and possibly even dangerous) because they look like they implement base64 but with a different alphabet (and a much worse API); they have a hilariously large number of random number generators, _none_ of which produce high-quality random numbers for simulation or especially cryptography; etc. etc.

Neither MSVC nor macOS will let you truly statically link a binary (because the system interfaces aren't constant), so it's harder to directly compare, but on macOS "libSystem.dylib" depends on basically everything in /usr/lib/system which is about 5MB of code.

So yeah, I don't think I'd call many real libc libraries "small" by any stretch.


Here is a web page comparing the different C library implementations:

http://www.etalabs.net/compare_libcs.html

For the .a size:

musl: 426k

uClibc: 500k

dietlibc: 120k

If you statically link to these, your final binary will only pull what is actually needed and be smaller than this.


I know they are smaller, but why are they able to only pull in what you use when other libc implementations can't?


Any that allows static linking will get the same benefit.

For the three mentioned in the ancestor thread:

Apple doesn't ship a static libc, so developers must dynamically link.

glibc does allow static linking, but it is something they don't really go out of their way to promote and I recall some historic pains trying to get it to work right. Maybe that all works better today.

MSVC does allow static linking to msvcrt. MSVC is a little different from other platforms because they have permutations (and associated rules) you must be aware of: [debug | release], [static | dynamic], [single-threaded | multithreaded].


glibc is only so bloated because of locale and Unicode support. Also 32-bit wchar_t and internationalized errors.

Adding better string functions or CPU-specific optims is nothing compared to that.

musl is mostly so small because it only does the C locale and a basic Unicode towlower. Nothing like the monstrous glibc Unicode tables or internationalized error messages.


I don't see value in keeping the C standard library small in this regard, since everybody is going to implement this kind of code from scratch anyway; it's like not having stdio.h for FILE* operations. However... the library you linked is surely a good pick according to me, since I'm the author myself :-D


I really love his trick of returning a pointer to the inside of his own allocated memory which is then compatible with C's stdlib string functions, while having his own string functions use pointer arithmetic to get the real struct back out and operate on that. Genius! This is the kind of thing that keeps drawing me to C every once in a while. C is almost like a puzzle, to figure out how you can accomplish what would be trivial in JavaScript, but with all these restrictions caused by an almost extreme simplicity. (Granted, C is not actually simple when you start hitting all the weird edge-cases and surprising UB, etc.)


I don't love this trick. To get this to work, they had to typedef the string type to char*, which means the API will also accept any normal string and probably crash at runtime. So to get this little convenience feature, type safety was sacrificed.


Seems to me that it would also prevent tools like the address sanitizer and valgrind from detecting some invalid memory accesses (corresponding to negative indexes). A "more conventional" string library [1] might be safer.

[1] for example: https://github.com/websnarf/bstrlib/blob/master/bstrlib.txt


> A lot of C developers I know see value in keeping the C standard library small.

And what value would that be?

... if you're going to say "easier to fit a small implementation into the memory of some microcontroller" - not a sufficient argument. There's always some smaller system with not enough memory, and larger systems with more than enough.


This is C land. You're not going to make it far by making the case that bloat is acceptable.

People have a range of technical and aesthetic reasons for hating bloat and C attracts a lot of them.

Off the top of my head:

- Smaller memory footprints

- Easier reimplementation

- Lower attack surface

- Less room for breaking changes

- Less baggage for when we inevitably wish we could deprecate things


> Smaller memory footprints

Nobody said you need to link the entire library.

Plus - with custom-made libraries, the memory footprint is no smaller.

> Easier reimplementation

Implementation begins with some part of the standard library, and gets completed gradually later on.

> Lower attack surface

Are you really arguing in favor of "everyone roll your own library" as a reduction of attack surface?

> Less room for breaking changes

C's standard library was already broken in various significant ways to begin with (e.g. gets() ...) .

Also, are you really worried about all those breaking changes between C99 and C11?

> Less baggage for when we inevitably wish we could deprecate things

Umm, you do realize things usually get deprecated when they've been replaced by something more relevant, right? In our case - new C library code.


> - Lower attack surface

wut?


String formatting libraries, and other things that work with buffers, are frequent attack surfaces for buffer overflow attacks.

Granted, using C means developers often implement these operations themselves which introduces the possibility of creating more attack surfaces. But it's less likely that the standard library presents an attack surface when the standard library is tiny.


I mean, C's attack surface is like that of activated charcoal. I'm not sure that C's small standard library gives it a smaller attack surface, specifically because it means programmers who have better things to do are forced to reinvent the wheel, poorly [1]. But mostly, because C's lack of guardrails means it takes active effort on even trivial operations to be safe.

I've been working with it for nearly two decades, and every year I think more that C programs should be confined to a well-guarded quarantined area with hazard trefoils and a "beware of the leopard" sign.

[1] https://en.wikipedia.org/wiki/Greenspun%27s_tenth_rule

edit: people reimplementing their own "safe" string library isn't something to brag about; it's something our entire industry should be ashamed of.


Actually, not having dynamic strings and buffers in the library is one of the main reasons people do pointer math and reallocations by hand, introducing memory corruption bugs.


> And what value would that be?

A lot of embedded development still has very tight requirements. We're still talking about budgets measured in tens or, if you're generous, hundreds of kilobytes.

As a more mainstream example you might have heard of, the Arduino, I think the baseline Arduino has around 32KB available.

Also, in the embedded space, the compilers used are notorious for not having a complete standard library implementation. Lots of things are missing. Thus, developers who care about porting their code to different chips are careful to limit which standard library functions they depend on.

Again, as a more mainstream example you might have heard of, look at Lua, which is implemented in 100% ANSI C. Because they care about Lua truly being portable to everywhere, they try to constrain their dependence on the standard library, and they supply some #ifdefs in luaconf.h to implement or workaround some standard library functions they know to be often missing.

And as a more mainstream example of an incomplete C standard library, look at Android. Bionic is the name of their C standard library. It is missing a lot of things. (And you have to use one of those #ifdefs in luaconf.h to compile Lua successfully on Android to workaround Bionic.)


There's some nuance to be had here. The important implications of a larger C standard library aren't technical, they're social--i.e. the implications of a larger standard library for teams using C are negative. Let's look at two other languages: C++ and Python.

C++ has a large standard library. This ends up being one of C++'s biggest downfalls: the various libraries don't play nicely with each other, have many different ways of doing the same or similar things, each have complex gotchas--and it's impossible to know them all, so you never really know C++. It's full of dark corners, and the result is bugs, and difficult debugging, and teams that try to enforce only using subsets of the language (and inevitably fail a little). This problem is only getting worse over time. The reason C++'s new features haven't just been backported back into the C standard is that people saw this coming and didn't want it happening. C is already a complex language, but it's still simple enough that you can actually reasonably read the entire C standard library (I have; QED).

A large standard library doesn't have to be this way. Python has a large standard library--this is the "batteries included" of Python. The reason Python can get away with this is that they can deprecate their mistakes. And they are: quite a few modules in the standard library will disappear with the Python 2 sunset. But this is not without downsides--everyone who has been working on a major Python project in the past few years has felt the pain at some point. And C is not Python: it's the backbone of a lot of other applications, so breaking changes are far more costly for C than for Python. C is "once supported, always supported", and for very good reason, so we can't afford to introduce mistakes to the standard.

So, if you don't want the problems of all the cruft in C++'s standard library, and you don't want the deprecations of Python's standard library, your only remaining choice, by process of elimination, is to have a very slim standard library, and make VERY conservative choices about what to add to it.

I think that we should add a new string library to C when strings are a solved problem. But contrary to what people are saying, strings aren't a solved problem. We shouldn't add a string library that doesn't solve internationalization, for example. And in fact, with that problem, the fox is in charge of the henhouse, making the problem worse: we've got people in the Unicode team adding fsking emoji to the standard. Unfortunately, I think we're going to have to wait until Unicode collapses under its own bad decisions and is replaced by a better standard, and that standard becomes mature, before committing to that proven solution. Otherwise we end up with ICU-related crap in the "once supported, always supported" C standard library.


> C++ has a large standard library.

Is it that large? i.e. is it larger than, say, Java's ?

> it's impossible to know them all, so you never really know C++

This is semantic quibbling. You know a language more if you know its commonly-used libraries better - whether they're in the standard or not.

> the various libraries don't play nicely with each other

They may not be in perfect synch, but they play ok-ish-ly with each other - as far as I can tell.

However, not playing nice is much much more of a problem with non-standard libraries - in various languages. I'm sure C programmers have that experience with a vengeance... say, everybody using different implementations of basic data structures like queues and lists and hash tables and so on.

> It's full of dark corners

Yes, yes it is. And they're pretty scary... however, that's not really because of the library size. Also, those dark corners exist, unexplored, for C libraries; it is arguably better that they be mapped once and for all, at least some of them which are somewhat-frequently stumbled upon.

...

More generally - the issue you're describing seems to be more with the combination of "iron-clad backwards compatibility" and a larger standard library:

- C has the former, not the latter

- Python has the latter, not the former

C++ has the former (up to minor points like the `auto` keyword), as well as the latter, but with warts.

The warts are there, that's true, but they're not so bad. And use of the standard library does pay off very nicely - often removing warts in your own code.


> Is it that large? i.e. is it larger than, say, Java's ?

Large, yes. Larger than Java's? I don't know.

It's larger than C's, that's for sure.

> This is semantic quibbling. You know a language more if you know its commonly-used libraries better - whether they're in the standard or not.

If it's in the standard library you better bet it gets used, even if only by your dependencies.

> I'm sure C programmers have that experience with a vengeance... say, everybody using different implementations of basic data structures like queues and lists and hash tables and so on.

Eh, it's not too bad in C. Libraries tend not to force you to use their data structures too much, and if they do, you just don't use that library. Some, you use their data structures because that's what you included the library to get. It's certainly a problem, but dependency management is easier than having dependencies you can't make go away.

> The warts are there, that's true, but they're not so bad. And use of the standard library does pay off very nicely - often removing warts in your own code.

So use C++, if you like it.

My post wasn't an attack on C++. C++ has its upsides. I only described the downsides of C++ because I was explaining what C is trying to avoid.


Here's a stupid but somewhat relevant case: Recently, I saw a surprisingly insidious variation of the "FizzBuzz" question in C. Like the usual FizzBuzz, but with a twist: you are required to return your answer as a char * with "\n" as a separator, and the memory should be allocated dynamically, assuming the caller calls free(). A naive implementation strcat()s and realloc()s the string a thousand times over; a better solution is to overestimate the required memory and malloc() once. Adding a different constraint can quickly turn the question into "reimplement std::vector in C", and in real applications something like this occurs daily; all C programmers end up with their own standard library, so I guess C++ is at least useful for its standard library...


How is this a stupid interview question?

There are a lot of positions for which this would be a great indicator of whether or not the candidate could do the job.

If low level programming is involved, the candidate had better be able to handle this. Just because you don't have to solve this exact problem every day doesn't mean you're not going to need those skills for systems programming, writing drivers, embedded programming, etc.

I suspect this would have an extremely high weed out rate actually. A very large % of so called software engineers would fail this, certainly anyone who did all their learning in a dynamic language with automatic garbage collection and who had not branched out into lower level programming.

It's a dumb question if the job position is for a Javascript UI dev of course.. it's not a good question for all positions.

I would also say it's a travesty for any school to award a bachelor's degree in computer science to someone who would fail this question. I've interviewed a large # of candidates with master's degrees who would fail this. Thankfully, most of those candidates didn't have a bachelor's in computer science + a master's in computer science from a highly ranked US institution.

Also someone who groks this problem and solves it is way less likely to write memory leaks into code in a garbage collected language.


I didn't really mean to criticize the question itself, just to use it to make a point about C's stdlib that the OP was talking about.


Am I missing something here? Assuming you're writing `char * fizzbuzz(unsigned n)`, can't you determine exactly how many bytes you need at the outset from `n` and just malloc once?
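
e.g., a sketch of what I have in mind (the fizzbuzz_size helper is hypothetical; snprintf(NULL, 0, ...) counts the digits):

    #include <stdio.h>
    #include <stdlib.h>

    /* Sketch: count the exact bytes FizzBuzz(1..n) needs, then malloc once. */
    static size_t fizzbuzz_size(unsigned n)
    {
        size_t total = 1; /* trailing '\0' */
        for (unsigned i = 1; i <= n; i++) {
            if (i % 15 == 0)     total += 8; /* "FizzBuzz" */
            else if (i % 3 == 0) total += 4; /* "Fizz" */
            else if (i % 5 == 0) total += 4; /* "Buzz" */
            else                 total += (size_t)snprintf(NULL, 0, "%u", i);
            total += 1; /* '\n' separator */
        }
        return total;
    }

    char *fizzbuzz(unsigned n)
    {
        char *out = malloc(fizzbuzz_size(n));
        if (!out) return NULL;
        char *p = out;
        for (unsigned i = 1; i <= n; i++) {
            if (i % 15 == 0)     p += sprintf(p, "FizzBuzz\n");
            else if (i % 3 == 0) p += sprintf(p, "Fizz\n");
            else if (i % 5 == 0) p += sprintf(p, "Buzz\n");
            else                 p += sprintf(p, "%u\n", i);
        }
        return out; /* caller free()s */
    }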


There are some variations of fizzbuzz that have you print out the number of iterations performed at that point on each line. The memory consumption could still be predetermined, but the math is a bit more complicated. Other variations have you reading integers from a stream (integers are not in any order), printing "Fizz" and "Buzz" when it's a multiple of 3 and 5 respectively, and that does not have a predetermined memory footprint.

While FizzBuzz has become notorious for weeding out total incompetence [1], there are some variations of the problem that are unexpectedly nuanced.

1. https://thedailywtf.com/articles/The-Fizz-Buzz-from-Outer-Sp...


I'm not really sure what's the point of this exercise instead of a regular FizzBuzz. Dealing with strings in C is tedious, error prone and results in very verbose code but it's not really difficult per-se. What is this testing for exactly?

If I had to do something like that in a real program I'd very much overalloc a few bytes (because why not? If those small allocations really become a problem in your app you probably want to use a custom allocation scheme anyway) and then just snprintf the result into it. I'd go even further and say that code that would micro-optimize this to save a couple of bytes at the cost of maintainability shouldn't pass a code review IMO.


Exactly.

It's why I called it "stupid", "insidious", but "somewhat relevant", as the parent comment proposes a better stdlib in C to handle strings.


For some jobs, tolerance to tedium is considered a skill.


FizzBuzz covers a lot of (very basic) ground but adding memory manipulation is valuable. I ask a few simple string manipulation interview questions for embedded C developers as a weeder and the point is that we want to see how candidates deal with pointers and low level memory. There are other good follow-ups and alternatives you can build off this. Engineers with years of experience fail these questions all the time. Dealing with 'tedious' things like this is daily part of some developer jobs.


Is it not possible to keep allocating contiguous memory? Just keep appending to the string and allocating more memory. Repeated calls to malloc are not guaranteed to be contiguous but I think you can do this with posix_memalign. Just keep growing the heap segment, and appending more character to it. I suppose it's not going to be contiguous in physical memory, but it should in virtual memory.


You can use realloc. It will try to use the existing contiguous span of memory, but if it cannot, it will malloc a new span and copy the previous bytes in and return a pointer to it.
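
The usual pattern grows the buffer geometrically so repeated appends stay amortized O(1); a sketch (hypothetical helper, overflow checks omitted):

    #include <stdlib.h>
    #include <string.h>

    /* Sketch: append s to a growable buffer, doubling capacity as needed. */
    int append(char **buf, size_t *len, size_t *cap, const char *s)
    {
        size_t n = strlen(s);
        if (*len + n + 1 > *cap) {
            size_t newcap = *cap ? *cap : 64;
            while (newcap < *len + n + 1)
                newcap *= 2;
            char *tmp = realloc(*buf, newcap); /* may move; copies for us */
            if (!tmp) return -1;               /* *buf is still valid here */
            *buf = tmp;
            *cap = newcap;
        }
        memcpy(*buf + *len, s, n + 1); /* copies the '\0' too */
        *len += n;
        return 0;
    }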


In that case, you need to test if the memory is contiguous and copy to the new pointer if not. O(N^2) in the worst case, where every memory allocation returns a distinct start pointer.

What I was getting at is that in virtual memory the heap is always contiguous (even if it isn't in physical memory, but we only need it to be contiguous in virtual memory). So one can guarantee that the solution will never require copying data if the program exclusively uses the stack, and the resultant string is the only piece of data on the heap. You always add contiguous memory so your string can dynamically grow without ever copying to a new buffer. This probably requires allocating memory through the use of OS-specific syscalls to request new memory pages instead of malloc or realloc.


Yes, the kernel will map virtual to physical however it sees fit, no guarantees there (at least not in user-space via glibc). Realloc will always return a virtually contiguous slice or NULL if it can’t. And you’re right, obviously all those potential subordinate mallocs would be inefficient. :) In this contrived FizzBuzz example case, you could pretty easily do the math and just malloc it all in one go at the start of the function. Fizz and Buzz are the same size, you would just need to add up the iota lengths and the trailing \0. If you take an arg for N (1..N) then stack allocations are not going to work. You need to statically declare their size.


What I'm getting at, though, is that it's possible to implement it without any copies even if the memory consumption cannot be predetermined. Bypass malloc() and realloc() entirely and invoke brk() to increase the program's segment size. This grows the data segment contiguously (in virtual memory), so there will never be a need to copy the result string to a different buffer. In other words point the result pointer to the start of the heap, and never put anything else on the heap.
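
Roughly, as a sketch (using sbrk(), the classic but long-deprecated interface; this assumes nothing else in the process allocates from the heap):

    #include <stdint.h>
    #include <string.h>
    #include <unistd.h>

    /* Sketch: grow the data segment contiguously and append in place. */
    static char *base; /* start of the result string */
    static char *end;  /* one past the last byte written */

    static int append_bytes(const char *s, size_t n)
    {
        if (base == NULL)
            base = end = sbrk(0);        /* current program break */
        if (sbrk((intptr_t)n) == (void *)-1)
            return -1;                   /* out of memory */
        memcpy(end, s, n);               /* the new bytes are contiguous */
        end += n;
        return 0;
    }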


It's O(N) in the worst case if you realloc exponentially, at the cost of O(N) worst-case extra unused space at the end.


While I often implement strings along the lines you suggest I disagree that this feature belongs in the standard library. For one thing how big (sizeof) should the length prefix be? 16-bits? 32-bits? 64-bits?

The beauty of the C string is that it is minimal and straightforward: if you have a pointer to it, you know where it ends. And it works with pretty much any OS out there.

(My personal variant of dynamic strings are refcounted, length prefixed and nul-terminated with the pointer pointing to the traditional C string.)


> For one thing how big (sizeof) should the length prefix be? 16-bits? 32-bits? 64-bits?

size_t, or uintptr_t [1]. Maybe uint32_t, although given pointer alignment, smaller-than-pointer for size doesn't help all that much.

> The beauty of the C string is that it is minimal and straightforward: if you have a pointer to it, you know where it ends.

And if you screw up adding a null byte, client code will happily keep running through the rest of memory to find it. And if your string has embedded nulls, you're SOL anyways.

> And it works with pretty much any OS out there.

And for safety reasons, the OS generally has to assume a maximum length anyways, otherwise, you tend to have security vulnerabilities.

[1] size_t is not necessarily the same size as uintptr_t. An architecture with a 64-bit base pointer shared across several 32-bit indexes is an interesting idea that could use some more programming exploration.


Using size_t as the common solution seems like the way to go.

    struct string { char *data; size_t size; size_t alloc_size; };
    struct string_view { char *data; size_t size; };
And we have the same design that C++ uses.


C++ implementations sometimes optimize for storing short strings inside the string structure itself. One approach looks something like this on 64-bit machines:

    struct string {
        union {
            struct {
                char *ptr;
                size_t capacity;
            };
            char str[16];
        } data;
        size_t length;
    };
It looks complex, but it’s actually a really nice design because it requires no separate allocation for strings less than 16 bytes long, which is a common case, and the strings are relatively compact. And, it stores a capacity parameter which allows you to know when it’s safe to grow the string without allocating, making it possible to implement efficient repeated concatenation.
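
One sketch of how the two cases can be told apart (hypothetical accessors; real implementations such as libc++ pack a discriminator bit into the length or capacity fields, but here I just assume the invariant that strings shorter than the inline buffer always live inline):

    /* Sketch: accessors assuming short strings are always stored inline. */
    static inline char *string_data(struct string *s)
    {
        return s->length < sizeof s->data.str ? s->data.str : s->data.ptr;
    }

    static inline size_t string_capacity(const struct string *s)
    {
        return s->length < sizeof s->data.str ? sizeof s->data.str - 1
                                              : s->data.capacity;
    }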


I do not think I have ever seen a string that needs more than 64K, but then again I do not read whole files in memory in C. (I do in Python.) So this would be wasteful in many cases.

My personal variant uses 32-bits for length and 32-bits for refcount, which aligns well on most platforms, including 64-bit ones.

Something like:

    struct string_header
    {
        int32_t refcnt;
        int32_t length;
        // character content starts here
    };
Given a char pointer, to get to the header you must do:

    ((struct string_header *)p)[-1]
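
The allocation side of that variant might look something like this sketch (a hypothetical constructor; header and content in one malloc):

    #include <stdint.h>
    #include <stdlib.h>
    #include <string.h>

    /* Sketch: allocate header + content together, hand out the content pointer. */
    char *string_new(const char *src, int32_t length)
    {
        struct string_header *h = malloc(sizeof *h + (size_t)length + 1);
        if (!h) return NULL;
        h->refcnt = 1;
        h->length = length;
        char *p = (char *)(h + 1); /* content starts right after the header */
        memcpy(p, src, (size_t)length);
        p[length] = '\0';          /* still usable as a plain C string */
        return p;
    }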


>I

>I

>I

There is no 'I' in standard libraries :)


How would you use the refcount?


I use the refcount to avoid copying the string when passing it around. I just refcount it instead.

To clarify this is part of a larger library that includes other memory management machinery, such as autorelease pools, etc.


Why not go full Pascal strings, and go for 8 bit?

The good thing about 64-bit lengths is you never need to worry about overflowing the string. The bad thing is no one knows what to do about, or codes for the possibility of, the string overflowing. 8 bits 'should' keep people from making that assumption.


That was my thought earlier this morning.

Something, anything, is better than the total shit design we've been living with.


> For one thing how big (sizeof) should the length prefix be? 16-bits? 32-bits? 64-bits?

Exactly like what is returned by strlen... ?


Is your string library open sourced? Could you provide a link?


No matter what string manipulation method you use, you have to produce linear, null-terminated strings for passing into all sorts of existing APIs, standard and not.


That's really not an issue when done for you. std::string does it just fine.


> null terminated strings are a joke.

There was a quiet market war some forty years ago about that, and Pascal lost.

But yeah, now that we can afford it, the downsides vastly exceed the vanishing cost.


I think the problem with Pascal's strings wasn't that they stored a length. The problem was that every string variable always occupied 256 bytes of RAM and no string could ever be bigger than 255 characters.


UCSD-, TP/BP-, Freepascal?


I was under the impression that TP strings grew dynamically (in memory), but I just checked for TP and this seems to be wrong. Probably too much time has passed since then ;)


And looking at modern languages, I would say that C strings also lost, just later, since none of them do strings that way anymore.


A good reason for adding a modern string type to C would be for better interop between other languages and C.


Agreed. I feel like such a library probably also ought to support slices/arrays in the same style. I get that the C community is conservative, but this seems like a pretty non-invasive change that would make C code much less bug prone, and faster to boot!


Yep, ISO C should get an optional library, that embedded systems may want to skip, offering things like: dynamic strings, linked lists with iterator, an ordered dictionary data structure with iterator (API will be like a hash table basically), a PRNG that is not a joke. With just that it will feel like programming in a different language.


Several flavours of BSD provide header-only libraries for linked lists and search trees as part of their libc in sys/queue.h and sys/tree.h, which are remarkably nice and easy to use. newlib libc provides these, and macOS used to provide them in /usr/include (now they're only in the kernel headers - but they're header-only and self-contained so they're easy to add to any project).

sys/queue.h provides four different linked list implementations: standard singly-linked lists, standard doubly-linked lists, a singly-linked list with a tail pointer (for fast tail insertion) and a doubly-linked list without a head pointer. All come complete with macros for insertion, removal, and traversal.

sys/tree.h provides two types of balanced search trees: splay trees and red-black trees. Both provide ordered dictionaries.

Both libraries make quite extensive use of macros, but they are quite reasonably implemented and documented via man-page. I've successfully used them a few times. Shame that, like other nice BSD innovations (e.g. strlcpy/strlcat) they haven't made it into other libcs.
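
For a taste, a sketch using the SLIST_* macros:

    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/queue.h>

    struct node {
        int value;
        SLIST_ENTRY(node) link; /* the linkage is embedded in the element */
    };

    SLIST_HEAD(node_list, node);

    int main(void)
    {
        struct node_list head = SLIST_HEAD_INITIALIZER(head);
        for (int i = 0; i < 3; i++) {
            struct node *n = malloc(sizeof *n);
            if (!n) break;
            n->value = i;
            SLIST_INSERT_HEAD(&head, n, link);
        }
        struct node *n;
        SLIST_FOREACH(n, &head, link)
            printf("%d\n", n->value);
        return 0;
    }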


> macOS used to provide them in /usr/include (now they're only in the kernel headers - but they're header-only and self-contained so they're easy to add to any project)

They've been moved to the Xcode Command Line Tools with the rest of the standard headers.


> doubly-linked list without a head pointer

Off-topic, I know, but what's the benefit of this and how does it differ from a doubly-linked list without a tail pointer?


ISO C already mandates that most of its library is optional for "freestanding" implementations. (float.h, iso646.h, limits.h, stdalign.h, stdarg.h, stdbool.h, stddef.h, stdint.h, and stdnoreturn.h are the only headers required even for freestanding implementations).


What if we just created a standard [and related working group] that exists "downstream" of ISO C, where there is one release of the "C-with-batteries" per ISO C release (e.g. Cwb11 would descend from C11; Cwb18 from C18; etc.) that just adds these sort of "obvious" libc parts (or libcwb parts, I guess) that the C standards committees don't seem to want to bring into the "base" libc, without changing C itself otherwise?

Then compilers wanting to support the "with-batteries" superset standard could just allow their flag that looks like "-std=c90" to be extended like "-std=c90+batteries".


It is called Objective-C, and it was pretty popular recently. Strings, arrays, classes, JSON serialization, etc.

Or you can JUST use D.

Why do people always skip the enormous work already done decades ago and whine that they're missing this and that in C?


If every C compiler and C runtime shipped with CoreFoundation, I'd agree with you. Some undertakings are too enormous for 100 different implementors to want to go through the trouble of making their own implementations of it!


Why would you need compiler support? It's not like C doesn't already support libraries.


The whole point of libc—or any language runtime—is that it's "just there." In other words, insofar as a piece of code "is C", it can expect to have access to these libraries; and insofar as a compiler "is a C compiler", it will be expected to link your code to a runtime that includes these libraries.

What this means is that:

• people who are learning a language, can learn language features "through" examples that rely on the included batteries to demonstrate a point (for example, image support in DrRacket, or the HTTP client in Go), without needing to also learn everything involved in ecosystem package management first;

• people who want to write small, self-contained, yet portable utility programs (e.g. coreutils), can just rely on the language runtime and its presence on basically every OS, rather than declaring package dependencies (= not portable) or statically linking in their own libraries (= not small). The more stuff a language's runtime does, the more such programs become possible to write in said language.

• features in a shared runtime can rely on other things in a shared runtime; and the library ecosystem of a runtime can use the runtime's data-structures as a lingua franca to specify library APIs in terms of. JS libraries return promises because the JS runtime includes promises. Elixir libraries pass around DateTimes because the Elixir runtime specifies a DateTime type. Go libraries take and return slices, maps, and channels because those are things that exist in the runtime. If these runtimes didn't have these things—even if the languages had fancy macro systems that meant that pulling in the relevant library would enable exactly the same syntax—then library support for these would be fragmented, rather than expected. Exactly the way library support for prefix-length strings is in C.


String literals without having to process an ASCIZ string into a ptr+size string at runtime would be nice. C++ has language support that could enable that regardless of a future string format, of course, but C doesn't.


What's your take on the benefit of this over actually programming in a different C-like language (zig for instance)?


serious question: why doesn't C improve much?

I understand there are benefits to small or no changes whatsoever.

I also understand there are a number of standards (1990, 95, 2011...)

Is it that all the people who would be interested in improving C just naturally migrate to C++ which is getting improvements?

Your example of better strings is good, but why not lists and hashes as language elements?

If not in the language directly, then ok in the library. But speaking of the library, why limit to just a few things, why not batteries fully included. shoot for knuth in the standard library!


> Your example of better strings is good, but why not lists and hashes as language elements?

Probably because there are many ways to implement data structures, for very specific, and entirely different use cases. C gives you the minimum that allows you to build those according to your own specific needs. It would be very difficult or perhaps even impossible to come up with an implementation that suits everyone's specific cases. Perhaps we could do it by giving the users the option to pick, but then the language would be really huge, it would quickly become bloated. Think about how many ways one could implement a hash table! I like and use ohash extensively though, along with stuff from sys/queue.h. For hash tables, uthash seems to be popular, too.

> If not in the language directly, then ok in the library. But speaking of the library, why limit to just a few things, why not batteries fully included. shoot for knuth in the standard library!

There are many libraries out there. GLib comes to mind. There is also Gnulib, Klib, and God knows what else. I am sure there are a lot of libraries. I have written my own private library of common data structures, subroutines, and so on. You may want to take a look at: https://github.com/kozross/awesome-c


In the case of strings, it's because in practice it's just not that much of a problem. If you're not in a highly constrained environment, you'd just use one of several libraries, like GLib, that have strings and other stuff you'd typically find in the standard libraries of other languages. These libraries have the freedom to evolve faster, break backward compatibility, not worry about committee design and keeping everyone happy, limit themselves to certain classes of machines, and do all sorts of things you can't in a standard library.

C has an anemic standard library but many libraries you can use as your standard, and this keeps everyone happy. I'd expect the javascript/npm world to evolve this way in future.


It all depends on your definition of improvement...

C is a bit like a Formula One car. Regardless of the historic reasons that caused it to evolve the way it did, it ended up occupying a niche and being very well suited to it, while at the same time being ill suited for other, more general purpose, uses. Under that perspective, change is slow because the people most invested in it want to make improvement happen in a very narrow and precise direction. McLaren, after all, is not going to bat an eye if you complain about its latest model lacking a baby seat!!!

C++, on the other hand, started sort of like NASCAR racing. It wanted to make a racing car out of a mundane, everyday car, and it got very, very good at it. Unfortunately, because C++ is based on an everyday car, and because it is Designed-by-Committee (TM), it shows lots and lots of "improvements" that individually kind of make sense but in bulk lack any coherence. Nowadays, C++ may perform like the Batmobile (from the "Batman Begins" movie) on a good day; but you never know when it is going to bite you in the ass and turn into the Homermobile from the 90's Simpsons TV show.

It is due to a mix of necessity and lack of insight that many of the people in my generation had to learn how to drive in fucking racing karts!!! But when it's all said and done, it gives you a little perspective on how things work and helps you appreciate the differences.


Sounds an awful lot like C should just become C++ already...


No need, C dynamic string libraries totally look like plain C.


No doubt, but they are all non standard which is irritating for people who expect that out of the box.


This is bait, but whatever. Keeping some properties of C is very hard if you go this route. It also seems you'll end up playing language-design whack-a-mole, the problems of which manifest themselves in C++ and co.


What they really want is to find a way to convince their boss to rewrite everything in Go.


std::string has its own share of problems, but at least it's useful for C programmers as a warning on how not to implement a string library (for instance: a string shouldn't be important enough to give it its own allocation, it shouldn't be mutable by default, there should be string manipulation functions which are actually useful in real world situations instead of being an academic exercise, and so on...).


I’m not sure I understand your first and third complaints and the second one is why const exists.


> length alongside with the (binary safe) string itself,

What is the type of the length?

What is the type of the length+string?

What lengths is it limited to?

Do we need a 32 bit and 64 bit version? And 128+ bit version?

Can you provide an implementation? What is its best and worst case efficiency?

What is its expected efficiency for the typical string operations?


One problem is that POSIX uses null-terminated strings everywhere.


So can’t you null terminate your length-prefixed string and pass a pointer to the data to POSIX? From what I understand, that’s how C++’s std::string works with ::c_str().


Starting from C++11, yes.


> null terminated strings are a joke

I think that the null-terminated representation of the raw string data is a fantastic design; the best of all conceivable alternatives.


I think memccpy is an improvement, so I support it, but it's still complicated to use in practice. The article gives this example for a copy followed by concatenation:

    char *p = memccpy (d, s1, '\0', dsize);
    dsize -= (p - d - 1);
    memccpy (p - 1, s2, '\0', dsize);
Notice that you have to recalculate dsize (correctly, without off-by-one errors!), it assumes you have dsize available in the first place, and this doesn't detect overruns (which in many cases you should do).

So real-world code would look more like this:

    // dsize is the space *available* in d, including \0
    size_t dsize = sizeof(d); // if d is an array
    // ...
    char *p = memccpy (d, s1, '\0', dsize);
    dsize -= (p - d - 1);
    if (dsize <= 0) goto overflow; // handle overflow
    char *q = memccpy (p - 1, s2, '\0', dsize);
    dsize -= (q - (p - 1) - 1);
    if (dsize <= 0) goto overflow; // handle overflow
It's a little easier to understand than strncat/strncpy versions, it doesn't unnecessarily read its inputs past where they are needed like strlcat/strlcpy do, and it's more efficient than snprintf. So yes, it's an improvement and I support it. However, this is still rather complex; in particular, it's way harder to understand compared to code that uses snprintf, and certainly harder to understand than pretty much any other programming language higher level than assembly.

So let's accept this improvement, and keep striving to do better.


Yeah I think strlcpy is way better exactly for this reason. I'm sure the idea behind returning the pointer was call chaining, but you shouldn't ever be doing that anyway, and with strlcpy you basically can't do an off-by-one.
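
With the BSD return-value convention, copy-then-concatenate with truncation checks is just (a sketch):

    char buf[64];
    if (strlcpy(buf, s1, sizeof buf) >= sizeof buf)
        goto toolong; /* s1 alone didn't fit */
    if (strlcat(buf, s2, sizeof buf) >= sizeof buf)
        goto toolong; /* concatenation was truncated */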


It's true that strlcpy is easier to use. One challenge is that strlcpy always reads its source to completion, even when that is not needed to make the copy, because it has to compute the total length of the source string, ignoring limits. In some circumstances that is a problem.


If you want simple, it's hard to beat the original K&R (1978!)

    while(*p++ = *q++);
which is what sold a generation on the whole idiom. Of course its time is long gone but useful to learn.


This construct:

    while(*p++ = *q++);
is simple. But I agree with you, its time has long gone, because in many programs this code is also wrong. This idiom assumes that the source can never be longer than the destination. There are now a legion of attackers who will exploit this code and harm its users.

Modern C programs often have to work in the presence of attackers.
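
A bounded variant of the same idiom is only slightly longer; a sketch:

    /* Sketch: copy at most dsize-1 characters and always '\0'-terminate. */
    char *copy_bounded(char *dst, const char *src, size_t dsize)
    {
        if (dsize == 0)
            return dst;
        while (--dsize && (*dst++ = *src++))
            ;
        if (dsize == 0)
            *dst = '\0'; /* source was truncated; terminate explicitly */
        return dst;
    }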


size_t is unsigned, so (dsize <= 0) is identical to (dsize == 0) ;)


That is true, but I find the less than or equal to be logically clearer. YMMV.


I feel like one alternative is missing from the list:

  fp = fmemopen(buf, sizeof(buf), "w");

  fputs("one", fp);
  fputs("two", fp);
  fputs("three", fp);

  fclose(fp);
Not sure how it performs, but it reads pretty well IMHO.


Why are embedded developers unnerved by the concept of a featureful standard library? There are Linux distributions aimed at statically compiled, musl-based packages, hence you can very much choose what you need for your project.

Look at C++'s std::string in GCC's, Clang's, and MSVC's standard libraries and their respective development histories. Of course you can make a minimalistic standard string and also eliminate null-pointer checks, trailing \0 checks (everyone passes a size_t len anyways), and allocation issues at runtime.

The only standard thing about C strings is vulnerabilities.


I used to work on routers that essentially ran embedded Linux inside, and after a while our new projects all started being C++. It's amazing: after the switch to C++, we basically stopped ever seeing string/array-related segmentation faults now that developers use std string/vector/etc. by default. Yeah, it's more resource expensive, but we have plenty of RAM now on these boards, and it's totally worth not having a maintenance nightmare like our legacy C projects, which I swear get a new bug report every other month about a newly discovered string/array-related segmentation fault. (I'm not saying C++ can't ever be a maintenance nightmare, to be clear, but as far as memory-related issues go, they disappeared the moment we started using the std library.)


Interesting, did you make your own allocators? Can you tell me the company?


The refusal of glibc (and I suppose, POSIX) to adopt strlcat/cpy is continually obnoxious.

That said, if you actually need efficient string operations, you probably want a Rope data structure rather than any libc primitive.


strlcat and strlcpy are not adopted because they are a really bad design. Using them correctly takes more code than not using them. In practice they are never used correctly, making them an "attractive nuisance": a feature that causes more trouble than its absence.


This argument doesn't hold water. They're absolutely no worse than strcat/cpy and strncat/cpy, which glibc implements. I totally disagree with your premise that truncation is an incorrect use.

In reality, the alternative to strlcpy/cat isn't "force programmers to write correct code," it's "programmers will just use the crappier available functions with even worse behavior on overrun."


Perhaps you have some better reason why Posix has rejected it, again and again? Some sort of conspiracy is conceivable, but in service of what?

I have seen much, much better designs, that take into account that these functions are rarely called in isolation. In those, calls cooperate with previous and subsequent calls to share the burdens of maintaining correctness and safety.


POSIX never rejected strlcpy/strlcat, because they were never submitted for inclusion. Also, POSIX doesn't control the str* namespace, ISO C does. Not that it matters, as it wasn't submitted there either.

But that's beside the point. Every major OS's libc has an implementation of strlcpy/strlcat, and OpenBSD's can be readily lifted into a project's source tree as it is portable, a simple code search will reveal the breadth of adoption. The /only/ exception is glibc now. And glibc is not a standards committee, for years the primary objections came from one person.

You're being dishonest.

https://github.com/search?q=strlcpy&type=Code


I've never used a rope library, but having read a few essays on those I keep coming back to it.

I also keep coming back to hiding implementation details behind closures. I'm likely not smart enough to understand why that's a bad idea tho.


C coders of HN, do you use plain vanilla C strings in your project(s)? I was under the impression that most (at least the bigger ones) use some custom length-carrying string type to avoid exactly these sorts of problems.


It depends. If I'm writing a library, it's bare pointers. If I'm writing something not a library that's big enough, I'll use a struct {size_t length; char* string;} where the length is the string length, and string contains (length) characters + a nul byte. I might even mix in allocation data for the total allocated size of the buffer if it's important enough.

Simple to implement and use (and also backwards compatible), provided you have a library of common functions for allocating, copying, etc.

If I'm size constrained, I'll consider uint16_t for the length field. If I'm REALLY size constrained, I'll use a VLQ [1] for the length field and take the slight performance hit.

[1] https://github.com/kstenerud/vlq/blob/master/vlq-specificati...


The right thing to do here is generally to avoid dealing with strings at all: you only need it to parse input and print/log output, beyond that everything should be using integer identifiers and handles.


You can use small strings directly on the stack for massive performance gains in many scenarios, where such strings are manipulated in the same context with macros or inner functions (called from within the current string stack context). https://gcc.gnu.org/onlinedocs/gcc/Variable-Length.html#Vari...
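
A sketch of the idea (a VLA scratch buffer, valid only inside the enclosing frame):

    #include <stdio.h>
    #include <string.h>

    /* Sketch: build a short-lived string on the stack, no malloc involved. */
    void greet(const char *name)
    {
        char buf[strlen(name) + sizeof "Hello, !"]; /* VLA sized at runtime */
        sprintf(buf, "Hello, %s!", name);
        puts(buf); /* must be used before the frame goes away */
    }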


I use plain vanilla strings. I think most of the problems with them are exaggerated, given senior enough teams.


These days my work is focused on kernel/systems programming, so plain C (char or wchar_t) strings outside the kernel, and whatever the kernel requires inside it (e.g. UNICODE_STRING on Windows).

When I do apps I have my own length-prefixed variant.


No. I would use GLib or sds. If I'm writing a library I generally try to avoid allocation, so that usually renders the question moot in those cases.


In the projects I work on (which are usually small) I use plain C strings.


"The strlcpy and strlcat functions are available on other systems besides OpenBSD, including Solaris and Linux (in the BSD compatibility library) but because they are not specified by POSIX, they are not nearly ubiquitous."

This ignores how often they are (re)implemented in userland. glib, X and even the linux kernel have implementations. Perhaps we could just standardize what programmers chose rather than allow glibc an unjustified veto?


memccpy definitely has some advantages, so it should be in the ISO standard.

But memccpy has its own problems. In particular, when concatenating you have to constantly recalculate the "space remaining"; that is just asking for an off-by-one error that leads to a buffer overflow, and makes it more complicated to use. The discussion here doesn't detect attempted overflows, and that's a mistake; you often need to not just prevent an overflow, but you also need to detect an attempted overflow and do something different. You also have to pass \0, which makes the function call more complex (and perhaps under-optimized) since \0 would in nearly all cases be the parameter passed.

So I'm glad this is being added, but it's at most a small step to improving simple string copying and concatenation in C.
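
One mitigation is a tiny wrapper that owns the bookkeeping; a sketch (a hypothetical helper, not part of the proposal):

    #include <string.h>

    /* Sketch: append src at *pos, tracking the remaining space. Returns 0 on
       success, -1 on truncation (dst stays '\0'-terminated either way). */
    int str_append(char **pos, size_t *remaining, const char *src)
    {
        char *end = memccpy(*pos, src, '\0', *remaining);
        if (end == NULL) {            /* no '\0' copied: src did not fit */
            if (*remaining)
                (*pos)[*remaining - 1] = '\0';
            return -1;
        }
        *remaining -= (size_t)(end - *pos) - 1;
        *pos = end - 1;               /* next append overwrites the '\0' */
        return 0;
    }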


The Microsoft approach to this was to make a set of replacement functions (strsafe.h [1]) that are very explicit and not at all “clever”, as the strsplcasdfcpy functions seem to want to be. They return an error code so it’s obvious when the operation did what was expected, or ran out of space.

[1] https://docs.microsoft.com/en-us/windows/win32/menurc/strsaf...


But most people want to use standard, portable calls.

The C standard tried to add such functions in "Annex K", but unfortunately annex K hasn't received much of a pickup (for various reasons).

So in many places the problem continues.


This is one of the reasons I tend to avoid str* functions in the first place, except for one strlen() per string.

The way I copy and concatenate strings typically looks like:

  int len1 = strlen(str1);
  int len2 = strlen(str2);
  char *buf = malloc(len1 + len2 + 1);
  if (buf) {
      memcpy(buf, str1, len1);
      memcpy(buf + len1, str2, len2);
      buf[len1 + len2] = 0;
  }
Of course, not memcpy()ing data around is even better if I can avoid it.


s/int/size_t/ and check for overflow.


Totally right about size_t, my bad; hopefully the compiler will raise a warning.

As for integer overflow, I don't actually know how to handle it properly. In normal conditions, it is unlikely to be a problem: if the two strings can fit in memory, the sum of their sizes should fit in a size_t, but I agree that making such assumptions can be a bad idea.

Maybe the best way is to limit the size of the input strings to a reasonable value. That would prevent many out-of-memory situations and potential DoS too.


> As for integer overflow, I don't actually know how to handle it properly.

Something like:

   if (str1 && str2) {
      size_t len1 = strlen(str1);
      size_t len2 = strlen(str2);
      size_t buf_len = len1 + len2 + 1;
      if (len1 < buf_len && len2 < buf_len) {
         char *buf = malloc(buf_len);
         if (buf) {
            memcpy(buf, str1, len1);
            memcpy(buf + len1, str2, buf_len - len1 - 1);
            buf[buf_len - 1] = '\0';
         }
      }
   }
(I probably made a mistake above.)

As you suggest, you'll probably run out of memory before you'll overflow, so in reality, you want to check len1 and len2 are some sane value, but of course, library functions don't usually have that luxury. Take a hint from git:

    #define unsigned_add_overflows(a, b) \
        ((b) > maximum_unsigned_value_of_type(a) - (a))

    if (unsigned_add_overflows(extra, 1) ||
        unsigned_add_overflows(sb->len, extra + 1))
          die("you want to use way too much memory");
https://github.com/git/git/blob/6d5b26420848ec3bc7eae46a7ffa...

https://github.com/git/git/blob/9d418600f4d10dcbbfb0b5fdbc71...

> making such assumptions can be a bad idea

It's always a bad idea, especially in an unsafe language. Never trust user input.


> Of the solutions described above, the memccpy function is the most general, optimally efficient [...]

This does not seem to be the case for me AT ALL. strcpy for example, is a lot faster than memccpy. Here are my results:

  $ gcc -O0 bench.c && ./a.out
  memccpy: 0.008405
   strcpy: 0.002913
  
  $ gcc -O3 bench.c && ./a.out
  memccpy: 0.007933
   strcpy: 0.002590
  
  $ clang -O0 bench.c && ./a.out
  memccpy: 0.008771
   strcpy: 0.003225

  $ clang -O3 bench.c && ./a.out
  memccpy: 0.007966
   strcpy: 0.000383
  
  $ musl-gcc -O0 -static bench.c && ./a.out
  memccpy: 0.007849
   strcpy: 0.005647

  $ musl-gcc -O3 -static bench.c && ./a.out
  memccpy: 0.005754
   strcpy: 0.005625
  
  $ tcc bench.c && ./a.out
  memccpy: 0.014252
   strcpy: 0.004045
Source code can be found here: https://slexy.org/view/s2EHngPvDh

---

The differences seem to be quite interesting. Did I mess up the code? Compare gcc -O3's strcpy and clang -O3's strcpy: 0.002590 vs 0.000383! musl-gcc on the other hand has much more similar results.

---

  $ gcc --version
  gcc (GCC) 9.1.0
  Copyright (C) 2019 Free Software Foundation, Inc.
  This is free software; see the source for copying conditions.  There is NO
  warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

  $ clang --version
  clang version 8.0.1 (tags/RELEASE_801/final)
  Target: x86_64-pc-linux-gnu
  Thread model: posix
  InstalledDir: /usr/bin
  
  $ tcc -v
  tcc version 0.9.27 (x86_64 Linux)


>Did I mess up the code?

You certainly did; both your benchmark functions overflow their internal buffers.

However, it's not all that improbable for memccpy to be slower than strcpy in this case, even if it was used correctly; after all, it does more, and by doing so would prevent the buffer overflow if supplied with correct arguments (specifically, the last one). As to how much slower, I cannot tell. Also, you're skipping over half the point of the article by using a (fixed) source string, the size of which is known in advance.

In general, I would not benchmark library functions on local variables (buffers) without observing their results. It's far too easy for the compiler to remove the call altogether when said removal doesn't make any difference on the output.


> You're skipping over half the point of the article by using a (fixed) source string, the size of which is known in advance.

You are right. I should have focused more on that instead; that seems more relevant to why the author of the article is suggesting memccpy. I am curious as to whether or not it really is the case that memccpy is "optimally efficient" in practice over the alternatives. Would you like to prove or disprove that statement yourself? I modified the code a bit; it uses strlen to calculate the length of the string passed, and the string is argv[1]. memccpy is still just as slow. Is this a more acceptable approach to you? In this case we know neither the string nor its length in advance. strcpy still outperforms memccpy. Is this sufficient to disprove the claim that memccpy is "optimally" efficient compared to the other alternatives? The other criterion was being widely adopted, in which case, well, strcpy also looks good. Moreover, as dwheeler pointed out, memccpy is a tad difficult to use in practice. I will give strlcpy a try, too, since I prefer that over strcpy. In any case, I am not convinced that these criteria hold true for memccpy over the alternatives.

> on local variables (buffers) without observing their results

What do you mean exactly? I did observe the results of the buffer. See the printf, or are you not referring to that?

> both your benchmark functions overflow their internal buffers.

Would you please elaborate on it, and its relevance? Are you referring to N being too high?


>Would you please elaborate on it, and its relevance? Are you referring to N being too high?

No, the stack size is implementation-defined anyway. Instead, you have a classic off-by-one error because you didn't reserve any space for the final null terminator.

Correctly used memccpy would protect against an issue like this, although the destination string would not be correctly terminated, as it's not a safe string function. Also, if your memccpy version had the correct arguments inside the loop, you wouldn't have needed the extra call before the loop to hide the issue, as the memccpy call would have been functionally identical to strcpy except for the last pass of the loop.

>What do you mean exactly? I did observe the results of the buffer. See the printf, or are you not referring to that?

At least in the link you provided, all the printf's that would observe the contents of buf after the loop are commented out. No observable change happens in the execution of the program even if your compiler decides to just remove any calls to strcpy or memccpy.

--

That being said, strcpy is quite efficient at what you're benchmarking; that is, "multiplying" short strings. The task doesn't highlight its shortcomings. (strcpy wouldn't be too bad even if the strings were longer, although memcpy might be slightly faster.)

But consider the following silly example (not checked for errors) that does highlight the issue:

  char *next_insert;     // current write position in the output buffer
  size_t remaining_size; // bytes left in the buffer, including the terminator's

  // allocate_more() is assumed to grow the buffer and update
  // next_insert/remaining_size, accounting for everything written so far.

  void append_memccpy(const char *str)
  {
    char *tmp = memccpy(next_insert, str, '\0', remaining_size); // single pass over str
    if (tmp) {
      --tmp; // move pointer back to the terminator from one past it
      remaining_size -= tmp - next_insert;
      next_insert = tmp; // the next append overwrites the terminator
    } else { // insufficient size remaining
      str += remaining_size; // first remaining_size bytes are already copied
      allocate_more();
      append_memccpy(str);
    }
  }

  void append_strcpy(const char *str)
  {
    size_t len = strlen(str); // first pass over str
    if (len + 1 < remaining_size) {
      strcpy(next_insert, str); // second pass over str
      remaining_size -= len;
      next_insert += len; // the next append overwrites the terminator
    } else {
      allocate_more();
      append_strcpy(str);
    }
  }
Now, even though the latter version is extra silly (just to resemble the former more), it doesn't change the fact that with strcpy, we have to process each byte in str twice. If str is long enough, that might not be exactly free.


Thank you for the reply. I did think about the significance of the string's length, but I was too lazy to benchmark that. Perhaps another time. Theoretically, for the reasons you mentioned, memccpy should perform better on larger strings, but I am not sure whether that really is the case in practice (slow implementations of memccpy, lack of compiler optimizations, etc.), and it seems that "stick to memccpy" is not a universal rule (obviously). :D


Note that I mentioned the ordinary memcpy (with a single c) there briefly. memccpy should under no circumstances be faster than strcpy for "string multiplying", as it holds no advantages over it in that use.


Ouch, my mistake.

In any case, could we sum it up? In what cases should memccpy be used over, say, str{n,l}cpy, or even memcpy, and does that conflict with the article's recommendation or its claims about memccpy's performance relative to the alternatives?


As far as stack overflow goes, you could just increase the stack size or make N smaller. That is beside the point; I intentionally avoided dynamic memory allocation. Also, even if I put everything (including the strlen) inside the loop, memccpy is still much slower. I still have not examined the assembler code.


strcpy is optimized; memccpy is not.

First, use -march=native on a recent clang (>5) and see if clang can optimize memccpy by itself; gcc probably cannot.

And BTW, gcc-9 is still broken and should be blacklisted everywhere.


Thanks for the tip. I made some changes to the source code so the one above is a bit outdated. I should probably try the two different implementations found on the website, too.

> gcc-9 is still broken and should be blacklisted everywhere.

Yes, it seems to be the case. I ran into a few peculiarities, to say the least.


Looks like something was optimized away.


It could be, yeah, I did not have the time to check out the assembler code. I will do it tomorrow, but someone else may do it before me and elaborate on the reasons. :)


> The committee chose to adopt memccpy but rejected the remaining proposals.

Is that the case? Reading the updated standard draft[0], they also included strdup and strndup. Maybe they rejected them first, then chose to add them later.

[0] http://www.open-std.org/jtc1/sc22/wg14/www/docs/n2385.pdf


Why not abandon \0-terminated strings and pass the length of the string as an additional parameter?



Pascal strings prefix the string with its length. What I'm arguing for is more like fat pointers, where the length is stored at the same location as the pointer, something that is already standard practice in C for binary data.
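
A minimal sketch of that shape (the names are made up for illustration):

  #include <stddef.h>
  #include <string.h>

  // The length travels with the data pointer, like the (buf, len)
  // pairs commonly passed for binary data.
  struct str {
    char  *data;
    size_t len;
  };

  // Wrapping a C string costs one strlen, but only once; every later
  // operation can use the stored length instead of rescanning.
  struct str str_from_cstr(char *s)
  {
    struct str r = { s, strlen(s) };
    return r;
  }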


Because such free-floating length variables are ambiguous (which string do they belong to?), and you can lie about the length (unintentionally, too).


Efficiency looks past current deficiency.

We have the empty string: "\0"

We have the null string: NULL

There is no concept of an INVALID string, as float has NAN.

This would be the result of trying to copy a string to a buffer that is too small.

Or sprintf() into a small buffer.

Or a raw string that gets parsed as UTF-8 and turns out to be invalid.

Correctness over efficiency.


I'd argue that an invalid string concept would be neither correct nor efficient. Why should all code that deals with strings carry the burden of fallibility of a subset of string functions?

You've mentioned NaN propagation in another comment and I think that's a perfect example of the problem with this approach. Sorting a vector of arbitrary floats is a notoriously thorny problem because any float could be NaN, and as NaN is incomparable to any other float, there is no total ordering of floats. There is no general solution to this problem that doesn't involve making assumptions that could be faulty for some applications.
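
To make that concrete, a sketch of the two comparator choices (the "NaNs last" policy is just one possible assumption, not a universal answer):

  #include <math.h>

  // Naive comparator: for NaN, both a < b and a > b are false, so NaN
  // compares "equal" to everything. That relation is not transitive,
  // violating qsort's requirement of a consistent ordering.
  int cmp_naive(const void *pa, const void *pb)
  {
    double a = *(const double *)pa, b = *(const double *)pb;
    return (a < b) ? -1 : (a > b) ? 1 : 0;
  }

  // One possible workaround: sort NaNs to the end. An application
  // decision; some programs will want something else.
  int cmp_nan_last(const void *pa, const void *pb)
  {
    double a = *(const double *)pa, b = *(const double *)pb;
    if (isnan(a)) return isnan(b) ? 0 : 1;
    if (isnan(b)) return -1;
    return (a < b) ? -1 : (a > b) ? 1 : 0;
  }

  // usage: qsort(arr, n, sizeof(double), cmp_nan_last);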


Please support your argument against correctness by providing an example where an INVALID string as input to a suitably modified generic string function would result in a valid string.


What is length of an invalid string? What is the length of the concatenation of two invalid strings?

There are sensible answers. But they are weird.


Is it more sensible to cat 2 strings, but cut off the second one, then pass off the result as valid?

I would say let an INVALID string have length 0. Then accept that catting a valid and an invalid string results in a shorter length.

Which one do you think is safer?


I would expect an invalid string to have an invalid length. For integer-valued lengths you'd have to use a negative number to differentiate from a valid, empty string. But then the sum of the invalid-string lengths differs from the length of the concatenated invalid strings. Which is wonky.


Safe string manipulation never exceeds the bounds of the buffer, so negative values are dangerous, as are any additions that would exceed the maximum size.

Negative lengths are not compatible with an unsigned representation.

A system implementing invalid string values must choose a text encoding, such as UTF-8, that supports the concept of an invalid character. Null termination is too flexible, as is simple length prepending.


It's not an "argument against correctness"; it's an argument against what you are proposing.


I don't understand the fallibility objection. Clearly, misuse of string functions is epidemic. A propagating INVALID string result makes it very clear that there is a logic error rather than an exploit.

I understand how one could shoot down implementations, but no one has made a convincing argument for shooting down the idea.


I wouldn't want yet another special case to test against (empty string / null string / 'invalid' string). Why can't those operations just return error codes instead? And what about memcpy: if you memcpy into a buffer that's too small, does it write an 'invalid buffer' value instead?


Propagating NAN is an elegant method in floating point and makes sense for well defined string encodings like UTF-8.

memcpy and company are strictly for raw unencoded buffers.


> This would be the result of trying to copy a string to a buffer that is too small.

C doesn't have the notion of "buffer size" (yes, arrays have a size that can be queried with sizeof, but only at compile time). You would have to fix that first.
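
For instance:

  #include <stdio.h>

  void f(char *p)
  {
    printf("%zu\n", sizeof p);   // size of a pointer (e.g. 8), not the buffer
  }

  int main(void)
  {
    char buf[64];
    printf("%zu\n", sizeof buf); // 64, but only because it's known at compile time
    f(buf);                      // the size does not travel with the pointer
  }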


> There is no concept of an INVALID string, as float has NAN.

Isn’t that just NULL?


NULL is the lack of any string. If one views a string as the result of an operation, then an INVALID string is the consequence of bad input to an operation.


Why can’t NULL serve as the invalid string in this case? It’s clearly not a valid string that an operation will return.


If you have studied Computer Science, you should know that the null string is quite a valid string.

Let's take strstr, which finds a matching substring needle in a haystack string.

- returns NULL if the needle is not in the haystack

- returns a pointer to the first matching substring

Extend strstr with VALIDITY

Understood behaviour if both are valid.

Say the haystack is INVALID: as the return value is either NULL or a strict substring of the haystack, it should return INVALID. A poisoned haystack should poison dependent strings.

Say the haystack is valid but the needle is INVALID: it should return NULL. A valid string never contains an INVALID string as a substring.
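
A hypothetical sketch of those rules with a tagged string type (nothing like this exists in the C library; the names are made up):

  #include <string.h>

  typedef struct {
    const char *s; // NULL means "no string"
    int valid;     // 0 means INVALID (poisoned)
  } vstr;

  vstr vstrstr(vstr haystack, vstr needle)
  {
    if (!haystack.valid)                    // a poisoned haystack poisons the result
      return (vstr){ NULL, 0 };
    if (!needle.valid || !haystack.s || !needle.s)
      return (vstr){ NULL, 1 };             // a valid "not found"
    return (vstr){ strstr(haystack.s, needle.s), 1 };
  }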


Here's my behavior:

  strstr(NULL, /* valid string */)
I can't find the needle in the haystack (actually, I can't find anything in the haystack. I can't find the haystack.) Thus I return NULL.

  strstr(/* valid string */, NULL)
I can't find the needle in the haystack (actually, I wouldn't be able to find it: I don't know what I'm looking for.) Return NULL.
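
That behavior wraps up into a few lines (strstr_safe is a made-up name):

  #include <string.h>

  // NULL-tolerant strstr: no haystack, or no needle, means nothing is found.
  char *strstr_safe(const char *haystack, const char *needle)
  {
    if (haystack == NULL || needle == NULL)
      return NULL;
    return strstr(haystack, needle);
  }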


Your explanation is not inconsistent with my proposal, but you don't seem to grasp VALIDITY.


That web page's color choices have made it very difficult to read. I don't know who thought putting light grey text on white was a good idea. I had to copy and paste the text to a text editor in order to read it.


I usually just press Ctrl+A to select all in these situations to make the text readable.


Does every org have their own C string library or does it just feel like it?


The last job I had working in C didn't; we leaned heavily on libraries for stuff like strings, logging, hashtables, and serialization. Implementing that stuff yourself is either a big timesink or just asking for bugs and security issues.


And their own logging, hash map, and list utilities.

Admittedly, people seem to write string and logging libraries even in languages that do provide them.


And their own serialization format


OT: Is there any API out there that implements say a length* prefix in all C strings?

* 2 or 4 octets size


Yes, those are a dime a dozen; GString[1] is one example. Rather than strict "Pascal strings", which the sibling comment rightly points out would be stupid, they're a light struct with a char pointer and a size_t len (and usually also a size_t alloc_len). This allows skipping strlen() on them before e.g. copying, while still using them as C strings by getting directly at the char data.

1. https://developer.gnome.org/glib/stable/glib-Strings.html#GS...
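
A rough sketch of that shape (not GLib's actual API, just the idea):

  #include <stdlib.h>
  #include <string.h>

  typedef struct {
    char  *str;       // NUL-terminated for C interop
    size_t len;       // bytes used, excluding the NUL
    size_t alloc_len; // bytes allocated
  } dynstr;

  // Appending needs no strlen: both lengths are already known.
  void dynstr_append(dynstr *d, const dynstr *s)
  {
    if (d->len + s->len + 1 > d->alloc_len) {
      d->alloc_len = (d->len + s->len + 1) * 2;
      d->str = realloc(d->str, d->alloc_len);    // error handling omitted
    }
    memcpy(d->str + d->len, s->str, s->len + 1); // copies the NUL too
    d->len += s->len;
  }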


What you're talking about typically goes by the name of "pascal strings", and while they're possible to do, C's string literals are not compatible with them, so nobody does it.


It is certainly possible to declare Pascal string literals with not much hassle: https://stackoverflow.com/questions/7648947/declaring-pascal...

One of the answers states that GCC and Clang do have support for Pascal strings.

These strings probably do not work (as well) in #defines, i.e. they may not concatenate like regular literals do.
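
One portable trick along those lines, using sizeof to get the length at compile time (PSTR is a made-up macro; the compiler-specific Pascal-string literals mentioned above are another route):

  #include <stddef.h>

  struct pstr { size_t len; const char *s; };

  // sizeof on a string literal includes the NUL, so subtract one.
  // Requires C99 compound literals; adjacent-literal concatenation
  // still works inside the macro, since it happens before sizeof.
  #define PSTR(lit) ((struct pstr){ sizeof(lit) - 1, (lit) })

  // usage: struct pstr p = PSTR("hello");  // p.len == 5, no runtime strlen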


sds (simple dynamic strings) is a decent compromise: https://github.com/antirez/sds (used in redis)


The PJ-SIP project defines their own length-based string wrapper for C strings.

https://github.com/pjsip/pjproject https://www.pjsip.org/pjlib/docs/html/structpj__str__t.htm


sds comes to mind, but there are like gazillion different implementations of the same concept.

https://github.com/antirez/sds


Obligatory sidenote on the website itself rather than the content: please don't set such a low contrast between the font color and the background color. I had to copy/paste this article to read it.


Use the developer tools to inspect the CSS.

1. Uncheck

    article .entry-content {
        color: #646464;
    }
And a new CSS color rule appears.

2. Uncheck

    body {
        color: #333;
    }
Done!

I used to think it was ridiculous to manipulate a webpage like this manually, but now I believe: if it helps you get through a one-time visit to a broken webpage, why not?


yah, thin light grey sans-serif body font... "Why do you hate my eyes?"



