Show HN: A nice C string API

cassepipe · on Dec 3, 2022

I always use antirez's (Redis creator) `sds` and advertise it whenvever I get the chance. Thanks to the someone who recommended it on HN some years ago. It's a joy to use.

https://github.com/antirez/sds

The trick is the size is hidden before the adress of the buffer.("Learn this one simple trick that will change your life for ever").

From the Readme:

```

Advantage #1: you can pass SDS strings to functions designed for C functions without accessing a struct member or calling a function

Advantage #2: accessing individual chars is straightforward.

Advantage #3: single allocation has better cache locality. Usually when you access a string created by a string library using a structure, you have two different allocations for the structure representing the string, and the actual buffer holding the string. Over the time the buffer is reallocated, and it is likely that it ends in a totally different part of memory compared to the structure itself. Since modern programs performances are often dominated by cache misses, SDS may perform better in many workloads.

```

lifthrasiir · on Dec 3, 2022

I don't like SDS for multiple reasons. My biggest complaint is that it's a data structure disguised as a single naive pointer, which is actually harder to use correctly. This kind of "masquerading" pointer is conceptually a linear type, as you can't safely change its length in place and any potential change has to return the modified pointer somehow. No other type in C behaves like this, resulting in more confusion and thus more errors. And I have more counterpoints to those self-claimed advantages as well:

Counterpoint #1: You can't pass SDS strings to functions that accept `char **` (which is a common way to return a string of unknown length, and often can act as an in-out parameter as well).

Counterpoint #2: You rarely access individual "characters" (whatever this means). It is a conscious decision to whether you should iterate over bytes or Unicode scalar values or code points or grapheme clusters, and for this reason it is better to make the decision explicit even though it's C `char` in the surface level.

I have no evidence for nor against advantage #3 though.

kevin_thibedeau · on Dec 3, 2022

> No other type in C behaves like this,

Malloc does this.

lifthrasiir · on Dec 3, 2022

Malloc is not a type. If you meant to say realloc, a good point and it's indeed a bad interface for the exact reason but still it's not a type.

Karellen · on Dec 3, 2022

I believe malloc() was intended, as a number of old-school UNIX implementations of malloc() put the size of the allocation (and possibly other bookkeeping info?) "in front of" the pointer returned, in a similar way to how sds stores the size of its buffer.

WalterBright · on Dec 3, 2022

> The trick is the size is hidden before the adress of the buffer.("Learn this one simple trick that will change your life for ever").

The length-prefix string has a major problem - it cannot be sliced to produce another length-prefix string. It has to be copied. Instead, using a phat pointer (size_t length, char* ptr) works very, very well. We've been using it in D for 20 years.

I've proposed it for C, too:

https://www.digitalmars.com/articles/C-biggest-mistake.html

quelsolaar · on Dec 3, 2022

The problem with arrays is not that they decay to pointers, its that they arent pointers to begin with. This:

int x[10];

Sould mean "put 10 ints in memory, and make x the pointer to it.". The thing that messes this up is sizeof. sizeof(x) doesnt give the size of the pointer like it should, it gives you the size of the array. If that was fixed (obviusly it can without breaking everything) then things would be much better and consistent.

WastingMyTime89 · on Dec 3, 2022

Yes, sizeof has a primitive as a weird behaviour when used in the scope of a continuous allocation. I agree it’s unfortunate.

But "arrays" definitely are pointers. I put "arrays" in quote because C has nothing I would personally call an array. It’s just contiguous memory allocation. It’s to the point that 10[a] and a[10] are desugared to the same thing.

torstenvl · on Dec 3, 2022

> it cannot be sliced to produce another length-prefix string

Come again? Of course it can. It can't be done in place, mind you, but that's a pretty bad way to do any string slicing, regardless of implementation, in a manual memory management environment. Do most programmers expect their slices to result in undefined behavior if they release the larger string they were made from? I doubt it.

alcover · on Dec 3, 2022

> Come again? Of course it can.

Oh come on.. I'm pretty sure Walter meant taking a view kind of slice. Obviously one can always copy part of a string, but that's not what slice implies I think.

> It can't be done in place, mind you, but that's a pretty bad way to do any string slicing, regardless of implementation, in a manual memory management environment.

It's not bad. It's the best, most efficient way. O(1)-ish.

> Do most programmers expect their slices to result in undefined behavior if they release the larger string they were made from? I doubt it.

That's what copy-on-write is for : release of the parent is blocked until no views are left on it.

I made a C String lib using CoW. It works well:

https://github.com/alcover/buffet

WalterBright · on Dec 3, 2022

The length-ptr strings are very fast and efficient:

1. don't have to make a copy

2. don't have to scan to find the length

3. don't have to load the string into the cache to find its length

As a bonus, they're memory safe.

masklinn · on Dec 3, 2022

> The trick is the size is hidden before the adress of the buffer.("Learn this one simple trick that will change your life for ever").

This has drawbacks:

1. you can't convert an existing buffer to an sds buffer

2. you can't slice into a buffer, because the metadata is part of the string's buffer (even if it's before the pointer)

_448 · on Dec 3, 2022

> The trick is the size is hidden before the adress of the buffer.

That is how strings use to be stored before C made the choice of using null-terminator. Pascal stored the string size before the string data. The advantage of relying on a terminator symbol is that the string size can be any length where as storing the size at the start forces the string to not exceed certain size.

drfuchs · on Dec 3, 2022

Nit: Many Pascal compilers / runtimes extended the language in non-standard ways, including various schemes for storing string length in front of the string. But nothing like this was ever part of the ISO Pascal standard, and it was certainly not in the "PASCAL User Manual and Report" by Kathleen Jensen and Niklaus Wirth.

In fact, in standard Pascal, string handling is extremely rudimentary; there was no way to express "this variable / parameter / pointer refers to a string with a length not known at compile-time".

pjmlp · on Dec 3, 2022

They were on the ISO Extended Pascal, which hardly mattered because by then, USCD Pascal and Object Pascal already had taken over the world of Pascal dialects, both of which had better ways to deal with strings.

https://www.iso.org/standard/18237.html

Additionally, Modula-2 was already available in 1978, sorting out all the issues of original Pascal, with all the features needed for a safe systems programming language in the late 70's.

drfuchs · on Dec 3, 2022

In the late 70’s, there were production-quality Pascal compilers for DEC 20 / ITS / SAIL, Vax/VMS, IBM 360/370, together covering much of academic computing and most of the ARPAnet. Even consulting Wirth, Knuth couldn’t find suitable Modula-2 compilers available for these, so TeX used Pascal and not Modula-2. Near as we heard, it was only ever seriously used on the niche ETH workstation?

pjmlp · on Dec 4, 2022

That is like complaining that C wasn't available in plenty of computer hardware outside Bell Labs until early 1980's, after UNIX V6 release, was it ever seriously used outside AT&T research units?

Do you seriously expect in two years left of 70's for Modula-2 to be available everywhere?

drfuchs · on Dec 4, 2022

No, I was simply responding to your statement that "Modula-2 was already available in 1978". This phrasing might give the wrong impression; Modula-2's practical availability was extremely limited. No doubt it was a fine language, but it never caught on to anywhere near the extent Pascal had, before C proceeded to take over the world. If you wanted to write run-everywhere software, Modula-2 was never a tenable choice.

But we digress. The topic at hand was "How does Pascal store strings?" I stand by the pedantic statement that ISO Standard Pascal does not require string lengths to be stored with the string (nor anywhere else); there's no way for the programmer to obtain the length of a string at compile-time, never mind run-time, without resorting to some sort of compiler-specific language extension (most (all?) of which did indeed put the string length in front). Conversely, the Extended Pascal standard pretty much requires strings to instead be implemented as "fat pointers," consisting of a length and a pointer to the actual characters. I say this because substring operations return references to pieces of the original string, thus you can't stuff a length in front of the actual characters of the substring, as this would over-write characters in the original string.

anonymoushn · on Dec 3, 2022

In execution environments with 64-bit pointers you may have trouble loading a string of more than 16 exabytes into RAM anyway

thaumasiotes · on Dec 3, 2022

> The advantage of relying on a terminator symbol is that the string size can be any length where as storing the size at the start forces the string to not exceed certain size.

In the same way that since we identify unicode code points with a 16-bit value, it's impossible to include U+1D460 in a string?

In the same way that since Matroska files encode the length of their segments, there's a hard upper limit on the length of a segment?

Of course none of those things is actually true. Storing the string size has no implications for how long the string can be. It requires an amount of space, to store the string size, that is logarithmic in the length of the string, and completely insignificant.

jstimpfle · on Dec 3, 2022

For sake of simplicity, and for efficiency with really small strings, with a length-prefixed string representation you really want to keep the string length field fixed-size. In general.

thaumasiotes · on Dec 3, 2022

Really small strings have a fixed-size length field in any variable-size encoding of the length. They're small, so they fit into whatever the smallest possible length field is.

What do you gain in handling short strings from an inability to handle long ones?

jstimpfle · on Dec 3, 2022

Ok I give you this one, but I still don't think that minimizing the size of a length field using a flexible width encoding is a good idea except when talking about extremely specialized string encodings (like compression schemes).

Flexible width encoding is more complicated compared to simple member access to get at the first character. And how do you handle construction of a string whose size you don't know yet? You might have to move the string away to make space for a bigger string length field. I don't like it.

thaumasiotes · on Dec 3, 2022

> Flexible width encoding is more complicated compared to simple member access to get at the first character.

I don't think this is true either. It's almost true. But what happens if the string length is 0?

If you make the assumption that you can access the first character of a zero-length string by just grabbing whatever is in memory after the string header, you're going to make the exact mistake the length field is there to stop you from making, a memory access violation. You have to process the length field in order to do any access at all; many strings don't have a first character.

> And how do you handle construction of a string whose size you don't know yet? You might have to move the string away to make space for a bigger string length field.

That's true; you'll either need to be willing to store the character data and the length metadata in separate locations, or you'll need to be willing to occasionally move the data around.

jstimpfle · on Dec 3, 2022

Obviously I mean get at the address of the first character, if any. You can't load before you know that what you load is valid. Btw. zero-terminated strings allow you to load unconditionally. Sometimes that's nice.

thaumasiotes · on Dec 3, 2022

OK, but now the difference in how complicated it is to read from the string boils down to this:

    1. Read the first chunk of the string length.
    2. Is it more than 0?

vs

    1. Read the first chunk of the string length.
    2. Did we get the whole thing?
    3. Is the length more than 0?

That extra step in the variable-length case means checking whether a bit is set in the value you just read.

---

Also, it occurs to me that this whole discussion is talking about how to serialize or deserialize a string, when the original discussion is over how the string should be represented in memory.

lelanthran · on Dec 3, 2022

> In the same way that since we identify unicode code points with a 16-bit value

Not a single 16-bit value. Some codepoints are two 16-bit values.

kevin_thibedeau · on Dec 3, 2022

Codepoints are 21-bit values. They may have a more compact encoding but the unencoded form is fixed.

tored · on Dec 3, 2022

Another is Hollerith strings that was used by FORTRAN and TCP protocols.

https://en.wikipedia.org/wiki/Hollerith_constant

jstimpfle · on Dec 3, 2022

The biggest advantage of zero-terminated to me is simplicity, next would be efficiency for really small strings - although this is a fringe concern. Strings with explicit length should at least have a 32-bit length field (maybe 64) IMO - for example, it's common to read files (and store them in contiguous memory) that are larger than 64K.

lifthrasiir · on Dec 3, 2022

Most memory allocators have an internal fragmentation which removes most efficiency gained by zero-termination. In fact it's worse, because zero-termination means that deallocation can't take a size parameter and it can often cause a performance hit for many modern allocators due to cache misses [1].

[1] https://isocpp.org/files/papers/n3778.html

jstimpfle · on Dec 3, 2022

zero-terminated says nothing about allocation, so there's that.

lifthrasiir · on Dec 4, 2022

You were talking about [space] efficiency for small strings, which is inherently related to the memory allocation scheme.

jstimpfle · on Dec 4, 2022

What? You can allocate zero-terminated strings in .rodata by typing "a string literal" in source code, for example. It's fair to say that there's no fragmentation or whatever and you don't have to think about deallocation either.

Other popular approaches to allocate strings dynamically would generally group strings of similar lifetime or by size, amortizing possible overheads over many strings.

If your favorite memory allocator requires a size field to support efficient deallocation, then by all means put that in the allocation record. But this is totally orthogonal to the format of a simple string representation.

jbverschoor · on Dec 3, 2022

Well you could easily use the first 4 bits to indicate how many bytes the length is + 1.

      c0 (0b0000) -> length is 0xc = 12
    1341 (0b0001) -> length is 0x134 = 69,940
  239a42 (0b0010) -> length is 0x239a42 = 145,828

  deadbeaf239a47 (0b0111) -> length is 0xdeadbeaf239a4 = 3917404957718948.

This gets you a 7-byte = 56bit number, minimal overhead for smaller strings.

Maybe reserve 0x1111 for future use.

Maybe the other endian makes more sense here, and maybe 0 should mean zero-length.

It's probably not very performant compared to other solutions, but you can just shift 4 bits, and you're done

I'm curious how many strings are allocated at a particular point in time (across everything, kernel, os, apps, etc)

jstimpfle · on Dec 3, 2022

These considerations can make sense when thinking about storage formats (probably you want to compress the string too), but they are not convenient for in-memory representation where you want to get the location of the first character with a simple member access.

jbverschoor · on Dec 3, 2022

It starts at the start of the frame + the first nibble + 1

thom · on Dec 3, 2022

SDS supports 64-bit lengths. It also dynamically changes the size of its size/flags field to accommodate growth. The minimum overhead is an extra char (same as null termination).

abcd_f · on Dec 3, 2022

The length can be packed, e.g. like utf-8 does it or something similar. The caveat is the cost of unpacking on access, but the memory overhead will be minimal.

aap_ · on Dec 3, 2022

Null-termination was not a C invention.

mh7 · on Dec 3, 2022

strlen() returns a size_t so you're already constrained to a maximum length of SIZE_MAX.

jstimpfle · on Dec 3, 2022

This is hilarious. SIZE_MAX is at least as large as the largest string that you can put in your address space / memory anyway. Which is what the strlen() API already assumes.

That, plus you'd be a fool to store a huge string in this way anywhere (in or out of memory) in any case.

Someone · on Dec 3, 2022

> SIZE_MAX is at least as large as the largest string that you can put in your address space / memory anyway.

Not necessarily. A 64-bit system could give processes an address space that’s significantly larger than half the full 64-bit address space and have an allocator that allows you to allocate a block of more than SIZE_MAX bytes (malloc takes a size_t, but you can use calloc)

jstimpfle · on Dec 3, 2022

This doesn't make sense to me. You can't "allocate" more than SIZE_MAX bytes by definition. If you take "allocate" to mean "make it available in the process's address space", that is.

unwind · on Dec 3, 2022

Are you sure?

The calloc() [1] function mentioned above takes two values of type size_t, and allocates their product bytes.

I'm on mobile without (!) the C99 draft spec but at least the man page gives no such restriction.

[1] https://linux.die.net/man/3/calloc

jstimpfle · on Dec 3, 2022

How would it be possible to allocate more address space than is addressable?

calloc returns NULL when can't satisfy the request. The idea of taking two arguments is not to allow the user to specify a larger requested size, but to protect against overflows as it can happen with e.g. malloc() where the user has to compute the size of arrays by multiplying NUM_ELEMS * SIZE_PER_ELEM. And the user will normally do so less carefully than a library function.

mek6800d2 · on Dec 3, 2022

I read something about this recently, somewhere, maybe HN. Specifically, in calloc(), what is done and what should really be done if the multiplication overflows. As will happen, for example, if you try to calloc() two elements of size SIZE_MAX, when SIZE_MAX is the maximum representable unsigned integer value on the machine. So, I don't think calloc() is available or intended as a way to circumvent malloc()'s size restriction.

Someone · on Dec 3, 2022

I stand corrected. Initially, I thought that, even if it calloc can’t, an OS could provide a different way to obtain a pointer to a memory region that’s larger than SIZE_MAX.

However, the standard says (https://en.cppreference.com/w/c/types/size_t):

“size_t can store the maximum size of a theoretically possible object of any type (including array).”

and (https://en.cppreference.com/w/c/language/pointer):

“Pointer is a type of an object that refers to a function or an object of another type, possibly adding qualifiers. Pointer may also refer to nothing, which is indicated by the special null pointer value.”

⇒ pointers must either be null or point to an object, and objects aren’t larger than SIZE_MAX, so I think having a pointer pointing to a block larger than SIZE_MAX violates the standard.

jstimpfle · on Dec 3, 2022

> pointer pointing to a block larger than SIZE_MAX violates the standard.

it's simply not possible, by definition.

Karellen · on Dec 3, 2022

size_t is unsigned, right? ssize_t is the signed version?

On a quick test on my 64-bit system, a C program doing `printf("%zu\n", SIZE_MAX);` outputs 18446744073709551615, which looks like (2^64)-1 to me.

Or is there a thing in the standard that says this isn't always the case?

mananaysiempre · on Dec 6, 2022

No, ssize_t is not the signed version. As best as I can tell, the only things POSIX says about ssize_t is that[1] it is an integer type that can hold integer values in [-1, SSIZE_MAX], where[2] SSIZE_MAX ≥ _POSIX_SSIZE_MAX = 32767, not that it should have any particular relation to size_t. In the standard, it is used for byte counts in I/O, like the return value of read() (traditionally int), for the return value of strfmon() and strfmon_l() (OK I guess, though the C standard stuck with int for *printf()), and for the argument to swab() (wat).

Note that neither is ptrdiff_t guaranteed to be that signed version, or to hold any possible value in the domain of size_t or (strictly speaking) any possible object size. Both GCC and Clang assume the latter, though, and can miscompile[3] code that relies on (e.g.) malloc() succeeding for sizes > 2^31 on a 32-bit system.

[1] https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/sy...

[2] https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/li...

[3] https://trust-in-soft.com/blog/2016/05/20/objects-larger-tha...

arcticbull · on Dec 3, 2022

ssize_t is a weird one, the only negative value it is guaranteed to store is -1.

> The type ssize_t shall be capable of storing values at least in the range [-1, {SSIZE_MAX}].

[1] https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/sy...

jstimpfle · on Dec 4, 2022

size_t need only be large enough to cover the (virtual) address space. It's up to hardware and OS to decide how much addressable space you get. I believe current systems can use only the low 48 bits of 64-bit pointers. However that number is likely to be increased in the future and OSes would be unwise to define size_t as something smaller than 64 bits.

ahepp · on Dec 3, 2022

Isn't size_t defined as being able to fit the largest possible data allocation?

pjmlp · on Dec 3, 2022

Indeed, you just need to forget to put a terminator to get a nice memory dump.

jcelerier · on Dec 3, 2022

If you use a different data structure you would maybe use a different API for accessing it too

schemescape · on Dec 3, 2022

That sounds like the same allocation scheme as used in Microsoft’s BSTR type: https://learn.microsoft.com/en-us/previous-versions/windows/...

kazinator · on Dec 3, 2022

Thus, sds it cannot be used for the use cases that this library allows.

This library takes string slices without having to allocate or copy memory; it seems to be for use cases involving breaking down strings in complex ways, where good ergonomics and efficiency of obtaining a null-terminated C string are secondary.

quelsolaar · on Dec 3, 2022

Optimizing text is hard because you seldom up front know how much memory will be needed and allocations are slow.

Another way to do it is to use:

typedef struct{ size_t allocated; size_t used; char buffer[]; }String;

This lets the header and the string be the same allocation. Thats a huge saving. Its also useful to store the allocation and use size separatly so that you can reuse / modify buffers. The used field lets you use memcpy without looking for string termination.

You can make it even more complex by adding flags if the string is on the stack or on the heap. That way you can do things like:

String buffer = MACRO_TO_CREATE_STRING_BUFFER_ON_STACK(256), *b; b = &buffer;

b = do_processing_with_buffer(&b); // allocates on heap a larger buffer if needed

string_free(b); // frees buffer if its on heap.

jstimpfle · on Dec 3, 2022

In general it's hard to get more efficient than a simple struct String { const char *buffer; u32 size; }. Your method removes an indirection from the allocated storage, but you'd still need an external pointer to point to that struct in most cases. That, plus retrieving the size now costs an additional dereference. So I wouldn't use your method unless I knew that I'd have to reference the string from multiple locations.

The best way to be efficient is often to make assumptions about the data. Most strings don't need any dynamic allocation after having been "built". So it makes a ton of sense to make a string builder API that returns a final string when it's finished. In this way, you save at least the "allocated" member.

The advantage of the simpler string representation is that it works for any string (or substring) that is contiguous in memory, and is completely decoupled from allocation concerns. E.g. I can easily

    #define STRING(string_literal) \
        ((String) { string_literal "", sizeof string_literal - 1 })

, to be able to statically declare such strings like this:

    String my_string = STRING("Foo bar");

If you have many strings that you know are small, then just the normal nul-terminated C string (without any size field) is as storage-efficient as it gets.

In practice, I find string handling so easy that I rarely even define this struct String. I just pass around strings to functions as two arguments - pointer + size. It feels so light and data flows so easily between APIs, I love it.

jlokier · on Dec 3, 2022

On point 3, you can achieve the same cache locality, without losing the ability to take slices or append, by having the string object contain a pointer to the string bytes, and allocating the bytes by default immediately after the string object.

It is still single allocation, so the allocation is just as fast.

The pointer is in the same cache line as the string bytes in all strings except for slices (and any other fancy indirect string types). Even though the code fetches indirectly via that pointer, the CPU will be able to fetch the initial string byte efficiently as soon as it has the pointer.

estebank · on Dec 3, 2022

How would this colocation of the string pointer work? Because these would be in the heap, right? Otherwise the pointer would get invalidated as soon as the enclosing function ends and its stack frame gets discarded. So if it is in the heap then you either have a pointer to the colocated pointer (not very useful, if negligible performance impact) or you're copying the colocated pointer (at which point you're back to square one, having a pointer in the stack and the underlying string in the heap). Am I missing something?

alcover · on Dec 3, 2022

Good point. Some SSO (small str optimization) schemes achieve this by pointing back into the struct itself. Gcc String implementation for ex.

gorgoiler · on Dec 3, 2022

Whoa. Jamming metadata in the address space before the string pointer is such a clever idea. I don’t know enough about C to know how many awkward bugs this might cause, but I know enough about programming to spot exceptional lateral thinking when I see it. Very neat.

I guess the SDS authors might ship a linter to spot all the times you mistakenly use free() instead of sdsfree()? That could make the cleverness more tolerable?

arcticbull · on Dec 3, 2022

This is a common approach for things like malloc to use, since you are passing an opaque pointer to arbitrary data into free() which you then expect to quickly do something useful with. It can just walk back the pointer a little to find the header and act on it.

It's pretty weird to see it anywhere other than malloc though especially masquerading as a basic type. It's incompatible with other common patterns like returning via (char *) and you can't identify which deallocator you're supposed to give the result to from the type alone.

Gibbon1 · on Dec 3, 2022

Random musing of a old firmware guy. In the past I've had issues with wanting to make sure a function isn't being passed a pointer to an object on the stack. Least in embedded land it's trivial to write a function that can tell you if an address points to something on the stack, heap, or is a global.

deterministic · on Dec 4, 2022

This is actually a very well known C trick.

realgeniushere · on Dec 3, 2022

Makes me think less of antirez that he doesn’t acknowledge that this is the same design as Microsoft’s BSTRs, which predate sds by many many years.

dixie_land · on Dec 4, 2022

I also thought of BSTR but in all fairness this idea of prefixing metadata for string types aren't so unique: eg Delphi's string types use a similar scheme: https://docwiki.embarcadero.com/RADStudio/Alexandria/en/Inte...

One obvious advantage of such scheme of course is that you can pass them as-is to C APIs

mh7 · on Dec 3, 2022

Re #3:

A big downside is that you you can't easily take ownership of an existing buffer and treat it as this string type.

string s;

char some_buf[];

string_take(&s, some_buf, some_capacity);

Also you would of course never dynamically allocate string structs, just the data member if needed.

kazinator · on Dec 3, 2022

> Attempting to split a string using non-existent delimiter with str_pop_first_split() [returns an invalid string with .data == NULL].
But that seems like a valid case: e.g. these are comma-delimited lists of numbers:
"" // empty "1" // one number "20,30" // two numbers
the above remark in the documentation seems to be saying (perhaps falsely) that if we try to extract a token from the "1" string using "," as a delimiter, we get an invalid str_t rather than "1".

I don't see coverage for this in the tests. There is a test which uses "123/456/789", which extracts the first two splits, and then just verifies that "789" remains. What the programmer wants is to be able to write a loop which will extract "123", "456" and "789" and then* hit the terminating case where the invalid str_t is returned.

How many items are in "1,2,3," viewed as comma-separated: three or four?

It would also be a code improvement to replace umpteen repetitions of "(str_t){.data = NULL, .size = 0}" throughout the code with a macro.

mickjc750 · on Dec 5, 2022

Thank you kazinator! I quickly realised you were right about this. It's now fixed. I may have posted this project a little early, but on the other hand it's great to get others input and read all this discussion.

mickjc750 · on Dec 5, 2022

That is a good point... Perhaps str_pop_first_split() should pop, even if no delimiters are found. I'll give this some thought. I'll put that macro in too. Thanks.

andrewmcwatters · on Dec 3, 2022

I want C strings that are compatible with string.h.

I want some struct that is a pointer to the char array `s’ with size_t `n’.

To meaningfully do this, it means you need auxiliary functions that you execute after calling string.h functions, or you write wrappers that do this for you after calling the relevant string.h functions.

I’m OK with that.

SDS doesn’t do this. Most other C string libraries like this one basically do what I’m asking for, but not quite.

I don’t want separate structs for reading and writing strings. I just want authors to keep it as simple as possible without diverging too hard from how C strings already work today.

kevin_thibedeau · on Dec 3, 2022

I have a personal lib that works like this. It maintains a simple struct with a start pointer and a one-past-the-end pointer. You can use it to construct a view or point into unused space at the end of a string for building ops. NUL termination is preserved so interop with stdlib is always available.

This allows for nicer string handling while always allowing interop with anything expecting a char *. Libraries with their own string implementation always exact a penalty to get a cstr out.

program · on Dec 3, 2022

It's better not to use types that end with a '_t' because the suffix is reserved in POSIX systems.

https://pubs.opengroup.org/onlinepubs/9699919799/xrat/V4_xsh...

eps · on Dec 3, 2022

Yeah, that's the lowest-hanging C pedantry nitpick.

Usually if there's nothing else meaningful that one can say about someone else's project, they will comment on the _t naming... and as anyone with a yota of real-world experience would know it's a complete non-issue outside of a handful top-tier open source projects.

Don't be that guy. Save this comment for when it may actually be relevant.

inshadows · on Dec 5, 2022

You're idiot

dang · on Dec 6, 2022

You can't post like this here. We've banned the account.

If you don't want to be banned, you're welcome to email hn@ycombinator.com and give us reason to believe that you'll follow the rules in the future. They're here: https://news.ycombinator.com/newsguidelines.html.

kazinator · on Dec 3, 2022

I don't agree with that at all; it is "the sky might fall" reasoning.

Just

* have sane naming in your program.

* respect namespaces like _[A-Z] and __

* solve clashes that actually happen

Historically, revisions of POSIX have introduced identifiers that were not in any previously announced namespace. There is no way you can name an identifier that is guaranteed not to clash with POSIX, or any other vendor. For instance the name "openat" was fine to use in a POSIX program once upon a time.

Consider that all strings have the empty string as a suffix. The string "abc" has four suffixes: "abc", "bc", "c" and "".

So, every current and future POSIX identifier has "" as a suffix. This is not just a threat; it is guaranteed! Since every identifier in your program also has a "" suffix, it clashes with that namespace.

What's wrong with the argument is that identifiers don't just have a suffix; they have to be identical in order to actually clash. (Or have identical prefixes, due to truncation of external names in a linker: decades ago, the limits were ridiculously small.)

I doubt that even one person in POSIX standardization would be dumb enough to approve str_t being added as a typedef name in some existing or new header, and multiple approvals are required.

Nobody should be losing any sleep over using _t typedef names in their C code.

Kwpolska · on Dec 3, 2022

The argument with "" as suffix sounds quite absurd.

Why do you believe POSIX would never approve a str_t type? Nobody likes raw char arrays, perhaps a future revision of POSIX may decide to make the lives of C programmers easier and implement their own sane string type.

kazinator · on Dec 3, 2022

Yes, the argument is absurd; it's supposed to be.

Now extending the suffix to "_t" doesn't make it much less absurd. Not qualitatively, just a bit quantitatively less absurd.

Why I suspect POSIX isn't about to add a str_t is that str_t is likely to occur in countless numbers of unknown existing code bases.

And that might be a good reason for avoiding it in a library API, not the _t namespace being reserved.

We can have this variant of the argument: most identifiers end in a lower-case letter, so they land into any one of 26 namespaces: the *a namespace, the *b namespace, ... future POSIX identifiers have to be in one of these 26, except those that end in digits or underscores. POSIX does not say "future versions of this standard shall not claim new function or other identifiers ending in e". That doesn't mean you stay away from identifiers ending in "e", right?

I wouldn't avoid str_t in the internals of a program though. In the worst case, a clash happens somewhere and we do some renaming; life goes on.

POSIX's reservation doesn't really mean much; all they are saying is "we have some type names ending in _t, and will likely have more, so watch out". Yes, POSIX will likely have such names, and so will every C programmer and his dog. Whoppee dee. POSIX will likely have new names ending in 'e' also, and so on.

jstimpfle · on Dec 3, 2022

I for one like "raw char arrays", and really don't care about missing string functionality in C. I basically use sizeof, snprintf, memcpy and am just fine. I've toyed with defining struct String{ptr,size} sometimes but largely it just gets in the way.

If you think it's necessary, it's very easy to make an argument that you'd have to have a generic type for slices of any type. (Actually, more so than strings, since C is just not a language for domains with a focus on strings).

Now, whether you think a language must have a generic slice type or not, C is simply not the language where you can fit that in.

cassepipe · on Dec 3, 2022

On the other hand it is quite handy as a prefix, s_ for structs, e_ for enums, g_ for globals, t_ for simple typedefs, f_ for function pointers typedefs, u_ for unions... sky is the limit !

And it's quite easy to create an highlighting rule for it in vim if you still did not convert to treesitter. Just put in ~/.vim/after/c.vim :

```

syn match cType /\<\(t\|s\|e\|u\)_\w\+\>/

```

Boom, custom type highlighting for C ! Pick the the letters you will use.

lifthrasiir · on Dec 3, 2022

No wonder why a significant portion of C programmers actually want to keep tags (`struct` in `struct foo`) instead of removing them wih typedef.

mh7 · on Dec 3, 2022

It's reserved in the same sense that google's style guide 'reserves' struct names starting with a capital letter.

Only ISO C can officially reserve names, everyone else just has their personal code/naming style that you can chooseto follow or not.

lifthrasiir · on Dec 3, 2022

I would argue in the other way: the C standard should have a standard string type named `str_t`, and this library is one way to prototype it ;-)

jandrese · on Dec 3, 2022

This is kind of a bikeshed argument, but I'd prefer if the view was labeled as such. So instead of str_t it would be strview. Rust makes this same mistake IMHO and it causes a lot of confusion for beginners. I would personally call the strbuf_t strstore but that's even more nitpicky.

Naming things is one of the hardest problems in CS.

mickjc750 · on Dec 5, 2022

Hmm. I think it's a good argument. I'll think on it for a day or 2. I went to some effort to explain the difference between str_t and strbuf_t in the readme, but if someone want's to modify some code using this library one day, it's unlikely they will have seen the readme. In 10 years I'll probably have forgotten how this whole thing works, and that someone may be me.

Diggsey · on Dec 3, 2022

Slightly ironic that Rust is criticized for having multiple string types, and yet the solution to simplify string handling in C is to introduce the exact same types (str_t == &str, strbuf_t == String) albeit without the safety guarantees.

unclad5968 · on Dec 3, 2022

I don't think anyone minds that rust has multiple string types just that they're effectively named the same thing so people new to rust have no clue which does what without looking it up. Furthermore people without c/c++ experience mostly wont even know there is a difference since most languages don't give you that control over strings.

If rust string were str and strvec or strbuf no one would care.

estebank · on Dec 3, 2022

It is still frustrating to me that C still doesn't have a non-allocating method to handle substring references, which both C++ and Rust have. On the other hand I see people trying to parse files, like JSON, in a non-allocating way in Rust and hit a wall until they realize that nodes need to be escaped for anything useful, which requires owning the node's memory (meaning, you need a String or at least Cow<'_, str>, can't get away with a &str).

thesz · on Dec 3, 2022

Having a string type that has "invalid string" value which is different from empty string value is a bliss.

What is important there is that the invalid string value is completely compatible with most C functions - despite actual data pointer is NULL, the length of data is zero so memcmp, memmove/memcpy and most other functions will not segfault.

This is really thought out approach.

Thank you!

habibur · on Dec 3, 2022

I use a lib like this, but a few changes.

    printf("The string is %"PRIstr"\n", PRIstrarg(mystring));

Simpler: printf("the string is : %.*s",mystr.size, mystr.data)

But that's tedious to write. So create a small macro

    #define ls(x) (x).size,(x).data

And then printf becomes as simple as :

    printf("the string is : %.*s", ls(mystr));

Though OP's macro is possibly doing more.

masklinn · on Dec 3, 2022

> But that's tedious to write. So create a small macro

> #define ls(x) (x).size,(x).data

Doesn't that double-evaluate `x`?

habibur · on Dec 3, 2022

It does.

naasking · on Dec 3, 2022

Why not ropes?

https://github.com/josephg/librope

abcd_f · on Dec 3, 2022

An empty rope is 16 bytes on 64bit machines and whooping 24 bytes if it's a wchar rope. In some cases it might be ok, in many others it's not.

the-printer · on Dec 3, 2022

Are the any good resources that explain the concept of strings in C, particularly why they’re considered to be so difficult to manage? I’m interested in the language, and that along with its safety concerns seem to be the two most frequent complaints against it that I read about online.

jstimpfle · on Dec 3, 2022

"Strings" are quite an abstract concept. They are a linear sequence of characters. But there are a number of ways to represent them - the simplest of which is a contiguous memory allocation, but depending on the use case you'd need more complex schemes. There are also different ways to do the necessary memory management (e.g. allocate statically at compile time vs dynamically at run time).

One of the most complex representations is probably the string rope datastructure - a balanced tree of string chunks, supporting efficient insertion and removal anywhere in the string.

Specific to C, as well as lots of low-level APIs, is only that strings are often expected to be contiguously laid out in memory and terminated with a NUL (0) byte. So you need to make sure that you always terminate with a NUL after writing to string storage.

Other than that, strings aren't any harder than other aspects of programming with manually managed memory.

Maybe motivated from higher-level dynamic or managed languages, is the popular idea that strings should always be allocated dynamically (like std::string for example), and support operations like string-append with automatic reallocation if the currently allocated memory isn't enough to store the new string.

In practice, that's not true at all - unless you are in a domain where lots of small intermediate strings are generated. This is pretty inefficient anyway and there is likely no point to use C in this case.

By far most strings in most domains are either completely static (use string literals), or are created once in a sequence of append operations and then never changed again. I get by, doing many different things from GUI apps to networking to parsers and interpreters, without any sophisticated string type. All I do is define some printf-like APIs to do logging, for example. Those typically just use a fixed size buffer for the formatting, and then flush that buffer to e.g. stderr. or flush it to a dynamically allocated memory buffer, but there almost never is a need to reallocate that string later.

WalterBright · on Dec 3, 2022

> why they’re considered to be so difficult to manage?

Back in the 90s, I was very experienced with C strings and managing them. Then I chanced to look at BASIC again, and realized that strings in BASIC were so simple and intuitive. Why couldn't C be like that? When I started on the design of D, I decided that it had to make strings as easy to do as BASIC did.

And D does.

The trouble with C strings is the 0 termination of them. This means:

1. to get the length of the string, you have to scan it. This is expensive.

2. when manipulating strings, a common error is to get off by one in the storage because of the 0 termination

3. you cannot get a subset of the string without making a copy. Not only is a copy expensive, but then you have to keep track of the memory for it

4. there's no way to check for buffer overflows

D's design, which uses a phat pointer (length, ptr) for strings, solves these problems.

tored · on Dec 3, 2022

A year ago I picked up the BASIC dialect PureBasic. Pleasant surprise actually, the syntax of the PureBasic dialect is a bit archaic, but if you accept that it is much easier and faster to get anything done compared to C (and C++). Personally I find low level topics easier to grok in PureBasic than in C even though they mirror the same concepts. PureBasic has Unicode strings built in.

It is a bit shame that BASIC has such a bad reputation, there are many BASIC dialects that does the job well still today.

https://www.purebasic.com/documentation/reference/ug_string....

Someone · on Dec 3, 2022

The problem is that C doesn’t have strings; it has functions that treat sequences of non-zero bytes followed by a zero bytes as if they are strings.

So, you can’t ask it to create a string that contains the result of appending a string to another one. If you want to append two ‘strings’, you have to create a buffer large enough to hold the result, and then copy in the two sequences of bytes. And even for doing that, the library functions aren’t optimal. The basic “append this string’s data to that string, assuming there’s enough space to do so” function is strcat. It walks the first string to find the zero byte, but to “create a buffer large enough to hold the result” you already must do that.

See for example https://stackoverflow.com/questions/21880730/c-what-is-the-b...

jstimpfle · on Dec 3, 2022

You can use snprintf to easily achieve any concatentation you'd like.

    len = snprintf(buffer, buffersize, "%s%s%d", string_1, string_2, int_1);

    if (len + 1 /*NUL*/ > buffersize)
    {
        // not enough space
    }

You can also use this to dynamically allocate any formatted string

    len = snprintf(NULL, 0, ....);

    buffersize = len + 1;  /* NUL */
    
    buffer = allocate(buffersize);

    snprintf(buffer, buffersize, ... /*same args as before*/);

tom_ · on Dec 3, 2022

I like the printf family too. Any time you're doing a bunch of strcat or whatever it's almost always massively easier to use a format string to get the same result. Very easy to get the desired width/precision/alignment, and if you need numbers, printf has your back. It even does the bounds checking for you! (And how often do you get that in C.)

It won't be as fast, but it's almost always not a problem, and the nice thing about C and C++ is that the char-by-char route is still available when it is.

I like to use asprintf, when available: https://man7.org/linux/man-pages/man3/asprintf.3.html - and when not available, I add it, along the lines of the snippet you present.

Here's something I've found a useful upgrade to asprintf, as it frees the passed-in buffer after expanding the format string. You can just pass the same char ** repeatedly and it'll update the char * appropriately each time.

    int xasprintf(char**p,const char *fmt,..) {
        int n=0;
        char *p2=nullptr;
        if(fmt) {
            va_list v;
            va_start(v);
            n=asprintf(&p2,fmt,v);
            va_end(v);
        }
        if(n>=0) {
            free(*p);
            *p=p2;
        }
        return n;
    }

e-dant · on Dec 3, 2022

C strings are pointers to memory. There are semantics and assumptions encouraging null-character delimited strings, but not every API follows those rules (just got done working with a Windows API that doesn’t).

Often, you have to both null-delimit your string and store its length somewhere. That’s the dangerous part. Messing either of those up, or passing your string to an API that messes that up, is not safe.

C strings are pointers to memory, either the stack or the heap, and follow exactly the same rules as everything else in that chaotic space: Not many.

the-printer · on Dec 3, 2022

Thank you for this. C programming sounds almost like some sort of combat sport. Riveting.

lelanthran · on Dec 3, 2022

> Thank you for this. C programming sounds almost like some sort of combat sport. Riveting.

I've done it for decades; it isn't really as bad as hype-attracting headlines would have you believe.

Munitions control, aircraft management systems, industrial automation systems, and many more life-critical systems were programmed in C for decades with comparatively little danger from the language intrinsics leading to death.

It's easy to look at the stats and say "there's a few dozen CVEs annually due to C footguns", but that's a few dozen out of hundreds of millions of deployed systems that are written in C.

In practice, very few lines of C code bypass the type system, so you get much fewer bugs than an equivalent system in the more usual dynamic programming languages (Python, Javascript, etc).

thesnide · on Dec 3, 2022

Wondering if the big influx of C derived CVE are old or new code. If it is new code, I'm also wondering about the brain damage that those safe languages causes.

Yes, it is better to have memory safe languages. But it encourages sloppiness as "nothing can happen". Then those folks aren't fit to write anything else. Which closes the feedback loop on inefficient but safe languages.

Which becomes the same thing in airplanes. Pilots don't really know how to fly without instruments anymore.

wadd1e · on Dec 3, 2022

>Which becomes the same thing in airplanes. Pilots don't really know how to fly without instruments anymore.

Well that's just a blatantly wrong generalisation you made there, curious as to where you got that from. Consider looking up how pilot training is done before making such assumptions. Even though modern airplanes make heavy use of technology, there are emergency scenarios where lots of instruments may not work, and pilots receive more than enough training to fly an airplane in that scenario just to give one example among tons of others.

edit: grammar

thesnide · on Dec 3, 2022

It seems there's a difference between theory and practice.

https://www.businessinsider.com/too-many-pilots-cant-handle-...

Now, I'm not an expert into pilots statistics, so my example might be off, but I do see a worrisome pattern in my daily work (software engineering). Blind reliance on those "frameworks".

Which isn't bad in itself, but no-one really knows how they work anymore. They just assume. And that leads to lots of cargo cults. Which ranges from inefficient to outright dangerous.

marssaxman · on Dec 3, 2022

More like fire-performance: it looks dangerous, and it does require some finesse, but it's really satisfying when you get in the flow, and burns are both less frequent and less serious than you might imagine as an onlooker.

pjmlp · on Dec 3, 2022

It is more like a combat sport, doing martial art moves, while trying to juggle knives between moves.

Quentak · on Dec 3, 2022

I have written a short article explaining why null terminated strings as they exist in C cannot represent proper ASCII and UTF-8 because of the null terminator. It's not a full explanation of how strings work but it might be helpful for you.

https://kttnr.net/blog/null-terminated-strings-are-incorrect...

lifthrasiir · on Dec 3, 2022

SQLite does store a null character in strings, it has lots of documented [1] issues in the API level though.

[1] https://www.sqlite.org/nulinstr.html

Quentak · on Dec 3, 2022

Thanks for this link.

How do you get the null byte into the string? Is it through casting blob to string? The way I have encountered this is when using the C API in which string arguments for prepared statements are passed as char pointers. If those contain the null byte then the string is cut off.

Allowing null characters and then mishandling them is worse than not allowing them.

musicale · on Dec 3, 2022

> str.h defines the following str_t type:

    typedef struct str_t { 
      const char* data;
      size_t size;
    } str_t;

Sort of a hybrid of C style (pointer) and Pascal style (bounded array) strings?

pjmlp · on Dec 3, 2022

This is the kind of string libraries that WG14 should care about.

Kudos for having a go at it.

gkfasdfasdf · on Dec 3, 2022

Hoping someone can educate me, what are the advantages of having the last member of strbuf_t be a variable length array (char cstr[]) instead of just a char*?

ComputerGuru · on Dec 3, 2022

You can store the string as part of the same heap/stack allocation rather than as a separate allocation.

ksherlock · on Dec 3, 2022

With inline data, only one malloc is needed for the buffer housekeeping and character data. It's also probably slightly better for cache performance since the housekeeping data and string data are together.

gkfasdfasdf · on Dec 3, 2022

Ah I see. If you want to refer to a string that was not part of this allocation you would use the other str_t type,

e-dant · on Dec 3, 2022

I guess it’s nice for a C string API, but what’s the motivation to use and create this? Wouldn’t externing some C++ symbols (or Rust) work more smoothly?

lelanthran · on Dec 3, 2022

> Wouldn’t externing some C++ symbols (or Rust) work more smoothly?

For the C++ case, it's not that easy due to C code that cannot handle exceptions thrown in C++ code.

For the Rust bit, I'm not sure - creating the library in Rust and letting it be called from C makes the whole rust library unsafe because the data returned from the Rust API would lose ownership information, and is no more safe than simply writing it in C.

zajio1am · on Dec 3, 2022

1. Ditching null termination makes it cumbersome for interoperability with C ecosystem.

2. It has terrible overhead.

LegionMammal978 · on Dec 3, 2022

> All strbuf functions maintain a null terminator at the end of the buffer, and the buffer may be accessed as a regular c string using mybuffer->cstr.

So effectively a str_t works like an std::string_view from C++, and strbuf_t works like an inline std::string.

To produce a null-terminated string from a section of a longer string requires an allocation, unless you can temporarily modify the original string to replace one of its characters with a terminator.

zajio1am · on Dec 3, 2022

Well, the documentation says that null terminator is maintained at the end of the buffer (i.e. mybuffer->cstr[mybuffer->capacity - 1]), not at the end of the string stored in the buffer (i.e. mybuffer->cstr[mybuffer->size]).

LegionMammal978 · on Dec 3, 2022

Not sure where you're getting that interpretation from. If you look at the actual code, it sets buf->cstr[buf->size] = 0 every time the string is resized. After all, what else could "the buffer may be accessed as a regular c string" possibly mean?

zajio1am · on Dec 3, 2022

> Not sure where you're getting that interpretation from.

That is just plain reading of "null terminator at the end of the buffer", as 'buffer' is just place in memory, regardless of what is stored in it. 'End of the buffer' is commonly used for end of such reserved memory, not end of valid data in that memory.

But maintaining the null-terminated string in the buffer is much more useful behavior than just maintaining null terminator at the end of the buffer, so it is likely just sloppiness in the documentation.

gjvc · on Dec 3, 2022

worth comparing to https://cr.yp.to/lib/stralloc.html

plan999 · on Dec 3, 2022

You should look at an even better string library. Much more functions and safety for split/joint/tokenizer/etc it's a fork of the plan9 string library bstring.

https://bstring.sourceforge.net/

kaba0 · on Dec 3, 2022

All in all, C is still not expressive enough for even such a basic data structure as strings.