Str: Yet another string library for C language

abqexpert · on Nov 25, 2020

C2x is expected to gain the char8_t type and almost certainly will not gain any new string handling routines. In the year 2030 I am expecting to still see more rounds of posts comparing string libraries for C. With more than 4 decades of development, we still don't have a great solution to string handling in C.

Every library like this is incompatible with the others, and most make slightly odd choices. In this library, for example, ownership of the string is denoted by a bit in the info/size field. Not that that is a bad choice or anything, but it is one reason someone might decline to use it and decide to write their own.

The lack of namespacing in C doesn't help. This library chooses str_ as its prefix, which is somewhat likely to collide with other libraries. It also makes it harder to write libraries that allow the string library to be switched out.

Mikhail_Edoshin · on Nov 26, 2020

Maybe we don't have a single solution to string handling in C because there's no such thing as a "string". A text editor would need a very different "string" than a typesetting system or an indexing engine.

josephcsible · on Nov 25, 2020

> almost certainly will not gain any new string handling routines

It looks like it'll be getting strdup and strndup.

abqexpert · on Nov 25, 2020

Thanks for that! I hadn't seen it yet, https://en.cppreference.com/w/c/string/byte/strdup says it will be C23 as well.

EdSchouten · on Nov 26, 2020

Fun fact: 'str*' is also reserved by POSIX for future extensions. This means that this library completely collides with POSIX.

lifthrasiir · on Nov 26, 2020

Note that ISO C only reserves /^str[a-z]/ for <string.h> and <stdlib.h>. POSIX's reservation of /^str_/ (note that you are still allowed for identifiers like str8line) is for STREAMS (<stropts.h>) [1], a completely different thing from strings...

[1] https://en.wikipedia.org/wiki/STREAMS

flohofwoe · on Nov 26, 2020

IMHO it's better to ignore "universal string processing" in the C standard completely instead of providing a half-assed and over-complicated solution like in most other languages. String processing isn't exactly a core-competency of C, and never will be, it's better to use a different language for such things.

saagarjha · on Nov 26, 2020

It’s getting memccpy, which is an extremely important function for the creation of a performant, safe string copying method.

pjmlp · on Nov 26, 2020

It is impossible to be safe if size is a function argument that cannot be validated without hardware support.

saagarjha · on Nov 26, 2020

My definition of safety likely differs from yours.

pjmlp · on Nov 26, 2020

My definition of safety means having a size greater than the actual string doesn't turn an innocent looking call into a CVE database entry.

I bet the security industry agrees with my definition.

bzb6 · on Nov 26, 2020

Isn’t a function like that trivial to implement?

fefe23 · on Nov 25, 2020

I think it is good if everybody implements their own string library.

It builds character and you learn something from it.

It is a rite of passage.

However, don't add to the pile of dependency hell that is already plaguing many open source projects. If you feel uneasy with how C strings work, consider switching programming languages instead. You will probably have an easier time and there will be less unmaintained incompatible string libraries rotting around on github.

pengaru · on Nov 25, 2020

> However, don't add to the pile of dependency hell that is already plaguing many open source projects

If you create small libraries that don't produce shared objects intended to stand on their own with a stable API/ABI, but are simply headers or at most produce a .a from source fully intended to become vendored in-tree, you're not contributing to "dependency hell".

nopurpose · on Nov 26, 2020

In theory this vendored library might show up multiple times in a dependency tree and be incompatible with each other.

pengaru · on Nov 26, 2020

It's hardly a "dependency hell" when it only affects developers, in what are essentially unique collision type situations, and are generally addressable by the developer because the source is all present. And upstream maintainers of such intended-to-be-vendored code should generally be receptive to improving compatibility and build system configurability for such situations. And if they're not receptive/it's abandonware, congratulations your vendored library is now a fork and fix it yourself.

When we refer to "dependency hell", AIUI, it's in reference to unresolvable runtime dependencies creating hell for end-users.

paledot · on Nov 26, 2020

> It builds character

That's awful and I love it.

dvfjsdhgfv · on Nov 25, 2020

> I think it is good if everybody implements their own string library.

...until someone exploits the bugs in it.

Everyone who did the exercises in K&R should be able to write their own string library, probably with less bugs than the standard one. However, I really feel it's much better for everyone to use proven code like bstring.

saagarjha · on Nov 26, 2020

> probably with less bugs than the standard one

Fewer bugs than what “standard one”?

dvfjsdhgfv · on Nov 26, 2020

Whatever implements string.h on your system, and other functions dealing with string input. Some of these functions simply shouldn't be used at all. An extreme case is gets() that was phased out, but many others are no better.

saagarjha · on Nov 26, 2020

The string functions in your system are almost certainly less buggy than anything you’re going to write.

dvfjsdhgfv · on Nov 26, 2020

You're kidding, right? We're not talking about the implementation, but the design. If I ever wanted to write a gets() replacement, it would definitely have proper checks in place to prevent buffer overflow. Everyone using strcpy() is playing with fire. You'll get it right 9 times and make a mistake the 10th time. It's not that the people who implemented these functions are stupid, but they were designed in different times for other types of environments.

creata · on Nov 26, 2020

> probably with less bugs than the standard one… We're not talking about the implementation, but the design.

That's not how most people use the word "bug".

dvfjsdhgfv · on Nov 26, 2020

Literally from the man page of gets():

BUGS: Never use gets().

GoblinSlayer · on Nov 26, 2020

bstring is allocated on heap, so slicing requires allocation.

azhenley · on Nov 25, 2020

Every time I think to use C for something, I re-realize how terrible it is to do anything involving strings. Although this library looks nice, I’ll still have to compose and manage them myself, which is a major headache.

WalterBright · on Nov 25, 2020

When I review C code, I look for strncpy, etc., and give them special attention. There's always a bug or two in it.

0-terminated strings not only have proven to be a rich source of bugs, they're remarkably inefficient as well [1]. Doing better was a major focus of the initial design of D.

[1] This is because of constantly scanning to get the length (which also necessitates reloading the string contents into the memory cache), and having to make copies of strings instead of just slicing them.

aidenn0 · on Nov 25, 2020

Well you don't have to make copies of strings to slice them, just ask strtok! </s>

abqexpert · on Nov 25, 2020

If you are using strtok and want to slice at an arbitrary point, then you would either have to lose some data in the original string, move or copy the original string to make space for the extra delimiter/null character, or have set up the string ahead of time to contain the delimiter in the desired position.

souprock · on Nov 26, 2020

There is also the mangle-use-repair choice. I've done that with pathnames for creating nested directories.

C programmers are expected to make the best choice based on the situation. The various choices trade off memory usage, CPU usage, source code readability, and program correctness.
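For anyone who hasn't seen mangle-use-repair, here is a minimal sketch of the pattern. The names for_each_prefix, prefix_log and log_prefix are invented for this example; a real nested-directory version would presumably call mkdir on each prefix instead of logging it.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Call visit() once for every directory prefix of path, e.g. "a" and "a/b"
 * for "a/b/c". The buffer is modified in place and restored afterwards, so
 * path must be writable and must not be used by another thread meanwhile. */
void for_each_prefix(char *path,
                     void (*visit)(const char *prefix, void *ctx),
                     void *ctx)
{
    assert(path != NULL && visit != NULL);
    if (*path == '\0')
        return;

    /* Start at index 1 so a leading '/' does not produce an empty prefix. */
    for (char *p = path + 1; *p != '\0'; p++) {
        if (*p != '/')
            continue;
        *p = '\0';          /* mangle: cut the string at this separator */
        visit(path, ctx);   /* use: path is now a NUL-terminated prefix */
        *p = '/';           /* repair: put the separator back */
    }
}

/* Example visitor: records the length of each prefix it is shown. */
typedef struct {
    size_t lens[8];
    size_t count;
} prefix_log;

void log_prefix(const char *prefix, void *ctx)
{
    prefix_log *log = ctx;
    if (log->count < 8)
        log->lens[log->count++] = strlen(prefix);
}
```

Passing a string literal as path is undefined behavior, since the function writes to the buffer.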

account42 · on Nov 26, 2020

> There is also the mangle-use-repair choice.

Which is problematic for thread safety and depending on the source of the string (constant) may not be possible.

souprock · on Nov 27, 2020

It's not problematic. C programmers are expected to avoid screwing that up. C is a full-power language.

If available, strdupa() would be a fine way to get a suitable local copy of the string. Commonly though, the programmer knows that there will not be threads and can make the string non-constant.

codezero · on Nov 26, 2020

I encountered Hollerith constants in an ancient Fortran codebase I worked on and was thrilled to see folks were doing clever stuff with strings in the 60s.

I wonder how much time was wasted in early computing (maybe not wasted really) because of the fear of incompatibility that is getting smaller and smaller as computing platforms coalesce into standardized-ish things.

Watching the M1 roll out and how it doesn't seem to care much that x86 is a thing and gets along with its life has been fascinating.

pansa2 · on Nov 25, 2020

How is the support for strings in C++? Presumably better than in C, but is it good enough when compared to other compiled languages - Go, Rust etc?

WalterBright · on Nov 25, 2020

D uses "phat pointers" for strings, aka a length/pointer pair. Over the years, this has proven to be simple, efficient, and resistant to errors. It means array bounds checking can be automatically done. It enables efficient slicing.

String literals also have an extra 0 appended, making it transparently easy to still pass strings to C functions like printf.
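For comparison, the same length/pointer idea can be approximated by hand in C. This is only a sketch with made-up names (slice, slice_sub, slice_eq); unlike D, nothing here is enforced by the language, and the bounds checks are plain asserts that disappear under NDEBUG.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical "fat pointer": a pointer plus a length, no terminator needed. */
typedef struct {
    const char *ptr;
    size_t len;
} slice;

/* Wrap a C string; this is the only place that scans for the terminator. */
slice slice_from_cstr(const char *s)
{
    assert(s != NULL);
    return (slice){ s, strlen(s) };
}

/* Sub-slice [start, end): bounds-checked, no allocation, no copy. */
slice slice_sub(slice s, size_t start, size_t end)
{
    assert(start <= end && end <= s.len);
    return (slice){ s.ptr + start, end - start };
}

/* Equality by length plus memcmp, so embedded NUL bytes are fine. */
int slice_eq(slice a, slice b)
{
    return a.len == b.len && memcmp(a.ptr, b.ptr, a.len) == 0;
}
```

A sub-slice is not NUL-terminated, so it cannot be handed directly to functions like printf("%s") without a copy or a precision argument.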

abqexpert · on Nov 25, 2020

>"phat pointers"

I don't know if that is a typo given you normally call them "fat pointers", but they are "pretty hot and tempting".

__d · on Nov 25, 2020

C++ has string support in the standard library.

It doesn't have the same breadth of features as, say, Python's string class, but it's ok.

See, eg. https://en.cppreference.com/w/cpp/string

emmanueloga_ · on Nov 26, 2020

In my opinion, still a bit hairy, the reason something like nowide exists [1].

1: https://www.boost.org/doc/libs/develop/libs/nowide/doc/html/...

kenniskrag · on Nov 25, 2020

How is the support of unicode in c++?

PaulDavisThe1st · on Nov 26, 2020

C++ itself doesn't support it. There are libraries that provide unicode-aware handling of strings/vectors of bytes. It's not always clear that you want unicode-aware code when dealing with unicode, but there are times when it is nice to have.

ansgri · on Nov 25, 2020

It's exceedingly verbose, but decent, if you have a recent language version and/or Boost.

herodoturtle · on Nov 25, 2020

Gosh this evoked a keen sense of nostalgia. I don't miss C strings at all. Along with segmentation fault, core dumped. Ruined many a night!

pcdoodle · on Nov 25, 2020

I know. This and the fact that compilers never bitch about a single = in if() statements really take time out of my life...

WalterBright · on Nov 25, 2020

Most C compilers will give a warning for that. D makes it an error in the grammar. `a < b < c` is also an error in the grammar.
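To spell out why `a < b < c` is a trap in C: relational operators associate left to right, so the 0-or-1 result of the first comparison gets compared against c. A small sketch (function names are just for illustration):

```c
#include <assert.h>

/* What C computes for `a < b < c`, written with explicit parentheses:
 * the int result (0 or 1) of (a < b) is compared against c. */
int chained_less(int a, int b, int c)
{
    return (a < b) < c;
}

/* What was almost certainly intended. */
int strictly_increasing(int a, int b, int c)
{
    return a < b && b < c;
}
```

With a = 3, b = 2, c = 1 the chained form is true, because (3 < 2) is 0 and 0 < 1.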

thechao · on Nov 26, 2020

I gave up trying to teach C after 2 years teaching it at university. You’ve been at it, what? 25+ years? Mad props. Thanks for great C/++ compilers, and double-thanks for D!

unwind · on Nov 25, 2020

Always interesting with C posts! Two notes:

- I'm pretty sure public symbols starting with "str" are reserved by the standard.

- Declaring function arguments as const is pretty silly for value types, imo.

1wd · on Nov 26, 2020

Function names starting with "str" followed by a lowercase letter are reserved. So technically "str" itself and "str_" are not.

Google234 · on Nov 26, 2020

Doesn’t it stop you modifying them inside the function?

account42 · on Nov 26, 2020

I'm not sure about C but at least in C++ having const on the prototype is meaningless as you can still have the arguments as non-const in the actual definition. Considering that C is usually less strict with these things I'd expect that to be the case there too.
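C does behave the same way: top-level qualifiers on parameters are ignored when deciding whether two declarations of a function are compatible. A minimal sketch (clamp_to_byte is an invented example function):

```c
#include <assert.h>

/* Prototype: the parameter is declared const. */
int clamp_to_byte(const int value);

/* Definition of the same function with a non-const parameter. The two
 * declarations are compatible, and the body is free to modify its copy. */
int clamp_to_byte(int value)
{
    if (value < 0)
        value = 0;
    if (value > 255)
        value = 255;
    return value;
}
```

So const on a by-value parameter only matters in the definition, where it stops the function body from modifying its own local copy.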

GoblinSlayer · on Nov 26, 2020

String isn't a value type.

unwind · on Nov 26, 2020

Yes it is, it's a small struct.

cassepipe · on Nov 26, 2020

It would be a pity not to mention here the Simple Dynamic String (SDS) library made by the maker of Redis : https://github.com/antirez/sds

It is also very well documented. And all you need to embed it in your project is : sds.c sds.h sdsalloc.h

The source code is small and every C99 compiler should deal with it without issues.

lifthrasiir · on Nov 26, 2020

It seems that everyone implementing their own string library (including, eh, antirez) thinks masquerading pointers is cute, but in my opinion and experience it's very dangerous because it requires a specific coding convention that can't be checked by compilers. SDS is no exception to this problem:

    sds a = sdsnew("hell");
    sds b = a;
    a = sdscat(a, "o"); // this invalidates b

Masqueraded pointers are inherently linear (or affine if you are pedantic). Any length-changing updates to such pointers can potentially reallocate them, so any value can't be "updated" more than once; values should be consumed and returned by many operations. No typical C types behave like this: primitive values or structs can be updated by assignments and pointers can be updated by dereference. C doesn't support linear types and, while normal pointers do need care, masqueraded pointers need much more care to use correctly. Yes, you can replicate the same bug with normal pointers by replacing the third line with `free(a);`, but you wouldn't expect a bug for non-destructive operations. (Put in the other way, masqueraded pointers make many otherwise non-destructive operations destructive.)

While technically not a string library, this and the strict-aliasing issue for type-generic routines prompted me to write my own small extensible vector library [1] years ago.

[1] https://gist.github.com/lifthrasiir/4422136
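To illustrate the alternative to masquerading, here is a minimal sketch of a growable string kept behind a struct that is always passed by pointer. The names strbuf, strbuf_append and strbuf_free are invented for this example; this is neither the SDS API nor the code from the gist.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical growable string. Callers hold a pointer to the struct, which
 * never moves; only the data pointer inside it may be reallocated. */
typedef struct {
    char *data;   /* NUL-terminated once non-NULL */
    size_t len;
    size_t cap;
} strbuf;

/* Append s. Returns 0 on success, -1 if allocation fails (buffer unchanged).
 * Size overflow is not handled in this sketch. */
int strbuf_append(strbuf *b, const char *s)
{
    assert(b != NULL && s != NULL);
    size_t n = strlen(s);

    if (b->len + n + 1 > b->cap) {
        size_t cap = b->cap ? b->cap : 16;
        while (cap < b->len + n + 1)
            cap *= 2;
        char *p = realloc(b->data, cap);
        if (p == NULL)
            return -1;
        b->data = p;
        b->cap = cap;
    }
    memcpy(b->data + b->len, s, n + 1);   /* includes the terminator */
    b->len += n;
    return 0;
}

void strbuf_free(strbuf *b)
{
    assert(b != NULL);
    free(b->data);
    *b = (strbuf){0};
}
```

Every alias is a strbuf * to the same struct, so an append through one alias stays visible through the others. Copying the struct by value would bring the original problem back.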

andrewshadura · on Nov 25, 2020

Interesting: this library makes use of C generics, so you can str_join a str into a str or into a FILE.
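For readers who haven't used C11 generics, this is roughly how that kind of dispatch works. It is a sketch only: buffer, put_to_buffer, put_to_file and put are hypothetical names, not the library's API.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical destination standing in for a string object. */
typedef struct {
    char data[64];
    size_t len;
} buffer;

int put_to_buffer(buffer *dst, const char *s)
{
    assert(dst != NULL && s != NULL);
    size_t n = strlen(s);
    if (dst->len + n >= sizeof dst->data)
        return -1;                            /* would not fit */
    memcpy(dst->data + dst->len, s, n + 1);   /* includes the terminator */
    dst->len += n;
    return 0;
}

int put_to_file(FILE *dst, const char *s)
{
    assert(dst != NULL && s != NULL);
    return fputs(s, dst) == EOF ? -1 : 0;
}

/* _Generic selects a function from the static type of dst at compile time;
 * a dst of any other type is a compile error. */
#define put(dst, s) _Generic((dst),   \
        buffer *: put_to_buffer,      \
        FILE *:   put_to_file)((dst), (s))
```

So put(&buf, "x") and put(stdout, "x") share one name but compile to calls of different functions.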

recursivedoubts · on Nov 25, 2020

oof bit packing to save a single bit?

https://github.com/maxim2266/str/blob/f4e84657b23977ab3c5cd7...

seems unlikely to matter if you have a bunch of strings flying around...

two features I'd love to see implemented:

- wrapping thread safe tokenization using strtok_r so it's pleasant to tokenize a string

- sprintf-like formatting

anything that improves string handling in C is doing God's work

jsnell · on Nov 25, 2020

It's not about saving a bit, but about saving an entire word. If the length and ownership weren't packed together, you'd need one more field in the str struct for the ownership bit, and due to alignment the minimum size increase would be a word.
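A sketch of the size argument, using a made-up encoding (bit 0 as the ownership flag, the remaining bits as the length). The library's actual layout may differ.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Unpacked layout: the flag needs its own member, and padding rounds the
 * struct up to a multiple of its alignment. */
typedef struct {
    const char *ptr;
    size_t len;
    bool owner;
} str_unpacked;

/* Packed layout (hypothetical encoding): bit 0 of info is the ownership
 * flag, the remaining bits hold the length. */
typedef struct {
    const char *ptr;
    size_t info;
} str_packed;

str_packed packed_make(const char *ptr, size_t len, bool owner)
{
    assert(len <= ((size_t)-1 >> 1));   /* the length loses one bit */
    return (str_packed){ ptr, (len << 1) | (owner ? 1u : 0u) };
}

size_t packed_len(str_packed s)   { return s.info >> 1; }
bool   packed_owner(str_packed s) { return (s.info & 1u) != 0; }
```

On a typical 64-bit ABI the unpacked struct is 24 bytes and the packed one is 16, which matters when the struct is passed around by value.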

kevin_thibedeau · on Nov 25, 2020

strtok_r() destructively modifies its input so wrapping it works against this library's objective of using const pointers. It is easy enough to reimplement, though, and I've done this for similar sub-string pointer objects that work without NUL termination.
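A minimal sketch of what such a reimplementation might look like, assuming a simple pointer/length view type. The names span and next_token are invented here and are not taken from the library.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* A substring view into someone else's buffer; not NUL-terminated. */
typedef struct {
    const char *ptr;
    size_t len;
} span;

static bool is_delim(char c, const char *delims)
{
    /* strchr would also match the terminator of delims, so exclude '\0'. */
    return c != '\0' && strchr(delims, c) != NULL;
}

/* Store the next token of *rest in *tok and advance *rest past it.
 * Returns false when no token is left. Like strtok, runs of delimiters are
 * skipped, but the input bytes are never written to. */
bool next_token(span *rest, const char *delims, span *tok)
{
    assert(rest != NULL && rest->ptr != NULL && delims != NULL && tok != NULL);
    const char *p = rest->ptr;
    const char *end = rest->ptr + rest->len;

    while (p < end && is_delim(*p, delims))    /* skip leading delimiters */
        p++;
    if (p == end)
        return false;

    const char *start = p;
    while (p < end && !is_delim(*p, delims))   /* scan the token */
        p++;

    *tok = (span){ start, (size_t)(p - start) };
    *rest = (span){ p, (size_t)(end - p) };
    return true;
}
```

All state lives in the caller's span, so it works on string literals and is reentrant without any hidden static variable.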

souprock · on Nov 26, 2020

The existence of strtok_r() is weird anyway. If we can make errno thread-safe, there is no reason why plain strtok() can't be thread-safe. The idea that somebody wants strtok() state shared across threads is just as weird as the idea that errno should be shared across threads.

kevin_thibedeau · on Nov 26, 2020

The C API was never designed with threading in mind. Not all platforms support threads so the old behavior must stay around.

souprock · on Nov 27, 2020

Say what? You like to call plain strtok() from different threads, having them all update a shared global state? I think you missed the point here, because that would be some really evil usage of the strtok() function.

We had "int errno" as global state. We fixed it, in a compatible way, to be thread-safe. Platforms without threads can still implement it the old way if desired.

The same kind of compatible fix could have been done with the strtok() function. There was no need to introduce another function.

Simply: the internal state of strtok() shall be distinct for each thread. (which is trivial if the platform only supports a single thread)
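A sketch of what that could look like, assuming C11 _Thread_local is available. strtok_tls is a hypothetical name; this is not how any particular libc implements strtok.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* strtok-compatible calling convention, but the saved position lives in
 * thread-local storage, so each thread tokenizes independently. */
char *strtok_tls(char *s, const char *delims)
{
    static _Thread_local char *saved = NULL;
    assert(delims != NULL);

    if (s == NULL)
        s = saved;                   /* continue the previous string */
    if (s == NULL)
        return NULL;                 /* nothing left to continue */

    s += strspn(s, delims);          /* skip leading delimiters */
    if (*s == '\0') {
        saved = NULL;
        return NULL;
    }

    char *end = s + strcspn(s, delims);
    if (*end != '\0') {
        *end = '\0';                 /* terminate the token in place */
        saved = end + 1;
    } else {
        saved = NULL;                /* that was the last token */
    }
    return s;
}
```

It still modifies its input and still allows only one tokenization at a time per thread, which are the two things strtok_r's explicit state pointer and a non-destructive design address.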

chmaynard · on Nov 25, 2020

Strings of ASCII characters only? Or can this library be used with Unicode as well? Just asking.

shakna · on Nov 25, 2020

Depends what you mean by Unicode.

All the UTF-8 codepoints can be held inside an 8bit char, which is what this library seems to use under the covers.

You might need to add a couple UTF-specific methods if you want number of graphemes rather than number of bytes, but there's nothing to stop you placing UTF8 data inside a char buffer.
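As an example of such a UTF-specific helper, counting code points in valid UTF-8 only needs a byte loop. This is a sketch with an invented name (utf8_codepoints); it assumes the input is already valid UTF-8 and does no validation.

```c
#include <assert.h>
#include <stddef.h>

/* Count code points in a NUL-terminated string assumed to be valid UTF-8:
 * every byte except a continuation byte (10xxxxxx) starts a new one. */
size_t utf8_codepoints(const char *s)
{
    assert(s != NULL);
    size_t n = 0;
    for (const unsigned char *p = (const unsigned char *)s; *p != '\0'; p++) {
        if ((*p & 0xC0) != 0x80)
            n++;
    }
    return n;
}
```

This counts code points, not graphemes: "e" followed by a combining acute accent is two code points but displays as one character, and segmenting graphemes needs the Unicode data tables.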

Xophmeister · on Nov 25, 2020

I don’t know what you mean by “all the UTF-8 codepoints can be held inside an 8bit char”. All Unicode codepoints obviously cannot be held in 8-bits. The UTF-8 encoding matches ASCII over the first 7-bits, but that’s not relevant. You can UTF-8 encode Unicode codepoints into a bunch of 8-bit chars, but then you can encode anything you like into a bunch of 8-bit chars; a JPEG file for instance.

shakna · on Nov 25, 2020

> All Unicode codepoints obviously cannot be held in 8-bits.

I mentioned UTF-8 specifically, because the UTF-8 encoding actually does specify this particular feature:

> UTF-8 is capable of encoding all 1,112,064 valid character code points in Unicode using one to four one-byte (8-bit) code units. Code points with lower numerical values, which tend to occur more frequently, are encoded using fewer bytes. [0]

[0] https://en.wikipedia.org/wiki/UTF-8

stonemetal12 · on Nov 25, 2020

>Unicode using one to four one-byte

UTF-8 is a variable length encoding, some characters need 8 bits, other characters need 32 bits per your own quote.

shakna · on Nov 25, 2020

> one to four one-byte (8-bit) code units

And it is designed so that it fits in one-byte units. That was a central goal of the encoding.

So... If you have a char buffer, like we have been talking about, you can toss any valid UTF-8 sequence inside it.

Xophmeister · on Nov 26, 2020

It’s very rare that any data encoding doesn’t come in a multiple of 8-bits. Have you ever seen a 1.5 byte (12-bit) file? UTF-8 isn’t special in that regard. You can put literally anything into a char buffer and decode it how you like.

alarge · on Nov 26, 2020

The issue isn't whether or not your character encoding is always a multiple of 8 bits. It is whether or not you can use standard (octet-focused) parsing functions to deal with those strings. This is what makes utf-8 "special". No byte of a utf-8 multibyte sequence will ever have a value below 128. So for most "syntactic" parsing problems, you can use standard C functions to deal with utf-8 strings - something that is not true with most other multibyte character encodings.

account42 · on Nov 26, 2020

UTF-8 has an even stronger guarantee: If a byte sequence at any position in a UTF-8 string matches the byte sequence of a UTF-8 encoding of a Unicode code point then that part of the string represents that code point. This means you can not only use standard C functions like strchr with UTF-8 strings and ASCII characters, but you can also use e.g. strstr to find UTF-8 substrings in UTF-8 strings.
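A small sketch of that guarantee in practice, assuming both strings are valid UTF-8 (utf8_find is a made-up wrapper name; the work is done by plain strstr):

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Byte offset of the first occurrence of needle in haystack, or -1.
 * Both are assumed to be valid UTF-8. Plain strstr is enough: lead bytes
 * and continuation bytes never overlap, so a byte-level match of a valid
 * needle always starts and ends on code point boundaries. */
long utf8_find(const char *haystack, const char *needle)
{
    assert(haystack != NULL && needle != NULL);
    const char *hit = strstr(haystack, needle);
    return hit ? (long)(hit - haystack) : -1;
}
```

The result is a byte offset, not a character index. Searching "naïve café" for "é" finds the bytes c3 a9 at byte offset 10, although it is the tenth character.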

Xophmeister · on Nov 26, 2020

Bytes are bytes. We’re not debating whether it’s easier to write a UTF-8 decoder; I’m asserting that (almost?) any data can be represented as a sequence of bytes and UTF-8 is not special in that regard.

Brian_K_White · on Nov 26, 2020

...merely, a[n] is not necessarily the nth character.

Up to you to make sense of the data in singles, pairs, fours, and before that to declare the buffer as the appropriate multiplier plus 1.

globular-toast · on Nov 26, 2020

And what good is that? All computer memory is just an array of addressable bytes. So all you've said is you can store UTF-8 strings in memory. You still can't do random access on the string (ie. s[i] will not give you the ith character).

account42 · on Nov 26, 2020

For any definition of a character that is useful to anyone except a text shaping engine, neither will s[i] with UCS-4 (and definitely not with UTF-16).

globular-toast · on Nov 26, 2020

Erm... ASCII? The original comment in this thread was asking whether this supports anything but ASCII.

Google234 · on Nov 26, 2020

You can fit 32 bits inside 4 8 bit chars...

__d · on Nov 26, 2020

The most valuable property of UTF-8 from a C-string point of view is that it guarantees there are no embedded NULs in a UTF-8 string.

If you naively put the bytes of UTF-16 or UTF-32 encodings into a buffer, they might contain NUL (zero) byte values. Which, for C strings, means "end of string". UTF-8 makes sure this doesn't happen, which makes it compatible with existing C string functions.

aidenn0 · on Nov 25, 2020

See my link in sibling comment; the library supports decoding arbitrary encodings to unicode, even those with embedded NULLs.

fjfaase · on Nov 25, 2020

There is some support for reading a string as encoded in the current program locale. See: https://github.com/maxim2266/str#unicode-support

aidenn0 · on Nov 25, 2020

https://github.com/maxim2266/str#unicode-support

reactordev · on Nov 26, 2020

> This is the good old C language, not C++ or Rust, so nothing can be enforced on the language level, and certain discipline is required to make sure there is no corrupt or leaked memory resulting from using this library.

I loved this. “This ain’t your fancy schmancy Tesla, it’s granddaddy’s old Ford pickup”.

keyle · on Nov 26, 2020

I loved that

> Disclaimer: This is the good old C language, not C++ or Rust, so nothing can be enforced on the language level, and certain discipline is required to make sure there is no corrupt or leaked memory resulting from using this library.

dvfjsdhgfv · on Nov 25, 2020

I'm a bit surprised that C11 support is needed. When you write a library like this, you usually aim for compatibility. There is a lot of ANSI C code around, including popular projects like SQLite. Yet I don't really see much of C11 features in this code except C++-style comments and inline functions that could be solved with simple #ifdefs.

souprock · on Nov 26, 2020

C++ comments and inline functions arrived with C99, which was 21 years ago, not C11. With C11, itself now an obsolete standard from 9 years ago, we get the _Generic keyword.

Avoiding the _Generic keyword is difficult. One might try using the sizeof operator.

rurban · on Nov 25, 2020

Ownership is nice, but how can a zero terminated buffer library still call itself string? We have unicode for a while. UTF-8 only, not less. Strings must support unicode.

This impacts sort and comparisons mostly. But without cmp you cannot search in strings.

flohofwoe · on Nov 25, 2020

UTF-8 strings can be compared with strcmp(), you just can't get alphabetical sorting out of it. Most other str*() functions also work with UTF-8 encoded strings, you just need to know what to expect (e.g. splitting with strtok() works as long as the delimiters are all 7-bit ASCII chars, etc...).

deathanatos · on Nov 25, 2020

> UTF-8 strings can be compared with strcmp()

No, they can't? These two UTF-8 byte sequences in a C char pointer,

  c3 a9 00
  65 cc 81 00

Represent the same string, but do not compare equal with strcmp.

And it's not just that; you've noted how strtok will break down. strchr() can't be used w/ a non-ASCII needle, there is no support for code units, etc.
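The two byte sequences above, written out as a sketch in C. The constant and function names are only for illustration.

```c
#include <assert.h>
#include <string.h>

/* U+00E9 (precomposed e-acute) and U+0065 U+0301 (e + combining acute)
 * render identically but are different byte sequences. */
const char E_ACUTE_PRECOMPOSED[] = "\xc3\xa9";
const char E_ACUTE_DECOMPOSED[]  = "e\xcc\x81";

/* Byte-wise equality, which is all strcmp can offer. */
int same_bytes(const char *a, const char *b)
{
    assert(a != NULL && b != NULL);
    return strcmp(a, b) == 0;
}
```

Calling same_bytes on the two constants returns 0. Getting them to compare equal means normalizing both to the same form first, which requires a Unicode library and is outside what string.h provides.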

drran · on Nov 26, 2020

These two strings will be equal after normalization and validation of UTF-8.

flohofwoe · on Nov 26, 2020

That's a problem with the UNICODE standardization process, and not a problem with the UTF-8 encoding though.

account42 · on Nov 26, 2020

> you just can't get alphabetical sorting out of it

Lexicographical sorting over UTF-8 strings is actually the same as lexicographical sorting over the corresponding Unicode code point sequence.
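A quick sketch of that property using strcmp directly (byte_order is an invented helper that just normalizes the sign of the result):

```c
#include <assert.h>
#include <string.h>

/* Sign (-1, 0 or 1) of a byte-wise comparison. strcmp compares bytes as
 * unsigned char, which is what makes this work for bytes >= 0x80. */
int byte_order(const char *a, const char *b)
{
    assert(a != NULL && b != NULL);
    int r = strcmp(a, b);
    return (r > 0) - (r < 0);
}
```

For example, "z" (U+007A), "é" (U+00E9), "€" (U+20AC) and U+1F600 sort in that order both by code point and by their UTF-8 bytes (7a, c3 a9, e2 82 ac, f0 9f 98 80). This is code point order only, not the alphabetical order any particular language expects.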

GoblinSlayer · on Nov 26, 2020

AFAIK, comparison is language dependent, but how do you tell the string's language and how do you compare strings from different languages and multilingual strings?

qwerty456127 · on Nov 26, 2020

There is the Unicode Collation Algorithm standard to address this. An example of its implementation is the utf8_unicode_ci collation in MySQL.

topspin · on Nov 25, 2020

> zero terminated buffer library

Indeed.

squid_demon · on Nov 26, 2020

Maybe I'm missing something but it doesn't look like memory allocation was carefully considered in this library. For example, there are no custom allocators?

Uptrenda · on Nov 26, 2020

Looks useful. Should make it single-header / single-file too. Regex would also be good to have.

jdright · on Nov 25, 2020

What about support for string interning, any plans?