Str: Yet another string library for C language (github.com/maxim2266)
82 points by SiempreViernes on Nov 25, 2020 | 95 comments


C2x is expected to gain the char8_t type and almost certainly will not gain any new string handling routines. In the year 2030 I am expecting to still see more rounds of posts comparing string libraries for C. With more than 4 decades of development, we still don't have a great solution to string handling in C.

Every library like this is incompatible with the others, and most make slightly odd choices; in this library, for example, ownership of the string is denoted by a bit in the info/size field. Not that that is a bad choice or anything, but it is one reason someone might decline to use it and decide to write their own.

The lack of namespacing in C doesn't help: this library chooses str_ as its prefix, which is fairly likely to collide with other libraries. It also makes it harder to write libraries that allow the string library to be switched out.


Maybe we don't have a single solution to string handling in C because there's no such thing as a "string". A text editor would need a very different "string" than a typesetting system or an indexing engine.


> almost certainly will not gain any new string handling routines

It looks like it'll be getting strdup and strndup.


Thanks for that! I hadn't seen it yet, https://en.cppreference.com/w/c/string/byte/strdup says it will be C23 as well.


Fun fact: 'str*' is also reserved by POSIX for future extensions. This means that this library completely collides with POSIX.


Note that ISO C only reserves /^str[a-z]/ for <string.h> and <stdlib.h>. POSIX's reservation of /^str_/ (note that identifiers like str8line are still allowed) is for STREAMS (<stropts.h>) [1], a completely different thing from strings...

[1] https://en.wikipedia.org/wiki/STREAMS


IMHO it's better to ignore "universal string processing" in the C standard completely instead of providing a half-assed and over-complicated solution like in most other languages. String processing isn't exactly a core-competency of C, and never will be, it's better to use a different language for such things.


It’s getting memccpy, which is an extremely important function for the creation of a performant, safe string copying method.
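As a rough illustration of the kind of copy routine memccpy enables, here is a minimal sketch. The name copy_str and its return convention are made up for this example, and it is only as safe as the dstsize the caller passes in.

```c
#define _XOPEN_SOURCE 700  /* exposes memccpy on pre-C23 toolchains */
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not part of any standard: copy src into dst, which
 * has room for dstsize bytes, and always NUL-terminate the result.
 * Returns 0 if all of src fit and -1 if it had to be truncated. */
int copy_str(char *dst, size_t dstsize, const char *src)
{
    assert(dst != NULL && src != NULL && dstsize > 0);

    /* memccpy stops after copying the first '\0' or after dstsize bytes,
     * whichever comes first, and returns NULL in the second case. */
    if (memccpy(dst, src, '\0', dstsize) != NULL)
        return 0;               /* the terminator was copied: src fit */

    dst[dstsize - 1] = '\0';    /* truncated: terminate by hand */
    return -1;
}
```

Unlike strncpy, this makes a single pass, never pads the rest of the buffer with zeros, and reports truncation to the caller.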


It is impossible to be safe if size is a function argument that cannot be validated without hardware support.


My definition of safety likely differs from yours.


My definition of safety means having a size greater than the actual string doesn't turn an innocent looking call into a CVE database entry.

I bet the security industry agrees with my definition.


Isn’t a function like that trivial to implement?


I think it is good if everybody implements their own string library.

It builds character and you learn something from it.

It is a rite of passage.

However, don't add to the pile of dependency hell that is already plaguing many open source projects. If you feel uneasy with how C strings work, consider switching programming languages instead. You will probably have an easier time and there will be less unmaintained incompatible string libraries rotting around on github.


> However, don't add to the pile of dependency hell that is already plaguing many open source projects

If you create small libraries that don't produce shared objects intended to stand on their own with a stable API/ABI, but are simply headers or at most produce a .a from source fully intended to become vendored in-tree, you're not contributing to "dependency hell".


In theory this vendored library might show up multiple times in a dependency tree, in versions that are incompatible with each other.


It's hardly a "dependency hell" when it only affects developers, in what are essentially unique collision type situations, and are generally addressable by the developer because the source is all present. And upstream maintainers of such intended-to-be-vendored code should generally be receptive to improving compatibility and build system configurability for such situations. And if they're not receptive/it's abandonware, congratulations your vendored library is now a fork and fix it yourself.

When we refer to "dependency hell", AIUI, it's in reference to unresolvable runtime dependencies creating hell for end-users.


> It builds character

That's awful and I love it.


> I think it is good if everybody implements their own string library.

...until someone exploits the bugs in it.

Everyone who did the exercises in K&R should be able to write their own string library, probably with less bugs than the standard one. However, I really feel it's much better for everyone to use proven code like bstring.


> probably with less bugs than the standard one

Fewer bugs than what “standard one”?


Whatever implements string.h on your system, and other functions dealing with string input. Some of these functions simply shouldn't be used at all. An extreme case is gets() that was phased out, but many others are no better.


The string functions in your system are almost certainly less buggy than anything you’re going to write.


You're kidding, right? We're not talking about the implementation, but the design. If I ever wanted to write a gets() replacement, it would definitely have proper checks in place to prevent buffer overflow. Everyone using strcpy() is playing with fire. You'll get it right 9 times and make a mistake the 10th time. It's not that the people who implemented these functions are stupid, but they were designed in different times for other types of environments.


> probably with less bugs than the standard one… We're not talking about the implementation, but the design.

That's not how most people use the word "bug".


Literally from the man page of gets():

BUGS: Never use gets().


bstring is allocated on heap, so slicing requires allocation.


Every time I think to use C for something, I re-realize how terrible it is to do anything involving strings. Although this library looks nice, I’ll still have to compose and manage them myself, which is a major headache.


When I review C code, I look for strncpy, etc., and give them special attention. There's always a bug or two in it.

0-terminated strings not only have proven to be a rich source of bugs, they're remarkably inefficient as well [1]. Doing better was a major focus of the initial design of D.

[1] This is because of constantly scanning to get the length (which also necessitates reloading the string contents into the memory cache), and having to make copies of strings instead of just slicing them.


Well you don't have to make copies of strings to slice them, just ask strtok! </s>


If you are using strtok and want to slice at an arbitrary point, you would either have to lose some data in the original string, move or copy the original string to make space for the extra delimiter/null character, or have set up the string ahead of time to contain the delimiter in the desired position.


There is also the mangle-use-repair choice. I've done that with pathnames for creating nested directories.

C programmers are expected to make the best choice based on the situation. The various choices trade off memory usage, CPU usage, source code readability, and program correctness.
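For readers who have not seen mangle-use-repair, a minimal sketch of the pathname case might look like this. The names for_each_prefix and count_prefix are hypothetical; a real "create nested directories" routine would call something like mkdir() from the visitor.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical sketch: call visit() once for every leading directory prefix
 * of path ("a", then "a/b", ...) and finally for path itself. Each '/' is
 * temporarily overwritten with '\0' and then restored, so path must be
 * writable and must not be used by another thread during the call. */
void for_each_prefix(char *path,
                     void (*visit)(const char *prefix, void *ctx),
                     void *ctx)
{
    assert(path != NULL && visit != NULL);

    if (path[0] == '\0')
        return;

    /* Start at path + 1 so a leading '/' does not yield an empty prefix. */
    for (char *p = path + 1; *p != '\0'; p++) {
        if (*p != '/')
            continue;
        *p = '\0';          /* mangle */
        visit(path, ctx);   /* use    */
        *p = '/';           /* repair */
    }
    visit(path, ctx);
}

/* Example visitor: counts how many prefixes were seen. */
void count_prefix(const char *prefix, void *ctx)
{
    (void)prefix;
    ++*(int *)ctx;
}
```

The trade-off is visible in the comment on for_each_prefix: no allocation and no copy, at the price of requiring a writable, unshared buffer.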


> There is also the mangle-use-repair choice.

Which is problematic for thread safety and depending on the source of the string (constant) may not be possible.


It's not problematic. C programmers are expected to avoid screwing that up. C is a full-power language.

If available, strdupa() would be a fine way to get a suitable local copy of the string. Commonly though, the programmer knows that there will not be threads and can make the string non-constant.


I encountered Hollerith constants in an ancient Fortran codebase I worked on and was thrilled to see folks were doing clever stuff with strings in the 60s.

I wonder how much time was wasted in early computing (maybe not wasted really) because of the fear of incompatibility that is getting smaller and smaller as computing platforms coalesce into standardized-ish things.

Watching the M1 roll out and how it doesn't seem to care much that x86 is a thing and gets along with its life has been fascinating.


How is the support for strings in C++? Presumably better than in C, but is it good enough when compared to other compiled languages - Go, Rust etc?


D uses "phat pointers" for strings, aka a length/pointer pair. Over the years, this has proven to be simple, efficient, and resistant to errors. It means array bounds checking can be automatically done. It enables efficient slicing.

String literals also have an extra 0 appended, making it transparently easy to still pass strings to C functions like printf.
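For comparison, the same length/pointer idea can be sketched in C, minus the language support. The type and function names below are illustrative only, and the bounds check is a plain assert rather than anything the compiler inserts for you.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical length/pointer pair. */
typedef struct {
    const char *ptr;
    size_t len;
} slice;

slice slice_from_cstr(const char *s)
{
    assert(s != NULL);
    slice r = { s, strlen(s) };   /* the only scan for the terminator */
    return r;
}

/* Sub-slice covering [from, to) with a bounds check; no bytes are copied. */
slice slice_sub(slice s, size_t from, size_t to)
{
    assert(from <= to && to <= s.len);
    slice r = { s.ptr + from, to - from };
    return r;
}

/* Equality needs no scan for '\0': compare lengths first, then bytes. */
int slice_eq(slice a, slice b)
{
    return a.len == b.len && memcmp(a.ptr, b.ptr, a.len) == 0;
}
```

A slice is generally not NUL-terminated, so printing one needs something like printf("%.*s", (int)w.len, w.ptr) rather than a bare %s.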


>"phat pointers"

I don't know if that is a typo given you normally call them "fat pointers", but they are "pretty hot and tempting".


C++ has string support in the standard library.

It doesn't have the same breadth of features as, say, Python's string class, but it's ok.

See, eg. https://en.cppreference.com/w/cpp/string


In my opinion, still a bit hairy, the reason something like nowide exists [1].

1: https://www.boost.org/doc/libs/develop/libs/nowide/doc/html/...


How is the support of unicode in c++?


C++ itself doesn't support it. There are libraries that provide unicode-aware handling of strings/vectors of bytes. It's not always clear that you want unicode-aware code when dealing with unicode, but there are times when it is nice to have.


It's exceedingly verbose, but decent, if you have a recent language version and/or Boost.


Gosh this evoked a keen sense of nostalgia. I don't miss C strings at all. Along with segmentation fault, core dumped. Ruined many a night!


I know. This and the fact that compilers never bitch about a single = in if() statements really take time out of my life...


Most C compilers will give a warning for that. D makes it an error in the grammar. `a < b < c` is also an error in the grammar.


I gave up trying to teach C after 2 years teaching it at university. You’ve been at it, what? 25+ years? Mad props. Thanks for great C/++ compilers, and double-thanks for D!


Always interesting with C posts! Two notes:

- I'm pretty sure public symbols starting with "str" are reserved by the standard.

- Declaring function arguments as const is pretty silly for value types, imo.


Function names starting with "str" followed by a lowercase letter are reserved. So technically "str" itself and "str_" are not.


Doesn’t it stop you modifying them inside the function?


I'm not sure about C but at least in C++ having const on the prototype is meaningless as you can still have the arguments as non-const in the actual definition. Considering that C is usually less strict with these things I'd expect that to be the case there too.
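A small C example of the same point; the function twice is made up for illustration. C also ignores top-level qualifiers on parameters when it checks whether two declarations of a function are compatible, so the prototype and the definition may disagree about const.

```c
#include <assert.h>
#include <limits.h>

/* Prototype without const ... */
int twice(int x);

/* ... and a definition with const. The two are compatible declarations of
 * the same function; const here only stops the body from assigning to its
 * own copy of the argument and is invisible to callers. */
int twice(const int x)
{
    assert(x >= INT_MIN / 2 && x <= INT_MAX / 2);   /* avoid signed overflow */
    /* x = 0;  would be rejected by the compiler here */
    return 2 * x;
}
```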


String isn't a value type.


Yes it is, it's a small struct.


It would be a pity not to mention here the Simple Dynamic String (SDS) library made by the maker of Redis : https://github.com/antirez/sds

It is also very well documented. And all you need to embed it in your project is : sds.c sds.h sdsalloc.h

The source code is small and every C99 compiler should deal with it without issues.


It seems that everyone implementing their own string library (including, eh, antirez) thinks masquerading pointers is cute, but in my opinion and experience it's very dangerous because it requires a specific coding convention that can't be checked by compilers. SDS is no exception to this problem:

    sds a = sdsnew("hell");
    sds b = a;
    a = sdscat(a, "o"); // this invalidates b
Masqueraded pointers are inherently linear (or affine if you are pedantic). Any length-changing updates to such pointers can potentially reallocate them, so any value can't be "updated" more than once; values should be consumed and returned by many operations. No typical C types behave like this: primitive values or structs can be updated by assignments and pointers can be updated by dereference. C doesn't support linear types and, while normal pointers do need care, masqueraded pointers need much more care to use correctly. Yes, you can replicate the same bug with normal pointers by replacing the third line with `free(a);`, but you wouldn't expect a bug for non-destructive operations. (Put another way, masqueraded pointers make many otherwise non-destructive operations destructive.)

While technically not a string library, this and the strict-aliasing issue for type-generic routines prompted me to write my own small extensible vector library [1] years ago.

[1] https://gist.github.com/lifthrasiir/4422136


Interesting: this library makes use of C generics, so you can str_join a str into a str or into a FILE.
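For anyone unfamiliar with C11 generic selection, dispatch on the destination type can be sketched roughly as follows. The names buf, put_buf, put_file and put_any are hypothetical and are not the library's API.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical fixed-size buffer used as one possible destination. */
typedef struct {
    char data[64];
    size_t len;
} buf;

/* Append s to a buf, silently truncating if it does not fit.
 * Returns the number of bytes appended. */
int put_buf(buf *b, const char *s)
{
    assert(b != NULL && s != NULL && b->len < sizeof b->data);
    size_t n = strlen(s);
    size_t room = sizeof b->data - 1 - b->len;
    if (n > room)
        n = room;
    memcpy(b->data + b->len, s, n);
    b->len += n;
    b->data[b->len] = '\0';
    return (int)n;
}

/* Write s to a stream. Returns the number of bytes written or -1. */
int put_file(FILE *f, const char *s)
{
    assert(f != NULL && s != NULL);
    return fputs(s, f) == EOF ? -1 : (int)strlen(s);
}

/* _Generic selects the function from the static type of dest. */
#define put_any(dest, s) _Generic((dest), \
        buf *:  put_buf,                  \
        FILE *: put_file)((dest), (s))
```

The selection happens at compile time, so passing a destination of any other type is a compile error rather than a run-time failure.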


oof bit packing to save a single bit?

https://github.com/maxim2266/str/blob/f4e84657b23977ab3c5cd7...

seems unlikely to matter if you have a bunch of strings flying around...

two features I'd love to see implemented:

- wrapping thread safe tokenization using strtok_r so it's pleasant to tokenize a string

- sprintf-like formatting

anything that improves string handling in C is doing God's work


It's not about saving a bit, but about saving an entire word. If the length and ownership weren't packed together, you'd need one more field in the str struct for the ownership bit, and due to alignment the minimum size increase would be a word.
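To make the size argument concrete, here is a sketch with two made-up layouts. Neither is copied from the library; which bit holds the flag is an assumption for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical packed layout: the low bit of info is the ownership flag
 * and the remaining bits are the length. */
typedef struct {
    const char *ptr;
    size_t info;
} packed_str;

/* The same data with a separate flag; alignment padding after owner
 * typically rounds the struct up by a whole word. */
typedef struct {
    const char *ptr;
    size_t len;
    bool owner;
} unpacked_str;

packed_str packed_make(const char *p, size_t len, bool owner)
{
    assert(len <= ((size_t)-1 >> 1));   /* one bit is taken by the flag */
    packed_str s = { p, (len << 1) | (owner ? 1u : 0u) };
    return s;
}

size_t packed_len(packed_str s)   { return s.info >> 1; }
bool   packed_owner(packed_str s) { return (s.info & 1u) != 0; }
```

On a typical 64-bit target sizeof(packed_str) is 16 and sizeof(unpacked_str) is 24, which is the extra word mentioned above. The cost is that the maximum length is halved and every length read needs a shift.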


strtok_r() destructively modifies its input so wrapping it works against this library's objective of using const pointers. It is easy enough to reimplement, though, and I've done this for similar sub-string pointer objects that work without NUL termination.
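A non-destructive reimplementation along those lines might look like the sketch below. The span type and next_token are hypothetical names, not taken from this library or from the commenter's code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical pointer/length view of a sub-string. */
typedef struct {
    const char *ptr;
    size_t len;
} span;

/* strtok-like iterator over a span. On success it stores the next
 * non-empty token in *tok, advances *rest past it and returns true.
 * The input bytes are never modified and need not be NUL-terminated. */
bool next_token(span *rest, const char *delims, span *tok)
{
    assert(rest != NULL && rest->ptr != NULL && delims != NULL && tok != NULL);

    const size_t nd = strlen(delims);
    const char *p = rest->ptr;
    const char *end = rest->ptr + rest->len;

    while (p < end && memchr(delims, *p, nd) != NULL)   /* skip delimiters */
        p++;

    const char *start = p;
    while (p < end && memchr(delims, *p, nd) == NULL)   /* scan the token */
        p++;

    rest->ptr = p;
    rest->len = (size_t)(end - p);
    if (p == start)
        return false;               /* nothing left but delimiters */

    tok->ptr = start;
    tok->len = (size_t)(p - start);
    return true;
}
```

Because all state lives in the caller's span, the function is re-entrant and works on string literals and other read-only data.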


The existence of strtok_r() is weird anyway. If we can make errno thread-safe, there is no reason why plain strtok() can't be thread-safe. The idea that somebody wants strtok() state shared across threads is just as weird as the idea that errno should be shared across threads.


The C API was never designed with threading in mind. Not all platforms support threads so the old behavior must stay around.


Say what? You like to call plain strtok() from different threads, having them all update a shared global state? I think you missed the point here, because that would be some really evil usage of the strtok() function.

We had "int errno" as global state. We fixed it, in a compatible way, to be thread-safe. Platforms without threads can still implement it the old way if desired.

The same kind of compatible fix could have been done with the strtok() function. There was no need to introduce another function.

Simply: the internal state of strtok() shall be distinct for each thread. (which is trivial if the platform only supports a single thread)
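As a sketch of what per-thread state could look like, here is a wrapper with strtok()'s calling convention built on strtok_r() and C11 thread-local storage. The name tok_tls is made up (and deliberately does not start with "str" plus a lowercase letter).

```c
#define _POSIX_C_SOURCE 200809L  /* exposes strtok_r */
#include <assert.h>
#include <string.h>

/* Hypothetical replacement with the same interface as strtok(), except
 * that the saved position is _Thread_local, so each thread tokenizes
 * independently. Like strtok(), it still modifies the input string. */
char *tok_tls(char *s, const char *delims)
{
    static _Thread_local char *saved;
    assert(delims != NULL);
    return strtok_r(s, delims, &saved);
}
```

This only removes the cross-thread sharing. Interleaving two tokenizations within one thread still needs explicit state as with strtok_r().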


Strings of ASCII characters only? Or can this library be used with Unicode as well? Just asking.


Depends what you mean by Unicode.

All the UTF-8 codepoints can be held inside an 8bit char, which is what this library seems to use under the covers.

You might need to add a couple UTF-specific methods if you want number of graphemes rather than number of bytes, but there's nothing to stop you placing UTF8 data inside a char buffer.
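One such UTF-specific helper is easy to sketch: counting code points in a char buffer. Note that this counts code points, not graphemes, and it assumes the input is already valid UTF-8; the function name is made up.

```c
#include <assert.h>
#include <stddef.h>

/* Counts Unicode code points in a NUL-terminated string that is assumed to
 * be valid UTF-8. Continuation bytes have the bit pattern 10xxxxxx, so
 * every byte that does not match it starts a new code point. */
size_t utf8_codepoints(const char *s)
{
    assert(s != NULL);
    size_t n = 0;
    for (; *s != '\0'; s++) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            n++;
    }
    return n;
}
```

Counting graphemes (user-perceived characters) requires the Unicode segmentation rules and their data tables, which is a job for a dedicated Unicode library rather than a few lines of C.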


I don’t know what you mean by “all the UTF-8 codepoints can be held inside an 8bit char”. All Unicode codepoints obviously cannot be held in 8-bits. The UTF-8 encoding matches ASCII over the first 7-bits, but that’s not relevant. You can UTF-8 encode Unicode codepoints into a bunch of 8-bit chars, but then you can encode anything you like into a bunch of 8-bit chars; a JPEG file for instance.


> All Unicode codepoints obviously cannot be held in 8-bits.

I mentioned UTF-8 specifically, because the UTF-8 encoding actually does specify this particular feature:

> UTF-8 is capable of encoding all 1,112,064 valid character code points in Unicode using one to four one-byte (8-bit) code units. Code points with lower numerical values, which tend to occur more frequently, are encoded using fewer bytes. [0]

[0] https://en.wikipedia.org/wiki/UTF-8


>Unicode using one to four one-byte

UTF-8 is a variable length encoding, some characters need 8 bits, other characters need 32 bits per your own quote.


> one to four one-byte (8-bit) code units

And it is designed so that it fits in one-byte units. That was a central goal of the encoding.

So... If you have a char buffer, like we have been talking about, you can toss any valid UTF-8 sequence inside it.


It’s very rare that any data encoding doesn’t come in a multiple of 8-bits. Have you ever seen a 1.5 byte (12-bit) file? UTF-8 isn’t special in that regard. You can put literally anything into a char buffer and decode it how you like.


The issue isn't whether or not your character encoding is always a multiple of 8 bits. It is whether or not you can use standard (octet-focused) parsing functions to deal with those strings. This is what makes utf-8 "special". No byte of a utf-8 multibyte sequence will ever have a value below 128, so none of them can be mistaken for an ASCII character. So for most "syntactic" parsing problems, you can use standard C functions to deal with utf-8 strings - something that is not true with most other multibyte character encodings.


UTF-8 has an even stronger guarantee: If a byte sequence at any position in a UTF-8 string matches the byte sequence of a UTF-8 encoding of a Unicode code point then that part of the string represents that code point. This means you can not only use standard C functions like strchr with UTF-8 strings and ASCII characters, but you can also use e.g. strstr to find UTF-8 substrings in UTF-8 strings.
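A small sketch of that guarantee in practice, assuming both arguments are valid UTF-8. The wrapper name utf8_find is hypothetical; the work is done by plain strstr.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Byte offset of the first occurrence of needle in hay, or -1 if absent.
 * Both strings are assumed to be valid UTF-8. Plain strstr is enough
 * because an encoded code point can never match starting in the middle
 * of another encoded code point. */
long utf8_find(const char *hay, const char *needle)
{
    assert(hay != NULL && needle != NULL);
    const char *hit = strstr(hay, needle);
    return hit == NULL ? -1 : (long)(hit - hay);
}
```

For example, searching "caf\xC3\xA9 \xE2\x82\xAC" (the text "café €") for "\xE2\x82\xAC" yields byte offset 6, which is the sixth byte, not the sixth character, counted from zero.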


Bytes are bytes. We’re not debating whether it’s easier to write a UTF-8 decoder; I’m asserting that (almost?) any data can be represented as a sequence of bytes and UTF-8 is not special in that regard.


...merely, a[n] is not necessarily the nth character.

Up to you to make sense of the data in singles, pairs, fours, and before that to declare the buffer as the appropriate multiplier plus 1.


And what good is that? All computer memory is just an array of addressable bytes. So all you've said is you can store UTF-8 strings in memory. You still can't do random access on the string (ie. s[i] will not give you the ith character).


For any definition of a character that is useful to anyone except a text shaping engine, neither will s[i] with UCS-4 (and definitely not with UTF-16).


Erm... ASCII? The original comment in this thread was asking whether this supports anything but ASCII.


You can fit 32 bits inside 4 8 bit chars...


The most valuable property of UTF-8 from a C-string point of view is that it guarantees there are no embedded NULs in a UTF-8 string.

If you naively put the bytes of UTF-16 or UTF-32 encodings into a buffer, they might contain NUL (zero) byte values. Which, for C strings, means "end of string". UTF-8 makes sure this doesn't happen, which makes it compatible with existing C string functions.
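A tiny demonstration of the difference, using the text "Hi" written out by hand in both encodings. The array and function names are made up for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* "Hi" as raw bytes, without any terminator, in two encodings. */
const unsigned char hi_utf8[]    = { 'H', 'i' };
const unsigned char hi_utf16le[] = { 'H', 0x00, 'i', 0x00 };

/* True if the byte sequence contains a zero byte, which C string
 * functions would take as the end of the string. */
bool has_embedded_nul(const void *bytes, size_t n)
{
    assert(bytes != NULL);
    return memchr(bytes, 0, n) != NULL;
}
```

Handing the UTF-16LE bytes to strlen() gives 1, because the scan stops at the zero byte that follows 'H'.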


See my link in sibling comment; the library supports decoding arbitrary encodings to unicode, even those with embedded NULLs.


There is some support for reading a string as encoded in the current program locale. See: https://github.com/maxim2266/str#unicode-support



> This is the good old C language, not C++ or Rust, so nothing can be enforced on the language level, and certain discipline is required to make sure there is no corrupt or leaked memory resulting from using this library.

I loved this. “This ain’t your fancy schmancy Tesla, it’s granddaddy’s old Ford pickup”.


I loved that

> Disclaimer: This is the good old C language, not C++ or Rust, so nothing can be enforced on the language level, and certain discipline is required to make sure there is no corrupt or leaked memory resulting from using this library.


I'm a bit surprised that C11 support is needed. When you write a library like this, you usually aim for compatibility. There is a lot of ANSI C code around, including popular projects like SQLite. Yet I don't really see many C11 features in this code except C++-style comments and inline functions, which could be handled with simple #ifdefs.


C++ comments and inline functions arrived with C99, which was 21 years ago, not C11. With C11, itself now an obsolete standard from 9 years ago, we get the _Generic keyword.

Avoiding the _Generic keyword is difficult. One might try using the sizeof operator.


Ownership is nice, but how can a zero terminated buffer library still call itself string? We have had unicode for a while. UTF-8 only, not less. Strings must support unicode.

This impacts sort and comparisons mostly. But without cmp you cannot search in strings.


UTF-8 strings can be compared with strcmp(), you just can't get alphabetical sorting out of it. Most other str*() functions also work with UTF-8 encoded strings, you just need to know what to expect (e.g. splitting with strtok() works as long as the delimiters are all 7-bit ASCII chars, etc...).


> UTF-8 strings can be compared with strcmp()

No, they can't? These two UTF-8 byte sequences in a C char pointer,

  c3 a9 00
  65 cc 81 00
Represent the same string, but do not compare equal with strcmp.

And it's not just that; you've noted how strtok will break down. strchr() can't be used w/ a non-ASCII needle, there is no support for code units, etc.
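The two byte sequences above written out as C, for anyone who wants to try it. The identifiers are made up; the first string is the precomposed U+00E9 and the second is 'e' followed by the combining acute accent U+0301.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Precomposed form (c3 a9) and decomposed form (65 cc 81) of e-acute. */
const char e_acute_nfc[] = "\xC3\xA9";
const char e_acute_nfd[] = "e\xCC\x81";

/* Byte-wise equality, which is all strcmp can offer. */
bool bytes_equal(const char *a, const char *b)
{
    assert(a != NULL && b != NULL);
    return strcmp(a, b) == 0;
}
```

bytes_equal(e_acute_nfc, e_acute_nfd) is false. Treating the two as equal requires bringing both to the same Unicode normalization form first, which the C standard library does not provide.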


These two strings will be equal after normalization and validation of UTF-8.


That's a problem with the UNICODE standardization process, and not a problem with the UTF-8 encoding though.


> you just can't get alphabetical sorting out of it

Lexicographical sorting over UTF-8 strings is actually the same as lexicographical sorting over the corresponding Unicode code point sequence.
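A quick way to convince yourself, sketched with four hand-encoded strings whose code points span all four UTF-8 lengths. The helper name is hypothetical, and the property relies on strcmp comparing bytes as unsigned char, which the C standard requires.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Four strings in increasing code point order:
 * U+007A, U+00E9, U+20AC, U+10000. */
const char *const by_codepoint[] = {
    "z",
    "\xC3\xA9",
    "\xE2\x82\xAC",
    "\xF0\x90\x80\x80",
};

/* Returns 1 if v[0..n) is strictly increasing under strcmp, else 0. */
int sorted_by_strcmp(const char *const *v, size_t n)
{
    assert(v != NULL);
    for (size_t i = 1; i < n; i++) {
        if (strcmp(v[i - 1], v[i]) >= 0)
            return 0;
    }
    return 1;
}
```

This is code point order only. It says nothing about the order a human reader of any particular language would expect.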


AFAIK, comparison is language dependent, but how do you tell the string's language and how do you compare strings from different languages and multilingual strings?


There is the Unicode Collation Algorithm standard to address this. An example of its implementation is the utf8_unicode_ci collation in MySQL.


> zero terminated buffer library

Indeed.


Maybe I'm missing something but it doesn't look like memory allocation was carefully considered in this library. For example, there are no custom allocators?


Looks useful. Should make it single-header / single-file too. Regex would also be good to have.


What about support for string interning, any plans?



