Hacker News new | past | comments | ask | show | jobs | submit login
Simple Dynamic Strings library for C, compatible with null-terminated strings (github.com)
105 points by pcr910303 3 days ago | hide | past | web | favorite | 83 comments





Hi, author here. May make sense to make SDS in perspective given a few comments I'm reading here.

1. Yep, more than "strings" SDS may be consider a library for dynamic buffers, especially from people coming from C++ or higher level languages. However I think that for C, it makes sense to provide a very low level thing like that.

2. In practice, if you see how SDS is used (extensively) inside Redis, it normally models things where you would do realloc magics, pointer math, and so forth: from client output buffers, to actual strings to accumulate error messages, to the Redis String object itself.

3. Probably an UTF-32 layer could be implemented on top of that, but I would think the two modules completely separated from the POV of the implementation. Like having an additional utf32.c file that provides additional interfaces on top of SDS.

4. The peculiar approach used by SDS header-before-pointer creates some problem with Valgrind and similar tools, however Valgrind will just report "possibly lost", that are messages mostly safe to ignore. However the advantage of using SDS strings directly with C function libraries is very handy.

SDS strings are not perfect and tend to be extremely optimized for Redis, if you look at the API, they allow to do things that are very low level, like pre-allocating the internal buffers to improve performances when you know you are going to read a big chunk of data from a socket or alike. However while imperfect and very tuned, these kind of libraries show how much you can easily improve C, with little work, and how many unsafe things in C are about lack of abstractions.

The value of SDS is "less is more" in the simple API they provide that pretends that SDSs are just plain-strings++. You can see this in a few features: plain C pointers interface, and the policy of always terminate the string. They need more love anyway, and to be less specialized for Redis, adding more useful APIs. Maybe at some point I'll find the time.


> show how much you can easily improve C, with little work, and how many unsafe things in C are about lack of abstractions

C has always badly needed built-in strings (and arrays with size info, generally).

To save a byte, C designers committed The Most Expensive One-byte Mistake https://queue.acm.org/detail.cfm?id=2010365


Thanks for the link. Great analysis of a real sliding-doors moment.

A shortcoming (or outright failure) of C is the lack of an extensive "batteries included" library.

Maybe not C + Knuth, but some way of portably advancing the language over time.


Why? So we could be stuck with horrible interfaces forever (most of string.h comes to mind)? No thanks.

don't you think it could evolve?

glib

That would be a start, but my though is more closely related to the C standard, not an add-on like boost.

Whenever I've used functionality like this before, the inevitable problem is the string ends up with nulls in it, meaning C and the library disagree on the length of the string, which inevitably causes problems and corruption.

How do you handle this?


You use sdslen() instead of strlen(). Ditto for everything else. Failure to do so results in subtle bugs.

Then (in my opinion) the library shouldn't advertise itself as compatible with Null terminated C strings, as users will assume that means they can use strlen (or more likely, other functions which use strlen themselves).

In a similar system I work on, we have a list of functions (like strlen, Strcat), where we carefully vet every occurrence in the code base. It's annoying but the only way to stop subtle bugs we've found.


Redis doesn’t really have this problem because its mostly developed by Antirez.

Then why bother making these strings superficially compatible with C strings? Just printf or am I missing something?

The same reason std::string has the c_str() method. It’s convenient.

> Probably an UTF-32 layer could be implemented on top of that

UTF-32 is, in all but extreme niche cases, a supremely wasteful and inefficient way to store and process strings, even in light of the cost of variable-length codepoints. The longest UTF-8 sequence is now four bytes anyway, so UTF-8 is never ever less compact than UTF-32. The only conceivable practical use of UTF-32 is a case where the upper 11 bits are used to tag characters with metadata, but in that case you'll have to mask those to display them anyway.

In practice, if your application must be aware of codepoints, it most likely should be aware of extended grapheme clusters as well, and those have an arbitrary number of codepoints in them.

UTF-16 is closer to being useful in some locales, but the marginal benefit (one byte less per BMP CJK codepoint) is, in my view, outweighed by the incompatibility with ASCII. Even a small number of ASCII strings in a given transmission will wipe out any advantage UTF-16 could have in CJK locales, and most transmissions these days have enough ASCII to do so. For example, today's Japanese Wikipedia main page is 102,620 bytes in UTF-8, and 182,480 bytes in UTF-16, despite most of the visible graphemes being CJK characters (which require three bytes in UTF-8). The Korean Wikipedia main page has larger text, and less of it, so the difference is even more stark (81,072 bytes in UTF-8, and 146,630 bytes in UTF-16). Even after aggressive gzip compression (with zopfli), text-heavy UTF-16 Japanese and Korean HTML documents are about 10% larger than their UTF-8 equivalent; and this is before you've transmitted any CSS or JavaScript.


With UTF16, simple string handling code runs faster.

When you're iterating over characters, with UTF8 branch prediction mispredicts all the time. Characters in UTF-8 have random length from 1 to 3 bytes. This might be good for bandwidth, but the tradeoff is slower processing, all modern CPUs have branch prediction hardware, and deep pipelines.

With UTF-16 CPUs predict these branches all the time, surrogate pairs are extremely rare, the rest of languages are 2 bytes/character.

While it's technically possible to optimize UTF-8 with clever programming like manual SIMD, that code gonna be much more complex than just a while or for loop, i.e. more expensive to develop and harder to support.

I think that's the reason why all platforms with rich GUI ecosystem (Windows, OSX, iOS, Android, JavaScript) use UTF-16 almost exclusively.


Utf-32 allows for constant time indexing of a codepoint which may be important. Utf-16 is the worst of all worlds with its bloat and surrogate pair nonsense

Just looking at the API "sds" seems to just be a typedef for char* - unfortunately, that means that accidentally passing a char* as an sds into any of the functions will be instant UB and not even a compiler warning.

Considering this is C, there is no way to prevent this easily since you can't express a type that is one-way convertible (i.e. sds -> char* ok, char* -> sds not ok).

I do have to wonder though if avoiding a couple of .str's is really worth the risk. You definitely have to always keep in mind to never accidentally mix a char* with an sds, for the advantage of slightly cleaner looking code.


One of the main points of the library is the ability to pass SDSs where char* is expected without doing anything. So this is a "feature" in the author's spirit.

As another reply has suggested, you can design the wrapper struct so you can simply write:

  legacy_fn(my_text.s);
Versus the current:

  legacy_fn(my_text);
I don’t think saving two characters per legacy function call is even remotely worth the loss of static type safety (which risks serious memory corruption and/or security holes, which are entirely preventable at compile-time in this way).

In fact, I even find the explicitness more pleasantly and clearly readable: I like being able to know at-a-glance when types are changing, especially in a language as unsafe as C.

Lastly, if you can tolerate just using some of C++‘s features, you can define a no-compromise solution: A type (still represented by a single pointer under-the-hood) that will implicitly convert (with zero runtime cost) into a C string but not vice versa.


A main problem with C++ is that it uses .c_str() instead of something nice like .s for this.

What you do is something like the following:

    typedef struct sds_s {
        char data[0];
    } sds;
Which makes them essentially equivalent, but not from a type perspective. Then to handle conversions, you add explicit functions a la

    char* sds_cstr(sds *str);
Then you have full type checking to help you (and an additional advantage if your data format changes).

The downside to using a non-inlined conversion function is that (1) it will be slower than just accessing the member directly, and (2) it’s much more verbose.

Why not just use the member explicitly? This way, converting a STS string into a legacy C string could be as simple as writing “.cstr” (or if you just be as terse as possible, it could be defined as “.s” as another poster suggests).

In this case, compromise of increased code verbosity is extremely minor at worst (just a few characters). At best, this extra explicitness can actually be seen as a good thing for code readability (not to mention the huge benefits we’re discussing of type safety).

So the question then is: Why doesn’t SDS do this? The actual library uses a regular C typedef (which is unsafe for the reasons described above).


I think you're missing the point here, the conversion function is trivially inlinable:

    char* sds_cstr(sds *str){
        return &str[0]; //you could probably just cast too
    }
The other posters solution is actually not equivalent in this way (his sds is convertable to char* instead of sds*).

Even an inlined function only has value if our goal is to open the door to implementation changes in the future. But because the current implementation is zero-cost (an inlined no-op), any behavioral change will necessary compromise performance.

Otherwise (if we want to retain the zero-cost guarantee of converting to a C string), directly accessing the member variable is: functionally equivalent, simpler, more concise, more readable, more explicit.


I actually prefer the inline function here (while making the member char* private when compiling as C++), since that mostly prevents assignment to the member variable in settings where that would not be a valid operation.

With that said, I agree that the additional verbosity is a significant drawback, and I can understand going with the slightly-less-safe option for that reason alone.


Ah, that’s a good point. I’m a fan of extra safety, though we’d need to make a few extra changes to get around C’s lack of private variables: We could make the internal member instead a size_t, and cast to pointer form when needed inside the library or when converting via this inline function (which is still a no-op in terms of performance). The only major downside then remaining is perhaps syntax/verbosity, which IMO is secondary to safety.

However: If we open the discussion up to C++, then this is pretty easy to solve without any real syntactic or performance compromise: we can make a class/struct which defines implicit conversions to C strings, and disables any implicit conversions from C strings.


Good point about implicit conversions; I may actually be able to use that to clean up some code in the near future.

My preferred approach for handling this type of scenario is to write C code that is valid C++, while adding a bunch of C++-specific mostly-compile-time safety checks which can't be expressed in pure C. It's not perfect, but it has helped me reduce the impact of footguns without sacrificing the interoperability advantages of C.

(And as other commenters have noted, wrapping the char* in a struct addresses the immediate footgun in question.)


The right way to do this would be:

  typedef struct sds {
    char *s;
  } sds;
Now you get type safety: you can't accidentally pass a plain 'ole C string to an sds function. But you do need to access the 's' field to get the C string for passing to non-sds functions.

This is fine, but it somewhat defeats the purpose of the opaquely prefixed header. I think the idea is just to be able to treat an SDS like a normal string. Unfortunately, there's no way to have that property not be commutative in C.

Yeah, but almost perfect ergonomics is good enough, and type safety >> ergonomics.

It is not a real issue in modern days. You just enable address sanitizer in all your debug builds and these issues are obvious at runtime (still not perfect, but an improvement from dark days).

Address sanitizer can only catch errors on branches your debug build exercises.

For redis, I think given the code coverage on test cases, that shouldn't be a concern. TBH, I think any C code that doesn't have a coverage is not a functioning code. Too much bugs can only be caught at runtime (thankful to the sanitizers-family / valgrind).

> Considering this is C, there is no way to prevent this easily since you can't express a type that is one-way convertible (i.e. sds -> char* ok, char* -> sds not ok).

    typedef struct {
        char *s;
    } sds_t;
will produce an error if you try to pass a char* instead. http://codepad.org/CmrskN9n

You can "accidentally" do a lot of things that any compiler in the world won't warn you about. This isn't an excuse to not use a good library/tool.

My version of this (from 1992!) is here:

https://sourceforge.net/p/joe-editor/mercurial/ci/default/tr...

I also have NULL terminated arrays of strings, like arguments lists:

https://sourceforge.net/p/joe-editor/mercurial/ci/default/tr...

It's set up so that you could make a dynamic arrays of any types by copying the header and source files, but changing a few constants and providing comparison and duplication functions for the elements involved. It's like manual template instantiation.

Of course this is prone to memory leaks, same as sds. But there is branch here:

https://sourceforge.net/p/joe-editor/mercurial/ci/coroutine/...

In this version, all strings are allocated on an obstack. Space for temporary strings is automatically reclaimed when you return to the top level.

You can mark a string as permanent, then it will not be reclaimed, and instead has to be explicitly freed. So you can still have memory leaks but less likely, and also you could have accidental automatic freeing, but a lot of explicit frees are eliminated.

C++ strings are better... except that I have my own library for them also because I hate not being able to return NULL to indicate a failure. This is called the semi-predicate capability of C strings, and is easy to have in C++ with a different library.


Looks like a reinvention of BSTR? Using char instead of wchar_t. https://docs.microsoft.com/en-us/previous-versions/windows/d...

An unpopular opinion: just use the C++ as a better C and get firm static typechecking against this 'sds' type. And I personally would rather include const char* getUTF8StringPtr(sds) rather than an operator overload too.

When I have a major refactoring project, I typically will try to compile it with C++ first to help catch all the weird wobbly bits like this.

[EDIT: I now see in the comments other people with a similar opinion, so maybe not as unpopular as I would've thought!]


I wrote a library with similar functionality, also with variable-size header (smaller header for small strings, and bigger when growing). With both heap and stack allocation support (contiguous memory for headers and data), Unicode interoperability, and even data compression. Eventually I added support for other data types (vector, map, hash map, set, hash set, bit set). The SDS/SDS-2 is more suited for production, and this is not for recommending mine instead, but if someone wants to check a different implementation looking for ideas (BSD licensed, too):

https://github.com/faragon/libsrt


I rarely work with C but when I do I am always reminded at how painful it is to use strings.

Is this a complete drop-in replacement?? If so it could occasionally make my life much easier!


I don't find it painful at all. You need to think about the lifetime of your strings. The lifetime is usually convenient to split into two (consecutive) phases: the first phase is when the string gets constructed / modified, and the second phase is the immutable phase up until the string is released again. You'll often have separate string types for these phases: A string builder type for the mutable phase, and a simpler immutable type for the immutable phase.

Often the former will need just a single shared allocation (per thread). That works if you're only ever constructing a single string at a time.

Often the latter won't be dealing with memory management at all. You will often represent it by a plain char pointer and optionally a length field. Management-wise, at most you will need to construct it initially by making an immutable string from the mutable one, and a call to a release function when the string is no longer needed.

The fact that the string isn't modified in the immutable phase makes it trivial to pass it around as an ordinary char pointer (plus optional length), which is as convenient as it gets.

In rare cases you might want to make a hashmap for string interning. I've used interning in a past compiler project, where it was used to collapse identifier strings.

That will probably cover more than 95% of all your string needs. And it's very easy to be vastly more efficient than any from-the-shelf GC'ed string type with this. All you need to invest is to write a little near-boilerplatey code, but it's far from "hard".


This alternative (by me) covers char, int, etc with a compiletime inlined generic api much the same way SDS does strings

https://tse.gratis/aArray/

aStr("string"); then aMap aFold aConcat, etc, but aAppend(&array,'\0') needs to be done manually if passing to standard str functions. This is so long* or whatever arrays can have the same interface


Nice. I wrote myself up a quick C++ wrapper just now. Will play with this. Intend to use on embedded/bare-metal systems with no C stdlib, so will have to rip out the printfs and the like.

Wow, sounds neat! Look forward to seeing what you create

SDS works by allocating a chunk of memory that can fit a header + your string data. The APIs all consume pointers and internally it mutates it's copy of the pointer so it can get at the metadata. So it can be used to generate strings that are passed to other things. I don't think it's easy to use sds operations on other strings in a zero-copy way.

Reminds me of libowfat [0]'s stralloc/buffer. It's used in gatling[1], a small (LOC/binary) http/ftp/samba server simple nonblocking IO.

[0]: https://www.fefe.de/libowfat/ [1]: https://www.fefe.de/gatling/


You might look at the Linux version of CFLite, the Apple reference counted C library. Has other things than just strings which would be useful to a C programmer.

https://github.com/fjolnir/CoreFoundation-Lite-Linux


> This is achieved using an alternative design in which instead of using a C structure to represent a string, we use a binary prefix that is stored before the actual pointer to the string that is returned by SDS to the user.

I'm not convinced of the advantage in returning a char* to the user.

I see how it's convenient to be able to use all the built-in and other functions that accept a char* as an argument, like printf() or your favorite logging library.

BUT, you can only use the ones that treat the string as read-only. (You can't use strncat(), etc.) And you have no protection against messing this up.

Seems like a better trade-off would be to have a user-visible type that isn't char, then a user-visible function that converts that to char when you need it.

So instead of this:

    /* sds is just a typedef to char* */
    sds mystring = sdsnew("Hello World!");
    printf("%s\n", mystring);
    sdsfree(mystring);
You'd have a function like this:

    /* get read-only C-style string from an sds */
    const char *sdsC(const sds s);
And code like this:

    /* sds is its own distinct type, not another name for char* */
    sds mystring = sdsnew("Hello World!");
    printf("%s\n", sdsC(mystring));
    sdsfree(mystring);
Yes, it's more keystrokes, but surely the safety is worth it considering it is only needed when bridging a compatibility gap. (Also, possibly the sds functions could be a tiny bit more efficient if they aren't always doing conversions on their arguments.)

(I do like the idea of putting header and characters into one struct. That's probably good for efficiency compared to a struct that points to a buffer and gives the system a layer of pointer indirection to go through.)


There is also Better String Library, similar self-contained library with C-style string compatibility:

http://bstring.sourceforge.net/


TFA specifically names what is different between SDS and libraries like bstring:

> Normally dynamic string libraries for C are implemented using a structure that defines the string. The structure has a pointer field that is managed by the string function, so it looks like this:

    struct yourAverageStringLibrary {
        char *buf;
        size_t len;
        ... possibly more fields here ...
    };
> SDS strings as already mentioned don't follow this schema, and are instead a single allocation with a prefix that lives before the address actually returned for the string.

What the page doesn't really acknowledge is that its "single allocation with a prefix" design is significantly more dangerous than the "separate struct" design. Particularly in that it's incompatible with the address sanitizer.

Ah, juicy new ways to corrupt my stack :)

Joking aside, I assume this is char set agnostic?


Dovecot's dynamic string library facilities are pretty cool.

https://wiki.dovecot.org/Design/Strings


I tend to use libdjb[0] if I need to use strings in C. Is this library significantly easier?

[0] http://www.fefe.de/djb/


from the top of that page:

  (Note: This has not been touched since 2000, use [2] instead) 
[1] https://www.fefe.de/libowfat/

Yeah, I may have downloaded it from somewhere else, or just grabbed the string routines out of qmail or some other djb code. I tend to trust djb's code more than someone else's reimplementation of the same interface, which is what libowfat is since it is GPL. Anyway, it was several years ago already that I last did so (but later than 2000).

Also, in general, you can expect that if someone links to a site, they will have read the first few lines of that page, and the people who later follow that link will read them too.

This is a bit on the unsafe side since it blindly trusts user input. At minimum, there needs to be some kind of magic number in the struct header to validate its looking at the right memory. Best case, some kind of pointer accounting. Unfortunately, magic doesn't come for free.

We can optimize that, just put a function pointer in the beginning that's called every time to validate the string... /s

Would this work with the Boehm Garbage Collector given the "pointer to the middle of the object" trick?

The bdwgc has a #define to set whether or not it should check for interior pointers.

Very nice.

Should have been done 30 years ago.


Any thoughts on using glib vs sds for string manipulations?

UTF support?

Should just use the ICU string library. You get utf support and much more that is needed for modern string usage. https://unicode-org.github.io/icu-docs/apidoc/released/icu4c...

Exactly. You should not name a buffer lib "string", when it does not support the basic unicode operations: case fold, normalize => compare, search. In utf-8 of course.

I'm also missing stack allocation support, needed for fast short strings. It should be even included in sdsnew, for len < 128.


For stack allocation might as well just make a fixed sized char array then. sds by its design seems meant for heap strings (d in the name means dynamic).

How do you implement stack allocation support inside a C library function? How is de-allocation expressed?

alloca()?

No. Quoting the manual page:

The alloca() function allocates size bytes of space in the stack frame of the caller. This temporary space is automatically freed when the function that called alloca() returns to its caller.

That would not work, since the code calling alloca() is the library, and you want the to allocate on the stack frame of the function calling the library. I don't think that is possible, unless you make the library function into a macro (perhaps using ?: on the allocation size).


For C string library with stack allocation and some Unicode support you can check: https://github.com/faragon/libsrt

Unfortunately, yes. Limited usefulness at best without unicode support, at least to a degree. Even UTF-16 or 32 internally would suffice, treating UTF-8 only as ser/de format is good enough these days.

I don't get entirely why people want or expect to store UTF-32 in memory for any substantial period. It makes more sense to me as: if you need to process codepoint at a time, parse from UTF-8 or UTF-16 in a codepoint-at-a-time fashion, leaving only ~1 codepoint decoded at a time.

People seem to think that 1 codepoint = 1 integer frees you from thinking Unicode is hard. But 1 codepoint is not 1 glyph. You have combining characters, zero-width joiners [used also in emojis], RTL markers, Han unification, probably more. So you can't really think of a Unicode string as a random-access, one-glyph-per-unit type of thing in any encoding.

So I hope when people say "UTF-32 support" they mean "decode UTF-8".


The Unicode functions of Glk use UTF-32 (and numbers outside of Unicode range are used for function keys), and Glulx supports strings of 32-bit characters (usually Unicode, although this is not required) (strings of 8-bit characters are also supported, but this doesn't support UTF-8 or any other multibyte encoding). (However, strings in Glulx are usually huffed anyways, so other than the temporary buffers you would ordinarily store a variable number of bits per character (or sometimes a string of several characters has its own code), and can be shorter than using UTF-8 or UTF-32 or other uncompressed encodings.)

The stuff you mention does make Unicode very messy though.


UTF-8 is the preferred internal storage format for most applications. The reason is space efficiency.

That and the perceived simplicity of UTF16/32 is not actually the case so why bother if you have to do it the hard way anyway.

I certainly have leaned on UTF-8 but I wonder how efficient it is for the numerically-higher code points if the language is not heavily cp1252, like Korean.

An interesting - but not surprising - thing about this is that compression algorithms can be more efficient on wider representations of numerically-high code points (e.g, for some Korean corpus, using UTF-32 instead of UTF-8 improves LZMA compression by ~10%).

How well does that corpus compress with LZMA if using a Korean specific character code (such as EUC-KR)? And what about other combinations, with other character codings and other compression algorithms?

EUC-KR doesn't improve much with LZMA (2% over UTF-16), but is better with gzip-9 (10% over UTF-16). I haven't studied this extensively, just did a few tests when waiting for it to download.

As much as C strings support it.



Guidelines | FAQ | Support | API | Security | Lists | Bookmarklet | Legal | Apply to YC | Contact

Search: