
Simple Dynamic Strings library for C, compatible with null-terminated strings - pcr910303
https://github.com/antirez/sds
======
antirez
Hi, author here. It may make sense to put SDS in perspective given a few
comments I'm reading here.

1\. Yep, more than "strings", SDS may be considered a library for dynamic
buffers, especially by people coming from C++ or higher level languages.
However I think that for C, it makes sense to provide a very low level thing
like that.

2\. In practice, if you see how SDS is used (extensively) inside Redis, it
normally models things where you would do realloc magics, pointer math, and so
forth: from client output buffers, to actual strings to accumulate error
messages, to the Redis String object itself.

3\. Probably a UTF-32 layer could be implemented on top of that, but I would
keep the two modules completely separate from the POV of the implementation,
like having an additional utf32.c file that provides additional interfaces on
top of SDS.

4\. The peculiar header-before-pointer approach used by SDS creates some
problems with Valgrind and similar tools, however Valgrind will just report
"possibly lost", which are messages mostly safe to ignore. In exchange, being
able to use SDS strings directly with C library functions is very handy.

SDS strings are not perfect and tend to be extremely optimized for Redis. If
you look at the API, they allow you to do things that are very low level, like
pre-allocating the internal buffers to improve performance when you know you
are going to read a big chunk of data from a socket or the like. However, while
imperfect and very tuned, these kinds of libraries show how much you can easily
improve C, with little work, and how many unsafe things in C are about lack of
abstractions.

The value of SDS is "less is more": the simple API they provide pretends that
SDSs are just plain-strings++. You can see this in a few features: the plain C
pointer interface, and the policy of always terminating the string. They need
more love anyway, and to be less specialized for Redis, adding more useful
APIs. Maybe at some point I'll find the time.
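To make the header-before-pointer idea concrete, here is a minimal sketch. It is not the real SDS code: the `hstr` names, the two-field header, and the doubling growth policy are all assumptions chosen for illustration. It only shows how one allocation can hold a length header followed by the bytes, while the caller holds a plain `char *` to the bytes.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical names throughout: this is NOT the real SDS implementation. */
typedef char *hstr;            /* user-visible handle: a plain char pointer */

struct hstr_hdr {              /* header stored immediately before the bytes */
    size_t len;                /* bytes in use, excluding the terminator */
    size_t cap;                /* bytes available, excluding the terminator */
};

static struct hstr_hdr *hstr_header(hstr s) {
    assert(s != NULL);
    return (struct hstr_hdr *)(s - sizeof(struct hstr_hdr));
}

/* Allocate header + data in one block and return a pointer to the data. */
hstr hstr_new(const char *init) {
    size_t len = strlen(init);
    struct hstr_hdr *h = malloc(sizeof *h + len + 1);
    if (h == NULL) return NULL;
    h->len = len;
    h->cap = len;
    char *data = (char *)(h + 1);
    memcpy(data, init, len + 1);       /* copies the terminator too */
    return data;
}

size_t hstr_len(hstr s) { return hstr_header(s)->len; }

/* Append t; the block may move, so callers must use the return value. */
hstr hstr_cat(hstr s, const char *t) {
    struct hstr_hdr *h = hstr_header(s);
    size_t add = strlen(t);
    if (h->len + add > h->cap) {
        size_t cap = (h->len + add) * 2;   /* assumed growth policy */
        struct hstr_hdr *nh = realloc(h, sizeof *nh + cap + 1);
        if (nh == NULL) return NULL;       /* original string is still valid */
        h = nh;
        h->cap = cap;
    }
    char *data = (char *)(h + 1);
    memcpy(data + h->len, t, add + 1);
    h->len += add;
    return data;
}

void hstr_free(hstr s) {
    if (s != NULL) free(hstr_header(s));
}
```

Because the handle points at the bytes, it can go straight to `printf("%s")` or `strcmp()`. The flip side is that nothing stops a caller from passing an ordinary `char *` to `hstr_len()`, which then reads memory before the string.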

~~~
poiuyt098
> show how much you can easily improve C, with little work, and how many
> unsafe things in C are about lack of abstractions

C has always badly needed built-in strings (and arrays with size info,
generally).

To save a byte, C designers committed The Most Expensive One-byte Mistake
[https://queue.acm.org/detail.cfm?id=2010365](https://queue.acm.org/detail.cfm?id=2010365)

~~~
m463
A shortcoming (or outright failure) of C is the lack of an extensive
"batteries included" library.

Maybe not C + Knuth, but some way of portably advancing the language over
time.

~~~
ndesaulniers
Why? So we could be stuck with horrible interfaces forever (most of string.h
comes to mind)? No thanks.

~~~
m463
don't you think it could evolve?

------
pingyong
Just looking at the API "sds" seems to just be a typedef for char* -
unfortunately, that means that accidentally passing a char* as an sds into any
of the functions will be instant UB and not even a compiler warning.

Considering this is C, there is no way to prevent this easily since you can't
express a type that is one-way convertible (i.e. sds -> char* ok, char* -> sds
not ok).

I do have to wonder though if avoiding a couple of .str's is really worth the
risk. You definitely have to always keep in mind to never accidentally mix a
char* with an sds, for the advantage of slightly cleaner looking code.

~~~
antirez
One of the main points of the library is the ability to pass SDSs where char*
is expected without doing anything. So this is a "feature" in the author's
spirit.

~~~
electrograv
As another reply has suggested, you can design the wrapper struct so you can
simply write:

    
    
      legacy_fn(my_text.s);
    

Versus the current:

    
    
      legacy_fn(my_text);
    

I don’t think saving two characters per legacy function call is even remotely
worth the loss of static type safety (which risks serious memory corruption
and/or security holes, which are entirely preventable at compile-time in this
way).

In fact, I even find the explicitness _more_ pleasantly and clearly readable:
I like being able to know at-a-glance when types are changing, especially in a
language as unsafe as C.
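A minimal sketch of the wrapper-struct approach, for comparison. The `text` type and its functions are hypothetical and not part of SDS; `legacy_fn` stands in for any existing function that takes a plain C string.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical wrapper type, not part of SDS. */
typedef struct {
    char *s;        /* NUL-terminated bytes, usable by legacy functions */
    size_t len;     /* cached length, excluding the terminator */
} text;

text text_new(const char *init) {
    text t;
    t.len = strlen(init);
    t.s = malloc(t.len + 1);
    if (t.s != NULL) {
        memcpy(t.s, init, t.len + 1);
    } else {
        t.len = 0;  /* allocation failed: caller should check t.s */
    }
    return t;
}

size_t text_len(text t) { return t.len; }

void text_free(text t) { free(t.s); }

/* Stand-in for any existing function that takes a plain C string. */
size_t legacy_fn(const char *p) {
    assert(p != NULL);
    return strlen(p);
}

/* text_len("oops");    does not compile: a char * is not a text      */
/* legacy_fn(my_text);  does not compile either: write my_text.s      */
```

Both mistakes in the trailing comments are rejected by the compiler because a struct and a pointer are incompatible types, which is the static check the `typedef char *` design gives up.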

Lastly, if you can tolerate just using some of C++'s features, you can define
a no-compromise solution: A type (still represented by a single pointer
under the hood) that will implicitly convert (with zero runtime cost) into a C
string _but not vice versa._

~~~
jhallenworld
A main problem with C++ is that it uses .c_str() instead of something nice
like .s for this.

------
jhallenworld
My version of this (from 1992!) is here:

[https://sourceforge.net/p/joe-
editor/mercurial/ci/default/tr...](https://sourceforge.net/p/joe-
editor/mercurial/ci/default/tree/joe/vs.h)

I also have NULL terminated arrays of strings, like arguments lists:

[https://sourceforge.net/p/joe-
editor/mercurial/ci/default/tr...](https://sourceforge.net/p/joe-
editor/mercurial/ci/default/tree/joe/va.h)

It's set up so that you can make dynamic arrays of any type by copying
the header and source files, changing a few constants, and providing
comparison and duplication functions for the elements involved. It's like
manual template instantiation.

Of course this is prone to memory leaks, same as sds. But there is a branch
here:

[https://sourceforge.net/p/joe-
editor/mercurial/ci/coroutine/...](https://sourceforge.net/p/joe-
editor/mercurial/ci/coroutine/tree/joe/obj.h)

In this version, all strings are allocated on an obstack. Space for temporary
strings is automatically reclaimed when you return to the top level.

You can mark a string as permanent, then it will not be reclaimed, and instead
has to be explicitly freed. So you can still have memory leaks but less
likely, and also you could have accidental automatic freeing, but a lot of
explicit frees are eliminated.
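The reclaim-at-top-level idea can be sketched with a simple bump arena. This is not joe's obstack code: the `arena` type, its fixed 4096-byte size, and the mark/release functions are assumptions made for illustration, and the "permanent string" feature is left out.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical fixed-size arena; joe's obstack code is not shown here. */
#define ARENA_SIZE 4096

typedef struct {
    char bytes[ARENA_SIZE];
    size_t used;
} arena;

void arena_init(arena *a) { a->used = 0; }

/* Copy a string into the arena; returns NULL when the arena is full. */
char *arena_strdup(arena *a, const char *s) {
    size_t n = strlen(s) + 1;
    if (n > ARENA_SIZE - a->used) return NULL;
    char *p = a->bytes + a->used;
    memcpy(p, s, n);
    a->used += n;
    return p;
}

/* Remember the current position... */
size_t arena_mark(const arena *a) { return a->used; }

/* ...and later drop every string allocated after it in one step. */
void arena_release(arena *a, size_t mark) {
    assert(mark <= a->used);
    a->used = mark;
}
```

A top-level loop would take a mark before handling a command and release back to it afterwards, so temporary strings need no individual frees. Any pointer into the released region is dangling afterwards, which is the "accidental automatic freeing" risk mentioned above.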

C++ strings are better... except that I have my own library for them also
because I hate not being able to return NULL to indicate a failure. This is
called the semi-predicate capability of C strings, and is easy to have in C++
with a different library.

------
mehrdadn
Looks like a reinvention of BSTR? Using char instead of wchar_t.
[https://docs.microsoft.com/en-us/previous-
versions/windows/d...](https://docs.microsoft.com/en-us/previous-
versions/windows/desktop/automat/bstr)

------
faragon
I wrote a library with similar functionality, also with variable-size header
(smaller header for small strings, and bigger when growing). With both heap
and stack allocation support (contiguous memory for headers and data), Unicode
interoperability, and even data compression. Eventually I added support for
other data types (vector, map, hash map, set, hash set, bit set). The
SDS/SDS-2 is more suited for production, and this is not for recommending mine
instead, but if someone wants to check a different implementation looking for
ideas (BSD licensed, too):

[https://github.com/faragon/libsrt](https://github.com/faragon/libsrt)

------
yrro
Dovecot's dynamic string library facilities are pretty cool.

[https://wiki.dovecot.org/Design/Strings](https://wiki.dovecot.org/Design/Strings)

------
salgernon
An unpopular opinion: just use C++ as a better C and get firm static
typechecking against this 'sds' type. And I personally would include
const char* getUTF8StringPtr(sds) rather than an operator overload too.

When I have a major refactoring project, I typically will try to compile it
with C++ first to help catch all the weird wobbly bits like this.

[EDIT: I now see in the comments other people with a similar opinion, so maybe
not as unpopular as I would've thought!]

------
azhenley
I rarely work with C but when I do I am always reminded of how painful it is
to use strings.

Is this a complete drop-in replacement?? If so it could occasionally make my
life much easier!

~~~
tsegratis
This alternative (by me) covers char*, int*, etc. with a compile-time inlined
generic API, much the same way SDS does strings

[https://tse.gratis/aArray/](https://tse.gratis/aArray/)

aStr("string"); then aMap aFold aConcat, etc, but aAppend(&array,'\0') needs
to be done manually if passing to standard str functions. This is so long* or
whatever arrays can have the same interface

~~~
cmrdporcupine
Nice. I wrote myself up a quick C++ wrapper just now. Will play with this.
Intend to use on embedded/bare-metal systems with no C stdlib, so will have to
rip out the printfs and the like.

~~~
tsegratis
Wow, sounds neat! Look forward to seeing what you create

------
adrianmonk
> _This is achieved using an alternative design in which instead of using a C
> structure to represent a string, we use a binary prefix that is stored
> before the actual pointer to the string that is returned by SDS to the
> user._

I'm not convinced of the advantage in returning a char* to the user.

I see how it's convenient to be able to use all the built-in and other
functions that accept a char* as an argument, like printf() or your favorite
logging library.

BUT, you can only use the ones that treat the string as read-only. (You can't
use strncat(), etc.) And you have no protection against messing this up.

Seems like a better trade-off would be to have a user-visible type that isn't
char*, then a user-visible function that converts that to char* when you need
it.

So instead of this:

    
    
        /* sds is just a typedef to char* */
        sds mystring = sdsnew("Hello World!");
        printf("%s\n", mystring);
        sdsfree(mystring);
    

You'd have a function like this:

    
    
        /* get read-only C-style string from an sds */
        const char *sdsC(const sds s);
    

And code like this:

    
    
        /* sds is its own distinct type, not another name for char* */
        sds mystring = sdsnew("Hello World!");
        printf("%s\n", sdsC(mystring));
        sdsfree(mystring);
    

Yes, it's more keystrokes, but surely the safety is worth it considering it is
only needed when bridging a compatibility gap. (Also, possibly the sds
functions could be a tiny bit more efficient if they aren't always doing
conversions on their arguments.)

(I do like the idea of putting header and characters into one struct. That's
probably good for efficiency compared to a struct that points to a buffer and
gives the system a layer of pointer indirection to go through.)
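One way to get both properties (header and characters in a single allocation, plus a handle that is not a `char *`) is a C99 flexible array member. This is only a sketch under that assumption; the `dstr` names are hypothetical, and `dstr_c()` plays the role of the proposed `sdsC()`.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical type: header and characters share one allocation, but the
   handle is a struct pointer, so it cannot be confused with a char *. */
typedef struct dstr {
    size_t len;     /* length excluding the terminator */
    char buf[];     /* C99 flexible array member, always NUL-terminated */
} dstr;

dstr *dstr_new(const char *init) {
    size_t len = strlen(init);
    dstr *d = malloc(sizeof *d + len + 1);
    if (d == NULL) return NULL;
    d->len = len;
    memcpy(d->buf, init, len + 1);
    return d;
}

/* Read-only view for printf() and friends, in the spirit of sdsC(). */
const char *dstr_c(const dstr *d) {
    assert(d != NULL);
    return d->buf;
}

size_t dstr_len(const dstr *d) {
    assert(d != NULL);
    return d->len;
}

void dstr_free(dstr *d) { free(d); }
```

With this layout, `printf("%s\n", dstr_c(mystring))` works as in the example above, and because the view is `const char *`, passing it as the destination of `strncat()` is a constraint violation that compilers diagnose.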

------
paulsmith
There is also the Better String Library, a similar self-contained library with
C-style string compatibility:

[http://bstring.sourceforge.net/](http://bstring.sourceforge.net/)

~~~
aidenn0
TFA specifically names what is different between SDS and libraries like
bstring:

> Normally dynamic string libraries for C are implemented using a structure
> that defines the string. The structure has a pointer field that is managed
> by the string function, so it looks like this:
    
    
        struct yourAverageStringLibrary {
            char *buf;
            size_t len;
            ... possibly more fields here ...
        };
    

> SDS strings as already mentioned don't follow this schema, and are instead a
> single allocation with a prefix that lives before the address actually
> returned for the string.

~~~
dho85
What the page doesn't really acknowledge is that its "single allocation with a
prefix" design is significantly more dangerous than the "separate struct"
design. Particularly in that it's incompatible with the address sanitizer.

------
beders
Ah, juicy new ways to corrupt my stack :)

Joking aside, I assume this is char set agnostic?

------
gpvos
I tend to use libdjb[0] if I need to use strings in C. Is this library
significantly easier?

[0] [http://www.fefe.de/djb/](http://www.fefe.de/djb/)

~~~
gpderetta
from the top of that page:

    
    
      (Note: This has not been touched since 2000, use [2] instead) 
    

[1] [https://www.fefe.de/libowfat/](https://www.fefe.de/libowfat/)

~~~
gpvos
Yeah, I may have downloaded it from somewhere else, or just grabbed the string
routines out of qmail or some other djb code. I tend to trust djb's code more
than someone else's reimplementation of the same interface, which is what
libowfat is since it is GPL. Anyway, it was several years ago already that I
last did so (but later than 2000).

------
reza_n
This is a bit on the unsafe side since it blindly trusts user input. At
minimum, there needs to be some kind of magic number in the struct header to
validate it's looking at the right memory. Best case, some kind of pointer
accounting. Unfortunately, magic doesn't come for free.
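A rough sketch of what a magic number in the header could look like. Nothing here is SDS API: the `mstr` names, the tag value, and the 32-bit length field are assumptions for illustration. The cost is visible in the code: four extra bytes per string and a compare on every checked access.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define MSTR_MAGIC 0x53445321u   /* arbitrary tag chosen for this sketch */

struct mstr_hdr {
    uint32_t magic;              /* set at allocation, cleared on free */
    uint32_t len;                /* length excluding the terminator */
};

char *mstr_new(const char *init) {
    size_t len = strlen(init);
    if (len > UINT32_MAX) return NULL;
    struct mstr_hdr *h = malloc(sizeof *h + len + 1);
    if (h == NULL) return NULL;
    h->magic = MSTR_MAGIC;
    h->len = (uint32_t)len;
    memcpy(h + 1, init, len + 1);
    return (char *)(h + 1);
}

/* Returns 1 if the bytes before s carry the tag. Only meaningful when those
   bytes are readable: for an arbitrary char * this read is itself undefined. */
int mstr_looks_valid(const char *s) {
    struct mstr_hdr h;
    memcpy(&h, s - sizeof h, sizeof h);
    return h.magic == MSTR_MAGIC;
}

void mstr_free(char *s) {
    if (s == NULL) return;
    struct mstr_hdr *h = (struct mstr_hdr *)(s - sizeof *h);
    assert(h->magic == MSTR_MAGIC);
    h->magic = 0;     /* makes a later double free or stale use detectable */
    free(h);
}
```

The check is probabilistic: it catches many accidental mix-ups in debug builds, but a foreign pointer can still happen to sit after matching bytes, and reading before an unrelated buffer is already out of bounds.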

~~~
vardump
We can optimize that, just put a function pointer in the beginning that's
called every time to validate the string... /s

------
Animats
Very nice.

Should have been done 30 years ago.

------
wnoise
Would this work with the Boehm Garbage Collector given the "pointer to the
middle of the object" trick?

~~~
aidenn0
The bdwgc has a #define to set whether or not it should check for interior
pointers.

------
cocoa19
Any thoughts on using glib vs sds for string manipulations?

------
Keyframe
UTF support?

~~~
rurban
Exactly. You should not name a buffer lib "string", when it does not support
the basic unicode operations: case fold, normalize => compare, search. In
utf-8 of course.

I'm also missing stack allocation support, needed for fast short strings. It
should be even included in sdsnew, for len < 128.

~~~
Keyframe
Unfortunately, yes. Limited usefulness at best without unicode support, at
least to a degree. Even UTF-16 or 32 internally would suffice; treating UTF-8
only as a ser/de format is good enough these days.

~~~
jstimpfle
UTF-8 is the preferred internal storage format for most applications. The
reason is space efficiency.

~~~
cardiffspaceman
I certainly have leaned on UTF-8 but I wonder how efficient it is for the
numerically-higher code points if the language is not heavily cp1252, like
Korean.
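For a rough sense of the numbers, the per-code-point byte counts follow directly from the encoding rules. The helper names below are made up for this sketch. Hangul syllables (U+AC00 to U+D7A3) sit in the three-byte UTF-8 range but are a single 16-bit unit in UTF-16, so mostly-Korean text is about 50% larger in UTF-8, while ASCII-heavy markup around it is half the size.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Bytes needed to encode one Unicode scalar value in UTF-8. */
size_t utf8_len(uint32_t cp) {
    assert(cp <= 0x10FFFF);
    if (cp < 0x80) return 1;       /* ASCII */
    if (cp < 0x800) return 2;      /* Latin supplements, Greek, Cyrillic... */
    if (cp < 0x10000) return 3;    /* rest of the BMP, including Hangul */
    return 4;                      /* supplementary planes */
}

/* Bytes needed in UTF-16: 2 inside the BMP, 4 for a surrogate pair. */
size_t utf16_len(uint32_t cp) {
    assert(cp <= 0x10FFFF);
    return cp < 0x10000 ? 2 : 4;
}
```

For example, `utf8_len(0xD55C)` (the syllable 한) is 3 while `utf16_len(0xD55C)` is 2; for `'A'` (U+0041) the sizes are 1 and 2 respectively.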

~~~
pstch
An interesting - but not surprising - thing about this is that compression
algorithms can be more efficient on wider representations of numerically-high
code points (e.g., for some Korean corpus, using UTF-32 instead of UTF-8
improves LZMA compression by ~10%).

~~~
zzo38computer
How well does that corpus compress with LZMA if using a Korean specific
character code (such as EUC-KR)? And what about other combinations, with other
character codings and other compression algorithms?

~~~
pstch
EUC-KR doesn't improve much with LZMA (2% over UTF-16), but is better with
gzip-9 (10% over UTF-16). I haven't studied this extensively, just did a few
tests when waiting for it to download.

