
Simple Dynamic Strings library for C - antirez
https://github.com/antirez/sds
======
dmunoz
I think the library is great, and don't have a whole lot to say about it, but
wanted to mention one tangentially related thing:

The README on this repo is awesome. Opening up with advantages and
disadvantages? Awesome. Plenty of code examples covering all of the major use
cases? Awesome. Quick overview of the internals? AWESOME! Quick two line note
about how to use the library in your project? Awesome.

I'm tempted to rant about how I wish documentation was taken more seriously,
and that programmers seem to make it a point of pride that spending the first
half hour with a library figuring out how to actually use it is just something
we have to deal with as programmers, but I won't do so aside from this single
sentence.

~~~
wting
man pages used to be treated with the same reverence, but nowadays nobody
seems to bother. I have pretty extensive man pages for some of my open source
stuff, but it doesn't seem to help with users downstream.

~~~
deeviant
To be honest, I pretty much hate man pages.

They almost never have examples(Is that some sort of rule?), they have no
"quick summary of the shit you use 95% of the time", and they get generally
written as a novel, seemingly from the perspective of the developer of the
util rather than consumer.

Mind you, I have used man pages many times before, but only because it was the
best source of information, not because it was a particularly efficent one.

The Markdown README of this string library is a thousand times better than any
man page I have seen.

~~~
raverbashing
Man pages are bad but mostly useful

INFO pages on the other hand, are terrible. Is there something _less_
intuitive than the Info reader?

Navigation is awful

I think the RHIDE IDE had a better Info reader, IIRC, that one was useful.

~~~
sdegutis
I'm not sure it was ever meant to be intuitive or easy to use. As far as I can
tell, INFO pages (and much of UNIX) was intended to be a stop-gap solution
until something better and more permanent was written atop them. Unfortunately
that dream was never realized, and Linux's accidental popularity standardized
what was supposed to be a bunch of building blocks. I could be wrong though,
but almost every utility in UNIX screams this to me.

~~~
DougMerritt
INFO is a Richard Stallman thing, it's a GNU thing, it is emphatically NOT a
Unix thing.

Stallman wanted to replace the Unix thing (which was always man pages) with
INFO, and for many years deprecated man pages, which is part of why many _GNU_
man pages are sub-par.

Not that non-GNU man pages are perfect, but still.

As for Unix being intended to be a stop-gap, your impression is simply
historically incorrect, aside from philosophical issues like the claims made
in the infamous Gabriel essay "Worse is Better".

> I could be wrong though, but almost every utility in UNIX screams this to
> me.

Unix/Linux is certainly not perfect, but this simply reflects the truth of
Henry Spencer's aphorism, "Those who don't understand Unix are condemned to
reinvent it, poorly."

People who think Unix got it all wrong, as opposed to merely having assorted
warts, should read Raymond's "Art of Unix Programming".

I was more than a little startled that Raymond captured a lot of the truth of
the subject; it's a good read, and can potentially make anyone a better
programmer.

Edit: a more concise starting point:
[http://en.wikipedia.org/wiki/Unix_philosophy](http://en.wikipedia.org/wiki/Unix_philosophy)

~~~
luckydude
+1

Info really sucks compared to decent man pages. Sun did good man pages, go
look at them, they are quite good.

A lot of the man page hate can be traced back to crappy gnu man pages that
were just trying to get you to use info.

Info is cool, it's sort of like a web in text, but it isn't a replacement for
a decently written concise man page.

------
clarry
Mixing signed and unsigned arithmetic operands.

Not doing trivial overflow checking before arithmetic.

Using results from said arithmetic as indices or starting points for memory
writes.

My gut says that any application using this library to process untrusted input
is exploitable.

Of course, it was never advertised as being secure :-)

Which is unusual, as many "better strings" libraries make claims about the
security of the traditional way of string handling (and then go on to do it
wrong).

Edit: To give a concrete example, someone here recommended BString just a few
days ago. That library has a "security statement" in which it claims to
prevent overflow type attacks by _using signed integers instead of size_t_ ,
and then checking whether the result is negative after arithmetic. But signed
overflow is _undefined behavior_ in C, and these are not guaranteed to wrap
around. And yes such things have been exploited. I don't know how likely or
easy bstring is to "own" but in any case it's doinnit wrong.

And yes I checked the code, it does what it claims to do..

~~~
antirez
Hello clarry, I don't think there are APIs that if not grossly misused will
lead to security issues in general. There is one issue I'm aware of that
perhaps is not easily exploitable but surely is unexpected behavior, that is,
overflow when you reach a 2GB string: that requires a check in
sdsMakeRoomFor() for sure.

Note that this could be fixed by using uint64_t type for example in the
header, however in the original incarnation of SDS inside Redis this was not
possible for memory usage concerns. In the standalone library I believe it is
instead a great idea, since 2GB is no longer an acceptable limit.

From the point of view of the security the most concerning function looks like
sdsrange(), there are sanity checks in place to avoid that indexes received
from the outside can lead to issues, but I'll do a security review in the
future in order to formally check for issues.

~~~
eliteraspberrie
If it's any reassurance, I had a look at it, and although there is unnecessary
signed/unsigned arithmetic, I could not find anything exploitable. Change the
types of variables representing an object size, length, array index, and so on
to size_t (from int) and it will be fine.

~~~
asveikau
Might be acceptable for an individual project where you can make statements
about the sizes that will be used ahead of time, but for a general purpose
library your statement is absolutely false. The issue is that your size
requirement can overflow size_t, malloc gets passed a smaller value than you
really intended, then if you try to use what you think you allocated
(remember, it succeeded with a smaller size), it's a derefence outside the
bounds of the allocation. Changing the size or type of the length variable
won't solve that. This is a common and well known vulnerability pattern - it
is disappointing to see folks so dismissive of it.

~~~
eliteraspberrie
I'm aware of arithmetic overflow, thanks. It was the subject of my research
(which will eventually be open source, watch
www.peripetylabs.com/publications/ if you're interested). See my sizeop
library in the meantime:
[https://bitbucket.org/eliteraspberries/sizeop](https://bitbucket.org/eliteraspberries/sizeop)

I am not dismissive of the problem. I just know from experience that the
chances of code like that being fixed at all, let alone correctly, is near
zero.

------
antirez
I forked sds.c from Redis given that it was very useful for the project for
years, so I guess it may be useful in other contexts as well. There are a few
changes at API level compared to the Redis sds.c file, however most of the
work went into documenting it.

------
unwind
Nice, thanks for contributing (even more).

I haven't read much of the code at all, but a minor suggestion would be to
change:

    
    
        struct sdshdr *sh = (void*)(s-(sizeof(struct sdshdr)));
    

to

    
    
        struct sdshdr *sh = (void*)(s-sizeof *sh);
    

since the entire idea is that the pointer on the left-hand side is to the type
whose size should be subtracted, I think it's better not to repeat the type
but to "lock it" to the pointer instead. This also (of course) means we can
drop the parenthesis with sizeof, since those are only needed when its
argument is a type name.

~~~
simias
Heh, you're the first person I meet who advocates dropping the parens around
the sizeof argument in any situation. That does look terribly alien to me.

~~~
adestefan
sizeof is a unary operator and not a function. Only types need to be enclosed
in parens. Most people don't know this.

~~~
unwind
Exactly! For some reason this anti-pattern is very thoroughly entrenched among
C programmers and it's so very frustrating.

My other favorite is casting the return value of malloc(), something you see a
great deal of and that I always oppose. See
[http://stackoverflow.com/questions/605845/do-i-cast-the-
resu...](http://stackoverflow.com/questions/605845/do-i-cast-the-result-of-
malloc) for my best arguments.

~~~
Narishma
Casting the return value of malloc is necessary in C++. I think that's why
it's so widespread, so you can compile your C program with a C++ compiler.

------
Marat_Dukhan
This is neat, but please add a pointer to deallocation function to sds header
and use it in sdsfree. This will allow to return sds strings from functions in
shared libraries. A common problem it that the user of the library might use a
different version of libc than the shared library (this is especially typical
on Windows), and when it calls sdsfree on an sds string allocated with a
different libc, something awful will happen (in the best case, it will just
leak memory). By storing a pointer to the deallocation function in the code
which allocated the string you can make sure that it is always released by the
same libc version that allocated it.

------
shadowfox
At some point in time, I found this list of C String libraries with some
comparisons:
[http://www.and.org/vstr/comparison](http://www.and.org/vstr/comparison)

The (different) approaches taken by many of these libraries are interesting.

~~~
emmelaich
A great list.

I'd add the str bits of [http://libslack.org/](http://libslack.org/) to that
list.

------
Aardwolf
The biggest missing thing in C imho is destructors, because you need to
manually clean up everything whenever you leave a scope or function. That
means calling "free" at every exit point.

Or alternatively, putting one "cleanup:" label at the end of the function and
using "goto cleanup;" instead of break or return everywhere.

But goto is considered harmful, so that handy pattern seems unclean.

My question is: do you consider the "goto cleanup" pattern clean or not, and
if not, what are better alternatives?

~~~
antirez
Goto for cleanup is not bad IMHO. I use it extensively...

$ cd redis; grep goto *.c | wc -l 251

All the instances are like:

    
    
        goto cleanup;
        goto error;
        goto badfmt;
        goto numargserr;
    

In this context is easy to read and makes the code structure better.

~~~
coherentpony
I don't see the difference between this and

    
    
        cleanup(state);
        error(state);
    

etc.

Just pull them each out into a function. You're still DRY and don't lose flow
control.

In my honest opinion, there is never a _need_ for goto.

~~~
garenp
In performance critical nested loops, goto is the only way to exit early. A
typical example here is performing a matrix multiply. In cases like this, it
is absolutely necessary, because C doesn't provide a more granular 'break'.

~~~
dllthomas
You could leave nested loops with a condition flag, which could produce
something semantically equivalent to the goto. It seems like the logic here
might be simple enough that the actual check could be eliminated in the
generated code by a Sufficiently Smart Compiler (that could actually exist),
if there was sufficient demand for it. I'm not sure it's actually more
readable, though.

~~~
garenp
Yeah, of course I'm aware, it's just that there is a time/space trade-off, so
I think folks who argue that goto should never be used or has no legitimate
uses, are just taking too much of a hard-line.

And I think it just highlights a lack of understanding about what might make
goto "bad" in programming--IMHO it's "bad" when it makes control flow more
complex, which undermines maintainability. Common "cleanup" idioms that only
jump forward are not that bad, because they _don 't_ make control flow
particularly more difficult to follow. (Whereas jumping backwards, especially
across many state changes, can be very hard to follow.)

~~~
dllthomas
_" it's just that there is a time/space trade-off,"_

I don't see a time-space tradeoff here. Compiled naively, I expect the version
with flags to be slower and take more space. Compiled sufficiently smartly, I
expect them to be equivalent. I expect real compilers to be close enough to
the latter for most but not all purposes, but I think _if the flag version was
more readable and maintainable_ the place to focus (collectively, medium term)
would be on making compilers smart enough. Obviously, in the short term on
specific projects you do what you need to.

 _" I think folks who argue that goto should never be used or has no
legitimate uses, are just taking too much of a hard-line."_

I agree, but that's because there are places where use of Go To makes things
more readable and maintainable, not - primarily - because they are actually
_needed_ , per se, even with performance constraints (in the overwhelming
majority of cases).

 _' And I think it just highlights a lack of understanding about what might
make goto "bad" in programming--IMHO it's "bad" when it makes control flow
more complex, which undermines maintainability.'_

I agree wholeheartedly.

------
wbond
Recently I've been getting into some C and having programmed in a number of
other languages know enough to be sure I wanted to have a solid understanding
of properly handling encodings and strings in general. I've obtained a good
understanding, but now I am looking to abstract some of the low-level
mechanics. Unfortunately most of the libraries I have seen (such as bstring)
use structs. It was nice to see this approach taken. Thanks for extracting it
and making it easy to find and learn about!

On a related note - does anyone have recommendations for a similar small
library for dealing with conversions from char * to wchar_t * and basic
encoding duties? I'm working cross-platform, and so far I've stitched together
some functions wrapping stuff like wcsrtombs() and WideCharToMultiByte().

------
_paulc
Excellent news - the Redis source code is a really good source of high quality
dependency-free C code and tends to be the first place I look.

Some time ago I forked the SDS code and added a set of additional utility
functions around this for a project I was working on at the time (basic file
reading, regex, LZF compression, Blowfish encrypt/decrypt, SHA256 etc).

This was from a fairly old SDS version so this looks like a good opportunity
to sync up with the library version.

Repository is here:
[https://github.com/paulchakravarti/sdsutils](https://github.com/paulchakravarti/sdsutils)

------
eis
Disadvantage #1 can be simply solved by using a pointer to the "sds" type
instead.

Then you can be sure that

    
    
      sdscat(&s, "Some more data")
    

updates s to always point at the right memory address and you can't introduce
hard to find bugs by forgetting to assign to s which the compiler wont warn
you about. If you'd pass just "s" instead of "&s" as the first parameter, the
compiler would error out.

So all functions modifying the string should take a pointer to it.

------
sdegutis
My biggest gripe with C's string functions is that it's really easy to make
off-by-one errors when trying to do anything with two strings, due to NUL and
the way they seem to handle it inconsistently. I'm constantly having to check
the docs for any given string function to find out how it handles NUL (which
isn't always in an obvious spot thanks to the design of man pages). And once
I've found it, I have to come up with some contrived example that usually
needs to be written in a comment just to make sure I've used its intended
algorithm correctly and didn't cause an off-by-one error. If this library
hides all that for me, I'm sold.

------
FooBarWidget
Sigh, another string library. And this is the reason why I prefer C++ over C.
You don't end up writing _another_ string library for the 300th time. There
aren't that many differences between string libraries anyway. Most of them are
just structs with a pointer to a memory block, plus a length field.

~~~
antirez
Yep, but this one has not the usual layout of struct+pointer (that's the whole
point), and I'm not sure it qualifies as "yet another" since it was written in
2006.

~~~
pacaro
It appears to me to be "yet another" because the length before the string
approach is usually referred to as a b-string.

windows (for example) has had a comprehensive b-string library (type is called
BSTR) for about 20 years - due to it's age and provenance it has the downside
of thinking a character is a 16-bit value...

~~~
shadowfox
Just to add a reference to the parent:
[http://www.johndcook.com/cplusplus_strings.html](http://www.johndcook.com/cplusplus_strings.html)

(The article is mildly interesting by itself)

------
vvde
Making sds a typedef for char* is very convenient. But it makes it very easy
to pass an sds to a function that expects a C string without checking for null
bytes.

Ruby, Java, Perl, PHP have all had security problems when interacting with C
because they failed to properly distinguish binary-safe strings and C strings.

[http://insecure.org/news/P55-07.txt](http://insecure.org/news/P55-07.txt)
[http://cwe.mitre.org/data/definitions/626.html](http://cwe.mitre.org/data/definitions/626.html)

~~~
kzrdude
I'd prefer a typesafe version (that would be a library with a struct type). It
could even be a trivial wrapper struct for the char *.

------
scottlamb
Disadvantage #1 seems unnecessary. Instead of this:

    
    
      s = sdscat(s,"Some more data");
    

Why not do this?

    
    
      sdscat(&s,"Some more data");
    

The latter would make the use-after-free error they're describing impossible.
(Disadvantage #2, changing one reference but not others, would remain. And
callers would still need to check for NULL if they intend to handle ENOMEM
gracefully.)

I assert there's no meaningful performance difference between the two.

~~~
paraboul
To be consistent between allocation and reallocation (like malloc() and
realloc()).

And you still want to access the old value if reallocation failed.

~~~
scottlamb
> To be consistent between allocation and reallocation (like malloc() and
> realloc()).

Consistency is a tool that is often useful to promote program correctness.
You're suggesting using it to accomplish the reverse.

If it's important aesthetically that all sds operations be consistent in this
regard, you can structure the allocation interface in the same way:

    
    
      sdserror sdsnew(sds*);
    
      sds mystring = NULL;
      sdsnew(&mystring, "Hello World!");
    

This is a common practice in relatively new interfaces like pthread_create.

> And you still want to access the old value if reallocation failed.

If you want to provide commit-or-rollback semantics, you could signal error
via return value rather than by replacing s with NULL:

    
    
      if (sdscat(&s,"Some more data") != SDS_SUCCESS) {
        /* failure path; s is unchanged */
      } else {
        /* s now has "Some more data in it" */
      }
    

but this may be completely useless, depending on the environment(s) in which
the library or program is intended to be used. On 64-bit Linux systems with
memory overcommit enabled (the default), this failure path essentially
shouldn't ever happen. Instead, some process (maybe yours, maybe not) will be
picked by the kernel OOM killer. Many programs just use an allocate-or-die
interface, as the sds README mentions.

~~~
paraboul
You're right, I wasn't saying that it's the only way to write
allocation/assignment. I was saying that in that case sdscat() matches with
sdsnew() style.

------
numeromancer
NB: In the following function:

    
    
        /* Like sdscatpritf() but gets va_list instead of being variadic. */
        sds sdscatvprintf(sds s, const char *fmt, va_list ap) {
            va_list cpy;
            char *buf, *t;
            size_t buflen = 16;
        
            while(1) {
                buf = malloc(buflen);
                if (buf == NULL) return NULL;
                buf[buflen-2] = '\0';
                va_copy(cpy,ap);
                vsnprintf(buf, buflen, fmt, cpy);
                va_end(cpy); // <--- add this ----
                if (buf[buflen-2] != '\0') {
                    free(buf);
                    buflen *= 2;
                    continue;
                }
                break;
            }
            t = sdscat(s, buf);
            free(buf);
            return t;
        }
    

From the `man va_copy` on my system:

    
    
        Each  invocation  of  va_copy()  must  be  matched  by  
        a  corresponding invocation of va_end() in the same
        function.
    

This is not likely to be a problem in most systems, but it can't hurt to be
formally correct.

~~~
fmela
I think you meant "va_end(ap);".

Created a pull request for you:
[https://github.com/antirez/sds/pull/8](https://github.com/antirez/sds/pull/8)

~~~
numeromancer
No, I'm pretty sure that `va_end(cpy)` is correct. `ap` is the `va_list`
passed into the function. `cpy` is the local copy.

~~~
fmela
You're right. Fixed.

------
gtrubetskoy
I've had the pleasure of getting pretty deep into sds back when I was
tinkering with thredis, and it's definitely very cool.

------
acqq
The CString C++ class from Microsoft's MFC and now also ATL used since forever
the trick of having both the length and the reference counting in the single
allocation with the characters themselves, as well as additional 0-termination
character even if it was initialized with non-zero terminated character.

------
oh_sigh
In the case of disadvantage #1: s = sdscat(s,"Some more data");

You could fix that by adding a "remote pointer" header field. Inside of
sdscat, you would allocate a new sds struct, and set the remote pointer header
field in `s` to the new sds structs location. You could also try to do a
realloc and maybe youll get the same starting pointer back again.

------
mbreese
How much better is this dynamic approach than something like immutable
strings? (not that char* is immutable)

Since with the sds library, you could potentially be getting back a new
pointer for each operation, you could just as easily be working from an
immutable string library. Do immutable strings have poor performance? I've
never really considered it.

------
sparkie
Disadvantage #1 isn't really a disadvantage in modern computing, it's _the
right thing to do._

~~~
aidenn0
I strongly disagree. It is a tradeoff between that and advantages #1 and #3;
with the bstring library, for example, you pass in a mutable string and the
function mutates it.

In no world is passing in a mutable value, having it mutate it, but then
having to reassigning your variable superior.

Passing in an immutable value, and then assigning a fresh value is reasonable,
but that's not what SDS is doing, AFAICT.

I think gcc has some decorations you can have to tell it to never ignore the
return function.

I'm curious why s = sdscat(s,"Some more data"); didin't end up like:
sdscat(&s,"Some more data");

------
angersock
So, this is an interesting approach to strings:

It's basically a malloc()/free() implementation with printf, some formatting,
and strcat bolted onto it. (Strictly speaking, it may or may not use free
lists or whatever, but the use of the header and returned pointer is quite
similar.)

And that's awesome.

------
codys
ustr [1] is another library to consider. It uses a variable sized header to
avoid having large (percentage wise) overhead for short strings.

1: [http://www.and.org/ustr/](http://www.and.org/ustr/)

------
wildmXranat
I remember lifting the 'sds' string implementation from within the redis back
in the day. Thanks antirez

------
wereHamster
How does it compare to strbuf that is used in git?

