
Implementing a dynamically sized array in C - alinajaf
http://happybearsoftware.com/implementing-a-dynamic-array.html
======
klodolph
Nitpicks:

* Typo in the header file (SIZE versus CAPACITY).

* No error checking on malloc() or realloc() return.

* Out of bounds access is a programmer error, so you should call abort() and not exit(). Alternatively, assert() is fine too.

* Potential integer overflow for capacity field (size field can't overflow because capacity overflows first). On a 64-bit non-ILP64 system, resizing to a capacity of 2^31 elements will request a buffer of 0xfffffffe00000000 bytes, assuming sizeof(int) is 4.

* A 32-bit version could theoretically get an overflow, but only if a 2^31 byte allocation is successful... the next larger allocation will be for 0 bytes.

Edit: As someone pointed out in the comments, arr[271] is the 272nd element,
whereas the article claims it is the 270th element.

Further notes:

If you end up implementing a lot of these, invest in a consistent naming
scheme for the fields. For example, map, mapsz, mapalloc; then data, datasz,
dataalloc.

You can also try writing generics in C but I strongly recommend just switching
to C++ (or some other language) if the boilerplate code bothers you. I believe
that std::vector was the biggest reason for GCC's switch to C++ internally.

~~~
fruchtose
I've read that malloc(3) will not always return errors when you expect them to
[1]. Any comment on this?

[1] <http://news.ycombinator.com/item?id=4735589>

~~~
klodolph
I expect it to fail when it runs out of address space, or when address space
fragmentation prevents it from fulfilling the request. Isn't that exactly when
it fails?

------
RegEx
Fun read :)

I'm not sure if this is correct:

> By elements we mean the size of the type of the array. If an array holds
> elements of type int which each occupy four bytes, then integer_array[5000]
> is the same as
    
    
        *(integer_array + (5000 * 4)).
    

And that directly contradicts the earlier statement

>with subscripts e.g. my_array[271] would give us the 270th element. This by
the way, would do exactly the same thing as

    
    
        *(my_array + 271).
    

(which also may be incorrect: my_array[271] would give you the 272nd element
if you count my_array[0] as the "first" and 271st element if you count
my_array[0] as the "0th".)

I dug up the relevant part in the C standard just to be sure:

> The definition of the subscript operator [] is that E1[E2] is identical to
    
    
        (*((E1)+(E2))).

~~~
greenyoda
You're right, the article is wrong. If integer_array is an array of ints or a
pointer to an int, then integer_array[5000] is equivalent to

    
    
      *(integer_array + 5000), not
    
      *(integer_array + (5000 * 4)),
    

since pointer arithmetic in C automatically adjusts for the size of an array
element.

------
skrebbel
if you don't have a not-invented-here symptom, you could also consider re-
using the excellent stb stretchy buffers by the excellent Sean Barrett:

<http://nothings.org/stb/stretchy_buffer.txt>

~~~
apaprocki
Or simply use GLib's GArray:

<http://git.gnome.org/browse/glib/tree/glib/garray.c>

------
voidlogic
This is like 100 level C-S stuff, why is this on hacker news?

~~~
niggler
My guess: most people on HN don't have basic C competency.

~~~
voidlogic
That is pretty scary (and possibly true). If a hacker does not have basic C
competency, they probably do not understand how their code, C or otherwise is
executed...

I'm all for C-S programs having a focus on teaching how to learn, frameworks
of understanding and theory, but there need to be a few practical lessons
along the way, and C competency is one of those.

Maybe things have changed greatly since my undergrad but, the capstone of my
undergrad C-S program was writing a C(lite) compiler for MIPS in C. This is in
a program where for 60% of the classes the language in use was Java. I think
this was a fantastic capstone choice.

Students were expected to have deep understanding of C and Java and a working
knowledge of Scheme, Haskell, Prolog with HTML/CSS/Javascript never taught due
to the expectation you teach them to yourself on the way like everyone else.
Depending your electives you might have also picked up C#, Python and
Objective-C.

------
gsg
Allocating memory on initialisation is a bit questionable. It's better to
alloc no memory and let the growth logic handle that case (either by checking
for zero or by including an offset, such as vec->capacity += vec->capacity +
2).

The reason is code like this:

    
    
        vector vec;
        vector_init(&vec);
        for (i = 0; i < n; i++)
    	   vector_append(vec, some_operation(n));
        whole_vector_operation(&vec);
        vector_free(&vec);
    

It's both natural and desirable for this code to do no allocation when n = 0.

Since this is C, you might also include a macro to allow for static
initialisation: #define VECTOR_INIT { 0, NULL, NULL } or something.

------
shadowmint
Weird implementation. If you want an magical 'append' function, use a stack,
or a linked list.

There's no reason you'd want to 'append' to an array.

(Imagine the situation you remove an item in the middle; what is append
supposed to do now? What is remove supposed to do? The append function in this
example is basically just 'make_me_bigger')

Still, if you did want to, I'd be doing this to reduce verbosity:

    
    
        #define _V(a) ((int *) &a)
        struct Vec { 
          int *data;
          int size;
          int offset;
        }
    
        struct v;
        _V(v)[5] = 10;
        int value = _V(v)[5];
    

You only need two special calls; append() (aka. make this array longer) and
free().

~~~
klodolph
The _V() macro horrifies me. It's not remotely safe, nor is it even slightly
correct. From least important to most important,

1\. The _V() macro is in the compiler namespace, which is a no-no. You can't
legally use identifiers that start with underscores which are followed by
upper-case ASCII characters.

2\. The _V() macro performs no type checking. It takes any addressable object
as an argument. You can't do _V(3), but you _can_ do _V(printf).

3\. The _V() macro scribbles on memory.

    
    
        #define V(a) ((int *) &a)
        struct Vec { int *data; int size; int offset };
    
        struct v v; // don't forget the second v
        _V(v)[5] = 10; // this scribbles on the stack
    

What the macro does is take the address of the Vec structure (or anything else
addressable, since it does no type checking) and pretends that that address is
the address of an integer array.

The stack:

    
    
        |            |
        +------------+
        | int *data; | <---- &v, (int *) &v
        +------------+
        | int size   |
        +------------+
        | int offset |
        +------------+
        | <stack>    |
        +------------+
        | <stack>    |
        +------------+
        | <stack>    | <---- &_V(v)[5] scribbles here
        +------------+
        |            |
    

On a 32-bit system, _V(v)[5] is three words after the Vec structure, somewhere
on your stack.

~~~
shadowmint
My bad, you are correct, that's not quite right. You'd do it like this:

    
    
      #include <stdio.h>
      #include <stdlib.h>
    
      #define _V(a) (*((int **) &a))
      struct Vec {
        int *data;
        int size;
      };
    
      int main (int argc, char** argv) {
        struct Vec v;
        v.data = (int *) malloc(sizeof(int) * 10);
        v.size = 10;
    
        printf("%lx <-- actual\n", (long) v.data);
        printf("%lx <-- cast\n", (long) _V(v));
    
        for (int i = 0; i < 10; ++i) {
          _V(v)[i] = i;
        }
        for (int i = 0; i < 10; ++i) {
          printf("%d was: %d\n", i, _V(v)[i]);
        }
    
        free(v.data);
        return 0;
      }
    

...

    
    
        roar:~$ valgrind ./a.out
        ==30957== Memcheck, a memory error detector
        ==30957== Copyright (C) 2002-2011, and GNU GPL'd, by Julian Seward et al.
        ==30957== Using Valgrind-3.7.0 and LibVEX; rerun with -h for copyright info
        ==30957== Command: ./a.out
        ==30957==
        51f1040 <-- actual
        51f1040 <-- cast
        0 was: 0
        1 was: 1
        2 was: 2
        3 was: 3
        4 was: 4
        5 was: 5
        6 was: 6
        7 was: 7
        8 was: 8
        9 was: 9
        ==30957==
        ==30957== HEAP SUMMARY:
        ==30957==     in use at exit: 0 bytes in 0 blocks
        ==30957==   total heap usage: 1 allocs, 1 frees, 40 bytes allocated
        ==30957==
        ==30957== All heap blocks were freed -- no leaks are possible
        ==30957==
        ==30957== For counts of detected and suppressed errors, rerun with: -v
        ==30957== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 2 from 2)

~~~
klodolph
So, what's the point? Why not just use...

    
    
        v->data[i] = x;
    

This is, drum roll... type checked by the compiler, unlike _V().

P.S. Stop casting the result of malloc(), it makes you look like you're trying
to write C++. There's no good reason for it in C, and there's a good reason
not to do it (because it can mask an error if you forget to include
<stdio.h>).

For this reason, and because it looks like you want generics, it sounds like
you want to write C++. Why not write C++? Then you can just overload
operator[] and go on your merry little way, without having to circumvent the
type checker with macros.

~~~
shadowmint
Who said anything about generics?

You could just #define v(a) (a->data) and be done with it, it's true.

------
zeteo
> // set a value at an arbitrary index

> // this will expand and zero-fill the vector to fit

> vector_set(&vector, 4452, 21312984);

Love to see a C article where we get some insight on how dynamic arrays might
work behind the scenes. But sparse arrays are best implemented as linked lists
or hash tables.

------
drudru11
There are much better libs out there that do this.

I always use the utarray and uthash stuff when I can.

~~~
tsahyt
I believe this article was mainly for educational purposes though.

------
halayli
This C code is bad for the following reasons:

1\. No alloc checks 2\. This module is limited to storing integers. 3\. The
type Vector is exposed to the user.

An alternative approach for contiguous dynamically sized array is to use
mmap/madvise.

~~~
jws
Strangely, alloc checks no longer work in Linux and possibly other OSes.
Because of overcommit you can have malloc give you an non-NULL result that
will fail when you try to write to it.

I still include them in my code, but I don't expect they will save me.

~~~
halayli
malloc failure isn't just bound to running out of physical memory. It can fail
because of fragmentation, with or without overcommit.

~~~
jws
Sure, but I still have to design for random segfaults on memory writes. I
guess I could write to all newly allocated pages and get a little more
assurance that the allocation worked. I could still be whacked by a
deduplicator, though unique writes would help.

~~~
halayli
I think when you run out of physical memory & swap, there's a bigger issue at
that point. Or that's what the designers of overcommit decided.

------
robomartin
It's really unfortunate to see so many pedantic posts on this thread.

The article author explored something that probably isn't obvious to most
programmers who didn't come up from low level work. And I am sure this isn't
on the radar for your average PHP, JS programmer. And that's fine.

No, the code isn't production ready. I didn't get the sense that this was the
article's intent. The author actually did a very nice job of explaining one
approach. Good article and very accessible . Ease up. Help him make it better
without being so condescending.

------
edanm
This is a great article for anyone wanting a "behind-the-scenes-look" at how
things work.

It's also great in reminding me: the fact that we've moved to automatical
memory managing languages is a HUGE LEAP FORWARD in computing, possibly the
biggest leap forward in the last few decades. The amount of code you _don't_
have to deal with, the amount of bugs you're _not_ making on a daily basis, is
astounding.

Thank God we're in the age of modern languages. Makes me wonder what the next
great leap forward in computing will be.

~~~
jxi
I wouldn't call automatic memory management a "huge leap forward" at all. Sure
it makes code a bit more concise and eliminates leaks (more or less), but it
makes your code less performant and just shifts the problem away from the
programmer to the VM instead. It's a leap into lazyness, if anything. It also
creates problems since now people think it's OK to "new" objects within a for
loop freely

You can probably tell by now that my goto language for any kind of large scale
application is still C++ w/ STL for safety :). For small hacks, obviously I'd
use python, js or ruby or something like that any day.

~~~
thangalin
On a slightly related note, a just-in-time compiler with a superb garbage
collector can be more performant in some situations due to its ability to
profile and optimize the code while the application is running.

Such runtime statistical analysis and real-time optimization is not possible
in an application that is compiled to native code.

For $189, I can slap a 240GB SSD in the machine to make software more
performant. The question you have to ask is: How much optimization can a
senior-level programmer accomplish for $189?

JDeveloper (an IDE) takes about 30 seconds to start on my co-workers'
computers. On my laptop, boasting a 240GB SSD, JDeveloper is ready in about 5
seconds.

See also: <http://stackoverflow.com/a/145122/59087>

And see also: <http://www.ibm.com/developerworks/java/library/j-jtp09275/>

~~~
burstmode
>On a slightly related note, a just-in-time compiler with a superb garbage
>collector can be more performant in some situations due to its ability to
>profile and optimize the code while the application is running.

No Garbage collector can be faster then a programm that produces no garbage at
all, because its programmer knew what he was doing.

>For $189, I can slap a 240GB SSD in the machine to make software more
>performant. The question you have to ask is: How much optimization can >a
senior-level programmer accomplish for $189?

This ist the typical academic way of optimization: Let's write some programm
and then throw hardware on it until it runs fast enough. In the real world,
ressources are limited. That $189 are needed for each install of your garbage-
collected software, not just YOUR personal develoment machine.

~~~
thangalin
[http://www.crucial.com/store/partspecs.aspx?imodule=CT032V4S...](http://www.crucial.com/store/partspecs.aspx?imodule=CT032V4SSD2)

SSDs are now $51. I don't see this as an academic means of optimization; I see
it as a practical business value. Why would I pay a developer to spend 8 hours
on optimizations when I can pay a developer to spend 8 hours on features and
bug fixes? Meanwhile, I can drop $50 today (as opposed to $200 8 days ago) and
get SSDs for the database, web server, and two developer machines?

If SSDs were down to $1 for 64 GB, would you still claim throwing hardware at
the problem is an academic solution?

------
capkutay
Seems relatively similar to implementing a dynamically resizable in Java
(assuming you're not allowed to use an array list). The only difference is the
malloc business, which isn't necessarily safe given that there's no check to
see if the pointer is null after the mem allocation. It seems like the only
critical piece is checking the array size, if it's too small, double the size
of the array.

------
zamio
Here is a nice implementation of dynamic generic arrays in C:

[https://github.com/twitter/twemproxy/blob/master/src/nc_arra...](https://github.com/twitter/twemproxy/blob/master/src/nc_array.c)

[https://github.com/twitter/twemproxy/blob/master/src/nc_arra...](https://github.com/twitter/twemproxy/blob/master/src/nc_array.h)

------
AYBABTME
I think it would be wise to add a `shrink` operation when the array ends up
being 1/4 full (shrink to 1/2 size). Otherwise you might end up hoarding
memory you don't need anymore. The operation would be called automatically
whenever you delete an item from the array, checking if a `shrink` is
appropriate.

------
joe_the_user
This approach pretty much depends on the underlying OS to do the reallocation
of memory.

It is usually a good idea to do this. At the same time, if performance is an
issue, you could look at what other allocation algorithms are available.

------
DrJokepu
There's really no need for the pointer inside the struct. You could just use a
flexible array member (introduced in C99) or a "struct hack" if you're still
using C89.

~~~
to3m
If you do this, the Vector could likely move when resized.

~~~
DrJokepu
True, the vector_append function would have to return a pointer to the new
location; this might or might not be a problem, depending on how your program
is structured.

------
jmspring
Memory pools, dynamic arrays, custom vector implementations, etc. Pretty sure
most have been done to death. Find the one that meets your usage and move
along.

------
era86
reminds me of my college days!

------
tuananh
any idea what causes this? the code run fines on Windows but when i tried on
os x (with sublime text), i got this error instead.

    
    
        symbol(s) not found for architecture x86_64

