Generic dynamic array in 60 lines of C (gist.github.com)
91 points by nice_byte on Feb 28, 2023 | 104 comments



DYN_ARR_RESET should probably be called DYN_ARR_INIT instead, as calling it more than once will leak memory.
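
A minimal sketch of the distinction (illustrative macro bodies, not the gist's exact code): an init can assume nothing is allocated yet, while anything meant to run more than once has to free the previous buffer first.

  #include <stdlib.h>

  /* Illustrative only -- not the gist's macros. */
  #define DYN_ARR_INIT(a, cap)                                \
    do {                                                      \
      (a).data = malloc(sizeof((a).data[0]) * (cap));         \
      (a).endptr = (a).data;  /* empty: end == start */       \
      (a).capacity = (cap);                                   \
    } while (0)

  /* Safe to call repeatedly: releases whatever was there before. */
  #define DYN_ARR_DESTROY(a)                                  \
    do {                                                      \
      free((a).data);                                         \
      (a).data = (a).endptr = NULL;                           \
      (a).capacity = 0;                                       \
    } while (0)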

The handling of endptr in DYN_ARR_RESIZE seems to be incorrect. If I have an array with 2 elements and capacity of 3 and I DYN_ARR_RESIZE it to 5, I now have an array with 5 elements, 3 of which are garbage values.


~~Isn't the handling of `endptr` incorrect in `DYN_ARR_RESET` as well? `a.endptr = a.data;` feels very incorrect to me.~~

Nevermind, I see how it's supposed to work now.

To your point, `RESIZE` is definitely incorrect. It should be checking the length before calling `realloc`.


That's just idiomatic C. Set the "pointer to end" to be equal to the "pointer to start".


The values are garbage because they haven't been initialized yet. Other code may write to those locations.


> The handling of endptr in DYN_ARR_RESIZE seems to be incorrect. If I have an array with 2 elements and capacity of 3 and I DYN_ARR_RESIZE it to 5, I now have an array with 5 elements, 3 of which are garbage values.

Why do you say this is wrong? That's exactly what I would expect.


> Why do you say this is wrong?

Exposing uninitialized memory as valid values tends to be a really bad idea.


that's the intended behavior.

this somewhat mirrors what happens with a std::vector when you call resize() except of course c doesn't have default ctors. the point of resize is to make exactly "size" elements of the dynamic array addressable.


DON'T USE THIS

As fpoling points out, capacity is a 32-bit unsigned. That can overflow. There is no safety in append: if capacity is zero, no new space will be made; if capacity overflows, the realloc will not have enough space -- in either case, you end up writing past the end.

Allocating a [capacity] zero array and appending to it is extremely common. That you'll write past the end in that case shows that the author has barely even used this.

The code is unacceptably bad and unsafe. I don't usually do this, but I'm going to flag this post. I encourage everybody to do the same.

Apologies to the author. Golf is fun. This is uncool.

edit: changed size to [capacity]


Is this overflow really a realistic scenario? I imagine most machines won't even permit you to allocate the sorts of sizes that would cause this bug.

In that scenario (where you're allocating a multi-gigabyte array), you typically know the size ahead of time instead of growing it dynamically, as the latter doesn't really perform too well.


2Gb (1<<31 + 1 = boom) just isn't that much memory these days. If you're parsing large text files, for example, it's really easy to blow that limit. And what's the point of this dynamically-growing array if not to forget about managing capacity?
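
A small illustration of the wrap (assuming the 32-bit unsigned capacity field mentioned above; unsigned overflow is well-defined, it just silently wraps to zero):

  #include <inttypes.h>
  #include <stdint.h>
  #include <stdio.h>

  int main(void) {
    uint32_t capacity = UINT32_C(1) << 31;  /* 2^31 elements */
    capacity <<= 1u;                        /* unsigned wrap: now 0 */
    printf("%" PRIu32 "\n", capacity);      /* prints 0 -> realloc for 0 bytes, then out-of-bounds writes */
    return 0;
  }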


I have been using this for years. Capacity is not size. A zero-size array will have non-zero capacity. All C code is unsafe.


Where do you check that capacity is nonzero? When capacity is zero, what happens on this line?

     a.capacity <<= 1u;
"all c code is unsafe" is not an excuse to permit bloody obvious, undocumented memory overruns.

I write a lot of C. Avoiding the unsafe bits, avoiding UB, is the skill required to write good C. "C code is unsafe" is a Rustacean marketing slogan. Don't believe it, but definitely don't practice it.


just don't initialize it with 0 capacity :-)


Why not just fix it? It's a trivial fix and has the added benefit that a zero-initialized DYN_ARR_OF(x) struct would be in a valid state, which is always nice. A struct with several dynamic arrays is then much simpler to initialize, for example.
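
A hedged sketch of one possible fix (the data/endptr/capacity names are taken from the thread; the macro body is illustrative, not the gist's code, and it assumes capacity has been widened to size_t):

  #include <assert.h>
  #include <stdlib.h>

  /* Illustrative only -- not the gist's macro. Handles capacity == 0,
   * grows geometrically, and recomputes endptr because realloc may move
   * the buffer. A production version would also guard against overflow. */
  #define DYN_ARR_APPEND_SAFE(a, v)                                       \
    do {                                                                  \
      size_t len_ = (a).data ? (size_t)((a).endptr - (a).data) : 0;       \
      if (len_ == (a).capacity) {                                         \
        size_t newcap_ = (a).capacity ? (a).capacity * 2u : 8u;           \
        void *tmp_ = realloc((a).data, sizeof((a).data[0]) * newcap_);    \
        assert(tmp_ != NULL);                                             \
        (a).data = tmp_;                                                  \
        (a).endptr = (a).data + len_;                                     \
        (a).capacity = newcap_;                                           \
      }                                                                   \
      *(a).endptr++ = (v);                                                \
    } while (0)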


For a library, if the user input is wrong, throw an error or an assertion failure. It shouldn't cause an uninformative segmentation fault, which is what your library does. The fix is trivial. Why so defensive?


I don't like the use of macros for things like this. Macros in general should be used rarely and sparingly.

I'm also not certain one should use "end pointers." Conventionally, it seems more advisable to use `size_t capacity`, `size_t length`, and `void *data`.

Great use of Cunningham's Law, though! I appreciate C posts on Hacker News.


One should always use size_t to keep lengths. Reallocating the array invalidates all pointers but does not invalidate offsets.


Great point, thank you. I've never actually heard it expressed this way before, but now I have another reason to prefer explicit lengths.


A concrete example that I've been working on recently: a parser/lexer backed by an input buffer. Its state looks like this:

  struct parser {
    struct {
      unsigned char *buffer;
      size_t capacity;
    } buffer;
    struct {
      size_t read;
      size_t write;
    } position;
  };
The lexer makes progress by consuming bytes from the buffer, incrementing the read position. The buffer is reallocated whenever such a read would overrun the write position: I/O functions are called to fill up the buffer and push the write position further ahead.

Had they been pointers into the [buffer, buffer + capacity[ range, they would have become invalid whenever the buffer is reallocated for expansion, since its location in memory would change. So I chose to implement the read and write heads as offsets from the buffer's base address, calculating the actual pointers only when they're needed.
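
For illustration, the heads are then derived on demand (helper names here are made up):

  /* Hypothetical helpers: compute the heads from the offsets each time,
   * so a realloc of buffer.buffer never leaves a stale pointer behind. */
  static unsigned char *read_head(struct parser *p) {
    return p->buffer.buffer + p->position.read;
  }

  static unsigned char *write_head(struct parser *p) {
    return p->buffer.buffer + p->position.write;
  }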


Amusingly, Ward Cunningham denies inventing Cunningham's law and I feel compelled to correct that potential misunderstanding.


How meta...


The alternative in C is something like GLib's array container, which hides types. I dislike that more.


Yes, but GLib arrays are also lossy. They require you to externally manage capacity or to know with perfect foresight exactly what your capacity will be for the lifetime of the memory allocation.

I don't think those are impossible scenarios, but the cost of one additional pointer's worth of size buys you a lot more functionality. Saving one pointer in size and not having capacity makes GLib arrays nearly useless, which I find confusing.

You could simply pass a pointer and a size around instead. Which is what most people actually do when they don't need resizable data layouts.

If you're working with C, I think you have to just accept that void pointers happen. Working around losing compile-time type data requires you to create runtime structures, which I don't find acceptable.


What does it actually gain by using macros instead of proper functions? The only generic macro that can't be written as a function is the one that uses `type`, but saving the element size in the struct is enough for this kind of code.
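
A sketch of that function-based alternative (illustrative code, not from the gist): record the element size in the struct and push through memcpy, at the cost of the compile-time type checking the macro version keeps.

  #include <stdlib.h>
  #include <string.h>

  typedef struct {
    void  *data;
    size_t len, cap, elem_size;  /* element size recorded once at init */
  } dyn_arr;

  static int dyn_arr_push(dyn_arr *a, const void *elem) {
    if (a->len == a->cap) {
      size_t newcap = a->cap ? a->cap * 2 : 8;
      void *tmp = realloc(a->data, newcap * a->elem_size);
      if (!tmp) return -1;               /* caller decides how to handle OOM */
      a->data = tmp;
      a->cap = newcap;
    }
    memcpy((char *)a->data + a->len * a->elem_size, elem, a->elem_size);
    a->len++;
    return 0;
  }
Usage would be along the lines of `dyn_arr a = {0}; a.elem_size = sizeof(int);` and then `dyn_arr_push(&a, &x);`.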


I think this author should learn about the "do {} while(0)" trick.


What's the benefit of do {...} while (0) vs just {...} ?


You can put it inside a braceless if/else statement and follow it with a semicolon.

Compare:

  #define FOO() { ... }
  #define BAR() do { ... } while (0)
In the former case:

  if (x)
    FOO();
  else
    ...;
  
  // becomes
  if (x) {
    ...
  }
  ;
  else
    ...;
In the latter case:

  if (x)
    BAR();
  else
    ...;
  
  // becomes
  if (x)
    do {
      ...
    } while (0);
  else
    ...;


The `{...}` version makes the macro usable without a trailing semicolon, which some people don't like; the `do {...} while (0)` version will cause a syntax error if the semicolon is missing, so they feel that it regulates syntax a bit better.


Works if you want to e.g. skip braces in if...else statements. Some would say always putting braces in is good style, but I like skipping them myself, so I won't defend it :)


I agree it's good style, but surrounding code using bad style is still a poor excuse for writing macros that break.


do {…} while (0) is a statement, so it can be put between “if (…)” and “; else”, while just {…} would result in a syntax error because the extra semicolon leaves the else without a matching if.


OK, but I guess what's really being relied on there is that it's syntactically still a single statement when you follow it with ';', whereas {...}; is two statements (which is what breaks the if-else).

I just discovered that gcc also supports non-standard "statement expressions" of the form ({...}) which would serve the same purpose at the cost of portability.
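
For reference, a statement expression evaluates to its last expression, so unlike do/while(0) it can also yield a value (a made-up example; GCC/Clang extension, not standard C):

  /* Non-portable GCC/Clang extension: the ({ ... }) block is an expression
   * whose value is that of the last statement inside it. */
  #define CLAMP_NONNEG(x) ({ int v_ = (x); v_ < 0 ? 0 : v_; })

  int example(int n) {
    return CLAMP_NONNEG(n - 10);  /* usable anywhere an expression is expected */
  }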


This seems pretty sensical to me except that it achieves its small size by sacrificing error checking and handling, even in the (aiui) non-exceptional case of realloc() returning NULL. This is of course classic C program behaviour so perhaps that's fine ;-)


Heavy use of macros to do metaprogramming is a strong sign it's time to move to a more powerful language.


On one hand that's obviously true, on the other hand it's just 60 lines of code, which is easier to check and audit than (for instance) a typical std::vector implementation - I don't know what the equivalent is in D, please forgive my ignorance :)


A typical std::vector implementation supports a lot more things, so it's not an apples-to-apples comparison. It's like saying that a horse-drawn carriage is a lot easier to repair than a motor vehicle. It's true, but doesn't really say much.


I don't have this luxury if I want to learn about the ffmpeg libraries: it's all written in C, almost all media and hardware accelerator libraries are written in C, and if you go to another language they just bind to C. It's so frustrating, but I have no choice but to stick with C.


Most languages have the ability to call C functions. Use the language that helps you write your code and convert when calling the third-party API.

For example, a C++ vector provides a generic container, and the code can still call C functions with the underlying array.

This is in the "Doctor, it hurts when I do X" bucket.

https://stackoverflow.com/questions/2923272/how-to-convert-v...


Yes, but if you're writing code that you want to be bindable to other languages, you essentially have to follow the C ABI, in which case, writing in C is a natural choice (yes, you can expose a C ABI from other languages, but that's usually not natural and you sort of have to understand C anyway to do that).


Unless your API surface constitutes the majority of your code, writing in C still doesn't make sense, because you're paying the tax every time you have to manually juggle strings and whatnot. C++ is a more sensible choice for this - you can still write C-compatible headers (and it's trivial to verify in builds - just include it in a .c file and see if that compiles!) while using STL etc in the implementation.
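
A sketch of such a header (hypothetical names): it compiles as plain C, and the extern "C" guard keeps the C++ implementation's symbols unmangled for C callers.

  /* widget.h -- hypothetical C-compatible API; the .cpp behind it can use the STL */
  #ifndef WIDGET_H
  #define WIDGET_H

  #include <stddef.h>

  #ifdef __cplusplus
  extern "C" {
  #endif

  typedef struct widget widget;          /* opaque handle */

  widget *widget_create(void);
  int     widget_push(widget *w, int value);
  size_t  widget_count(const widget *w);
  void    widget_destroy(widget *w);

  #ifdef __cplusplus
  }  /* extern "C" */
  #endif

  #endif /* WIDGET_H */
The "include it from a .c file" trick mentioned above then catches any accidental C++-isms in the header.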


Yeah if you're using opaque handles (where what you have is really a pointer to some C++ thing), this can work well.

If you're exposing structs that are passed back and forth for state, it's a bit more cumbersome.


Mixing high-level C++ stdlib classes with low-level C APIs is often not exactly trivial. For instance, most C++ containers expect their items to be RAII-compatible, and the resulting memory corruption errors will be harder to debug than in plain old C code because the shit hits the fan deep inside some completely unreadable C++ stdlib implementation code.


Huh? What's wrong with using the higher-level languages if they take care of binding to the C parts for you?

(I know these are few and far between for ffmpeg, too many leaky abstractions.)


Usually increased build system complexity, outdated manually maintained bindings, etc etc... - in some languages (like Zig) this is really trivial, in other languages it can be much harder.


Try to find ffmpeg bindings in Go or Rust that are less than 5 years old and complete. How about Intel libva?


D code can call C functions directly.


C + a set of macros is a more powerful programming language.


Also a far worse language though. Macros have their place, but trying to do anything complex with them just turns into a total nightmare. They're hard to reason about, walk through, or modify.

If heavy macro usage is found, it's definitely time to reconsider the approach.


It really depends on how macros are used. If you're just using them to implement "high level language features", it's not a problem; sure, you might have trouble figuring out what STAILQ_INSERT_TAIL does internally, but you're going to have just as much trouble figuring out what the Lisp or Perl or Python "add this item to the end of that list" operations do internally.

Macros can be a nightmare, but when they're used properly they're not.


> when they're used properly they're not

Yeah, they are.

Source: Decades of C programming


You're not the only person with decades of experience.


We all suffer from this, so it's not personal, but one is too often blind to the shortcomings of one's own code.


You must be aware that I'm not the author of BSD sys/queue.h.


If it were me, I'd use inline functions instead of macros.


C is powerful enough.


Lisp?


When it comes to macros, accept no substitute.

Granted, Lisp already has arrays, so not much reason to reinvent the wheel.

  (make-array '(2 6) :initial-element 0 :element-type '(unsigned-byte 32))
  ;; => #2A((0 0 0 0 0 0) (0 0 0 0 0 0))


Yes! :-)


Yeah, that's a great opinion and all, but generally C is used because it HAS to be used these days. No one is going to use this code for building a web application or toss it into legacy code.


awesome! i do the same thing for arrays[1] and maps[2].

stuff like this is great when you are trying to find the performance ceiling of some workload in c/cpp. literally nothing to hide.

1. https://github.com/nathants/bsv/blob/master/util/array.h

2. https://github.com/nathants/bsv/blob/master/util/map.h


The code has integer overflow even on 64-bit CPUs due to using uint32_t for capacity.


I like that the comments mention another two implementations of a similar idea and all are a bit different.

Here's one I wrote a few years ago: https://github.com/kiryk/mlisp/blob/master/vector.c

You may find it dirty, but it can be wrapped using macros, and used both on LHS and RHS like in "string(v, i) = ch"

(the macros BTW) https://github.com/kiryk/mlisp/blob/71d028738d2f9607fa7ce8c1...

(and a usecase) https://github.com/kiryk/mlisp/blob/71d028738d2f9607fa7ce8c1...


Not an entirely uncommon idea. I've written one.

There's also a well-known one here, in klib: https://github.com/attractivechaos/klib/blob/master/kvec.h


Probably many have written a vector header like this. It baffles me why this one reached the front page of HN given that the implementation is not that great. A few problems (some have been mentioned by others as well):

1. not using "do {} while (0)". This may lead to compile errors.

2. using uint32_t for capacity. On a 64-bit machine, this doesn't save memory.

3. DYN_ARR_RESIZE() may have quadratic time complexity if someone calls DYN_ARR_RESIZE(a, 10); DYN_ARR_RESIZE(a, 11); DYN_ARR_RESIZE(a, 12); etc. in a loop. kv_resize() in kvec.h wouldn't have this problem (see the sketch after this list).

4. Segfault if capacity is 0.
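
On point 3, one way to avoid the quadratic pattern (a generic sketch, not the gist's or kvec.h's exact code) is to round the requested size up to a geometric growth step before reallocating:

  #include <stddef.h>

  /* Illustrative helper: grow at least geometrically so a loop of
   * resize(10), resize(11), resize(12), ... triggers O(log n) reallocs
   * instead of one realloc (and potential copy) per call. */
  static size_t round_up_capacity(size_t current, size_t wanted) {
    size_t cap = current ? current : 8;
    while (cap < wanted)
      cap *= 2;
    return cap;
  }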


It’s better that the code isn’t perfect. Reading the code review comments in this thread has been really valuable for me personally. Like, I’m not sure I would’ve learned about the do-while trick had so many people not pointed it out.


Who hasn't? Almost any C program dealing with strings and collections has to have its own implementation or import one.

Part of the reason why C developers "feel" productive, but can't produce anything of meaningful complexity.


>Part of the reason why C developers "feel" productive, but can't produce anything of meaningful complexity.

https://quoteinvestigator.com/2010/05/17/remain-silent/


You hurt me with the truth.

Anyway, I think this HN posting has this "crazy" flavor of using macros to simulate generics, and that's the specific kind of implementation that I meant, which klib also does.


I think sqlite is fairly complex


Have you ever heard of Linux?


til the Linux kernel has no meaningful complexity


Still waiting for a revolutionary OS written in $virtuous_lang to save us from the scourge of unsafe Linux. They've had since Ada-83 to get something off the ground. Guess they aren't so productive either.


Counterpoint to people suggesting size_t for capacity: we recently changed our in-house std::vector replacement from { T* first, last, end; } to { T* first; u32 size, capacity; }.

Not only did this shave 8 bytes off each instance, but in many cases it saved more because instances became 16 bytes, which reduced alignment padding in other objects that use SIMD types requiring 16-byte alignment.

This saved 40 MB of memory for us in https://warframe.fandom.com/wiki/Orb_Vallis which, given the poverty of our mobile minspec, was a terrific savings.

Your needs may vary, but it's 2023 and we're still cramming 4 GB of poop into a 2 GB bag.
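
A quick size check of the two layouts (hypothetical struct names; the numbers assume a typical 64-bit ABI, so exact padding may differ):

  #include <stdint.h>
  #include <stdio.h>

  struct vec_three_ptrs { int *first, *last, *end; };        /* 3 x 8 = 24 bytes */
  struct vec_ptr_u32    { int *first; uint32_t size, cap; }; /* 8 + 4 + 4 = 16 bytes */

  int main(void) {
    printf("%zu %zu\n", sizeof(struct vec_three_ptrs), sizeof(struct vec_ptr_u32));
    return 0;  /* typically prints "24 16" on 64-bit targets */
  }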


In game development, you'll want to do things like this often, but general purpose programming shouldn't adopt this practice. It's literally what the type is for.


yes, i know growth by a factor of 2 has issues under certain usage patterns. those are less likely to be problematic when there is more stuff going on than just growing the buffer - you can use the "hole" for other allocations.


Hard to tease out what you're saying here, but maybe a Hashed Array Tree (HAT) is what you're referring to? I have a slightly extended version here [0][1]

[0] https://rkeene.org/viewer/tmp/hat.c.htm [1] https://rkeene.org/viewer/tmp/hat.png.htm


I think GP meant that his implementation doubles the capacity of the array on resize as opposed to increasing by 1.5 or some other factor.


The biggest problem with this is that you cannot assign, pass, or return an "instance" of the array struct itself. That's because it's declared as an unnamed struct and in C two unnamed structs are not type compatible even if they have identical fields. C23 did improve struct compatibility [1] but unfortunately not for unnamed structs.

[1] https://www.open-std.org/jtc1/sc22/wg14/www/docs/n3003.pdf


You can if you typedef it.


Clever. Seems so obvious in hindsight.


this one does RAII in C and it's most likely faster: https://bit.ly/3ICOplU


Even this one appears to leak memory on failed `realloc()`.


Any sufficiently complicated C program contains an ad hoc, informally-specified, bug-ridden, slow implementation of half of C++.


I dislike this - having the “array” be a struct containing the pointer and size fields makes it easy to “copy” the array such that you get dangling pointers. Similarly there’s no way to track lifetime or ownership of the array.

There are long term ABI benefits to these data structures just being opaque pointers outside of the implementing libraries


no reason for it to ever cross abi boundary.


Seriously, in this day and age, why are we still stuck with C? There are battle-tested container libraries in C++, for example.


Yep, once you find yourself writing containers that are found in C++, it's time to switch to it.

I've seen too many C developers re-write C++ containers in C because they are afraid of C++; it's madness.


C++ is annoying because of name mangling and things like static initialization calling constructors.

The result is that you can easily link C code to almost any language, including C++, with almost any linker. But for C++, you usually have to use the linker that comes with the C++ compiler that compiled your library. You can write code in C++ that is as compatible as C, but you have to go out of your way to achieve that, extern "C" is only the beginning.

As a result, when the overhead of using C instead of a more complete language is not too great, I prefer to write my libraries in ANSI-C, for maximum compatibility.


Indeed, C++ is madness. Unsafe iterators, insane template language compared to cpp


The post demonstrated one thing: don’t share your code.

Not to say I think this particular implementation is my favorite, but why anybody should expect a free online code snippet to be bulletproof is beyond me; it is a damn gist, a demo.


This isn't C, this is preprocessor. Tomato tomato, who cares if it is 60 lines or 600.

https://doc.rust-lang.org/src/alloc/vec/mod.rs.html#400

https://docs.rs/containers/latest/src/containers/collections...


https://www.open-std.org/jtc1/sc22/wg14/www/docs/n1124.pdf

Section 6.10 (Page 145): Preprocessing Directives

Hey, would you look at that! The preprocessor is a mandatory part of the language!


Why use macros instead of inline functions?


Because the API needs to work for random item types. C doesn't have templates (and for stuff like this C11's _Generic isn't helpful because that just maps specialized function names to generic function names, but doesn't help with the generic implementation code).


Inline functions can't be made type safe (or at least, as type safe as C gets) over generic arrays.


  a.capacity <<= 1u;
  decltype(a.data) tmp = (decltype(a.data)) realloc(a.data, sizeof(a.data[0]) * a.capacity);
  assert(tmp != NULL);
That's not ideal. Imagine you are at 32 GB capacity; the next realloc will ask for 64 GB, which is pretty excessive.


This isn't suitable for arrays that large regardless of the resizing multiplier.

realloc() generally allocates a new block of memory of the new size, copies the entire contents over, and then frees the old block. Only sometimes do you get lucky and have enough spare room after the existing data in the virtual address map to avoid that copy.

By the point this copying becomes relevant, a continuous dynamic array like this becomes fundamentally the wrong data structure.


My understanding is that this is the typical behavior of dynamic arrays


Only for bad dynamic array implementations.

Good ones might grow by the golden ratio, and enlarge by a fixed block size once the array is large.
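
For example, with a growth factor below the golden ratio the sum of previously freed blocks eventually exceeds the next request, so the allocator can in principle reuse that space; strict doubling never allows that. A sketch of a 1.5x step:

  #include <stddef.h>

  /* Sketch: grow by ~1.5x rather than 2x so earlier freed blocks can
   * eventually be reused for a later allocation. */
  static size_t next_capacity(size_t cap) {
    return cap < 8 ? 8 : cap + cap / 2;
  }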


Macros are evil. I will stick with Zig for the moment.


data (startptr), endptr & capacity?! Memory might be cheap enough to waste, but more derefs do not help timings.


Three pointers (or two and a size, or one and two sizes) is pretty standard for dynamic arrays. It lets you allocate more than one element at a time when repeatedly appending one element.

No checking malloc or realloc though. The anonymous struct is dubious too, though maybe being unable to pass these things to functions is a feature.


functions can work with just pointer / size / stride. think of it as stl iterators, there's no real point to be passing the container itself around.

however, you can still do that, if you really want.

just `typedef DYN_ARR_OF(widget) widget_array;` and now you have a name-able type, and can even have dynamic-arrays-of-dynamic-arrays (`DYN_ARR_OF(widget_array)`).


Capacity may not be equal to size.



