Generic dynamic array in 60 lines of C (gist.github.com)
91 points by nice_byte on Feb 28, 2023 | 104 comments



DYN_ARR_RESET should probably be called DYN_ARR_INIT instead, as calling it more than once will leak memory.
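
A minimal sketch of the distinction (illustrative macro bodies, not the gist's exact code): an init can assume nothing is allocated yet, while anything meant to run more than once has to free the previous buffer first.

  #include <stdlib.h>

  /* Illustrative only -- not the gist's macros. */
  #define DYN_ARR_INIT(a, cap)                                \
    do {                                                      \
      (a).data = malloc(sizeof((a).data[0]) * (cap));         \
      (a).endptr = (a).data;  /* empty: end == start */       \
      (a).capacity = (cap);                                   \
    } while (0)

  /* Safe to call repeatedly: releases whatever was there before. */
  #define DYN_ARR_DESTROY(a)                                  \
    do {                                                      \
      free((a).data);                                         \
      (a).data = (a).endptr = NULL;                           \
      (a).capacity = 0;                                       \
    } while (0)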

The handling of endptr in DYN_ARR_RESIZE seems to be incorrect. If I have an array with 2 elements and capacity of 3 and I DYN_ARR_RESIZE it to 5, I now have an array with 5 elements, 3 of which are garbage values.


~~Isn't the handling of `endptr` incorrect in `DYN_ARR_RESET` as well? `a.endptr = a.data;` feels very incorrect to me.~~

Nevermind, I see how it's supposed to work now.

To your point, `RESIZE` is definitely incorrect. It should be checking the length before calling `realloc`.


That's just idiomatic C. Set the "pointer to end" to be equal to the "pointer to start".


The values are garbage because they haven't been initialized yet. Other code may write to those locations.


> The handling of endptr in DYN_ARR_RESIZE seems to be incorrect. If I have an array with 2 elements and capacity of 3 and I DYN_ARR_RESIZE it to 5, I now have an array with 5 elements, 3 of which are garbage values.

Why do you say this is wrong? That's exactly what I would expect.


> Why do you say this is wrong?

Exposing uninitialized memory as valid values tends to be a really bad idea.


that's the intended behavior.

this somewhat mirrors what happens with a std::vector when you call resize() except of course c doesn't have default ctors. the point of resize is to make exactly "size" elements of the dynamic array addressable.


DON'T USE THIS

As fpoling points out, capacity is a 32-bit unsigned. That can overflow. There is no safety in append: if capacity is zero, no new space will be made; if capacity overflows, the realloc will not have enough space -- in either case, you end up writing past the end.

Allocating a [capacity] zero array and appending to it is extremely common. That you'll write past the end in that case shows that the author has barely even used this.

The code is unacceptably bad and unsafe. I don't usually do this, but I'm going to flag this post. I encourage everybody to do the same.

Apologies to the author. Golf is fun. This is uncool.

edit: changed size to [capacity]


Is this overflow really a realistic scenario? I imagine most machines won't even permit you to allocate the sorts of sizes that would cause this bug.

In that scenario (where you're allocating a multi-gigabyte array), you typically know the size ahead of time instead of growing it dynamically, as the latter doesn't really perform too well.


2Gb (1<<31 + 1 = boom) just isn't that much memory these days. If you're parsing large text files, for example, it's really easy to blow that limit. And what's the point of this dynamically-growing array if not to forget about managing capacity?
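
A small illustration of the wrap (assuming the 32-bit unsigned capacity field mentioned above; unsigned overflow is well-defined, it just silently wraps to zero):

  #include <inttypes.h>
  #include <stdint.h>
  #include <stdio.h>

  int main(void) {
    uint32_t capacity = UINT32_C(1) << 31;  /* 2^31 elements */
    capacity <<= 1u;                        /* unsigned wrap: now 0 */
    printf("%" PRIu32 "\n", capacity);      /* prints 0 -> realloc for 0 bytes, then out-of-bounds writes */
    return 0;
  }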


I have been using this for years. Capacity is not size. A zero-size array will have non-zero capacity. All C code is unsafe.


Where do you check that capacity is nonzero? When capacity is zero, what happens on this line?

     a.capacity <<= 1u;
"all c code is unsafe" is not an excuse to permit bloody obvious, undocumented memory overruns.

I write a lot of C. Avoiding the unsafe bits, avoiding UB, is the skill required to write good C. "C code is unsafe" is a Rustacean marketing slogan. Don't believe it, but definitely don't practice it.


just don't initialize it with 0 capacity :-)


Why not just fix it? It's a trivial fix and has the added benefit that a zero-initialized DYN_ARR_OF(x) struct would be in a valid state, which is always nice. A struct with several dynamic arrays is then much simpler to initialize, for example.
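
A hedged sketch of one possible fix (the data/endptr/capacity names are taken from the thread; the macro body is illustrative, not the gist's code, and it assumes capacity has been widened to size_t):

  #include <assert.h>
  #include <stdlib.h>

  /* Illustrative only -- not the gist's macro. Handles capacity == 0,
   * grows geometrically, and recomputes endptr because realloc may move
   * the buffer. A production version would also guard against overflow. */
  #define DYN_ARR_APPEND_SAFE(a, v)                                       \
    do {                                                                  \
      size_t len_ = (a).data ? (size_t)((a).endptr - (a).data) : 0;       \
      if (len_ == (a).capacity) {                                         \
        size_t newcap_ = (a).capacity ? (a).capacity * 2u : 8u;           \
        void *tmp_ = realloc((a).data, sizeof((a).data[0]) * newcap_);    \
        assert(tmp_ != NULL);                                             \
        (a).data = tmp_;                                                  \
        (a).endptr = (a).data + len_;                                     \
        (a).capacity = newcap_;                                           \
      }                                                                   \
      *(a).endptr++ = (v);                                                \
    } while (0)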


For a library, if the user input is wrong, throw an error or an assertion failure. It shouldn't cause an uninformative segmentation fault, which is what your library does. The fix is trivial. Why so defensive?


I don't like the use of macros for things like this. Macros in general should be used rarely and sparingly.

I'm also not certain one should use "end pointers." Conventionally, it seems more advisable to use `size_t capacity`, `size_t length`, and `void *data`.

Great use of Cunningham's Law, though! I appreciate C posts on Hacker News.


One should always use size_t to keep lengths. Reallocating the array invalidates all pointers but does not invalidate offsets.


Great point, thank you. I've never actually heard it expressed this way before, but now I have another reason to prefer explicit lengths.


A concrete example that I've been working on recently: a parser/lexer backed by an input buffer. Its state looks like this:

  struct parser {
    struct {
      unsigned char *buffer;
      size_t capacity;
    } buffer;
    struct {
      size_t read;
      size_t write;
    } position;
  };
The lexer makes progress by consuming bytes from the buffer, incrementing the read position. The buffer is reallocated whenever such a read would overrun the write position: I/O functions are called to fill up the buffer and push the write position further ahead.

Had they been pointers into the [buffer, buffer + capacity[ range, they would have become invalid whenever the buffer is reallocated for expansion, since its location in memory would change. So I chose to implement the read and write heads as offsets from the buffer's base address, calculating the actual pointers only when they're needed.
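
For illustration, the heads are then derived on demand (helper names here are made up):

  /* Hypothetical helpers: compute the heads from the offsets each time,
   * so a realloc of buffer.buffer never leaves a stale pointer behind. */
  static unsigned char *read_head(struct parser *p) {
    return p->buffer.buffer + p->position.read;
  }

  static unsigned char *write_head(struct parser *p) {
    return p->buffer.buffer + p->position.write;
  }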


Amusingly, Ward Cunningham denies inventing Cunningham's law and I feel compelled to correct that potential misunderstanding.


How meta...


The alternative in C is something like GLib's array container, which hides types. I dislike that more.


Yes, but GLib arrays are also lossy. They require you to externally manage capacity or to know with perfect foresight exactly what your capacity will be for the lifetime of the memory allocation.

I don't think those are impossible scenarios, but the cost of one additional pointer's worth of size buys you a lot more functionality. Saving one pointer in size and not having capacity makes GLib arrays nearly useless, which I find confusing.

You could simply pass a pointer and a size around instead. Which is what most people actually do when they don't need resizable data layouts.

If you're working with C, I think you have to just accept that void pointers happen. Working around losing compile-time type data requires you to create runtime structures, which I don't find acceptable.


What does it actually gain by using macros instead of proper functions? The only generic macro that can't be written as a function is the one that uses `type`, but saving the element size in the struct is enough for this kind of code.
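
A sketch of that function-based alternative (illustrative code, not from the gist): record the element size in the struct and push through memcpy, at the cost of the compile-time type checking the macro version keeps.

  #include <stdlib.h>
  #include <string.h>

  typedef struct {
    void  *data;
    size_t len, cap, elem_size;  /* element size recorded once at init */
  } dyn_arr;

  static int dyn_arr_push(dyn_arr *a, const void *elem) {
    if (a->len == a->cap) {
      size_t newcap = a->cap ? a->cap * 2 : 8;
      void *tmp = realloc(a->data, newcap * a->elem_size);
      if (!tmp) return -1;               /* caller decides how to handle OOM */
      a->data = tmp;
      a->cap = newcap;
    }
    memcpy((char *)a->data + a->len * a->elem_size, elem, a->elem_size);
    a->len++;
    return 0;
  }
Usage would be along the lines of `dyn_arr a = {0}; a.elem_size = sizeof(int);` and then `dyn_arr_push(&a, &x);`.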


I think this author should learn about the "do {} while(0)" trick.


What's the benefit of do {...} while (0) vs just {...} ?


You can put it inside a braceless if/else statement and follow it with a semicolon.

Compare:

  #define FOO() { ... }
  #define BAR() do { ... } while (0)
In the former case:

  if (x)
    FOO();
  else
    ...;
  
  // becomes
  if (x) {
    ...
  }
  ;
  else
    ...;
In the latter case:

  if (x)
    BAR();
  else
    ...;
  
  // becomes
  if (x)
    do {
      ...
    } while (0);
  else
    ...;


The `{...}` version makes the macro usable without a trailing semicolon, which some people don't like; the `do {...} while (0)` version will cause a syntax error if the semicolon is missing, so they feel that it regulates syntax a bit better.


Works if you want to e.g. skip braces in if...else statements. Some would say always putting braces in is good style, but I like skipping them myself, so I won't defend it :)


I agree it's good style, but surrounding code using bad style is still a poor excuse for writing macros that break.


do {…} while (0) is a statement, so it can be put between “if (…)” and “; else”, while just {…} would result in a syntax error because the extra semicolon leaves the else without a matching if.


OK, but I guess what's really being relied on there is that it's syntactically still a single statement when you follow it with ';', whereas {...}; is two statements (which is what breaks the if-else).

I just discovered that gcc also supports non-standard "statement expressions" of the form ({...}) which would serve the same purpose at the cost of portability.
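
For reference, a statement expression evaluates to its last expression, so unlike do/while(0) it can also yield a value (a made-up example; GCC/Clang extension, not standard C):

  /* Non-portable GCC/Clang extension: the ({ ... }) block is an expression
   * whose value is that of the last statement inside it. */
  #define CLAMP_NONNEG(x) ({ int v_ = (x); v_ < 0 ? 0 : v_; })

  int example(int n) {
    return CLAMP_NONNEG(n - 10);  /* usable anywhere an expression is expected */
  }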


This seems pretty sensical to me except that it achieves its small size by sacrificing error checking and handling, even in the (aiui) non-exceptional case of realloc() returning NULL. This is of course classic C program behaviour so perhaps that's fine ;-)


Heavy use of macros to do metaprogramming is a strong sign it's time to move to a more powerful language.


On one hand that's obviously true, on the other hand it's just 60 lines of code, which is easier to check and audit than (for instance) a typical std::vector implementation - I don't know what the equivalent is in D, please forgive my ignorance :)


A typical std::vector implementation supports a lot more things, so it's not an apples-to-apples comparison. It's like saying that a horse-drawn carriage is a lot easier to repair than a motor vehicle. It's true, but doesn't really say much.


I don't have this luxury if I want to learn about the ffmpeg libraries: it's all written in C, almost all media and hardware accelerator libraries are written in C, and if you go to another language they just bind to C. It's so frustrating, but I have no choice but to stick with C.


Most languages have the ability to call C functions. Use the language that helps you write your code and convert when calling the third-party API.

For example, a C++ vector provides a generic container, and the code can still call C functions with the underlying array.

This is in the "Doctor, it hurts when I do X" bucket.

https://stackoverflow.com/questions/2923272/how-to-convert-v...


Yes, but if you're writing code that you want to be bindable to other languages, you essentially have to follow the C ABI, in which case, writing in C is a natural choice (yes, you can expose a C ABI from other languages, but that's usually not natural and you sort of have to understand C anyway to do that).


Unless your API surface constitutes the majority of your code, writing in C still doesn't make sense, because you're paying the tax every time you have to manually juggle strings and whatnot. C++ is a more sensible choice for this - you can still write C-compatible headers (and it's trivial to verify in builds - just include it in a .c file and see if that compiles!) while using STL etc in the implementation.
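
A sketch of such a header (hypothetical names): it compiles as plain C, and the extern "C" guard keeps the C++ implementation's symbols unmangled for C callers.

  /* widget.h -- hypothetical C-compatible API; the .cpp behind it can use the STL */
  #ifndef WIDGET_H
  #define WIDGET_H

  #include <stddef.h>

  #ifdef __cplusplus
  extern "C" {
  #endif

  typedef struct widget widget;          /* opaque handle */

  widget *widget_create(void);
  int     widget_push(widget *w, int value);
  size_t  widget_count(const widget *w);
  void    widget_destroy(widget *w);

  #ifdef __cplusplus
  }  /* extern "C" */
  #endif

  #endif /* WIDGET_H */
The "include it from a .c file" trick mentioned above then catches any accidental C++-isms in the header.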


Yeah if you're using opaque handles (where what you have is really a pointer to some C++ thing), this can work well.

If you're exposing structs that are passed back and forth for state, it's a bit more cumbersome.


Mixing high-level C++ stdlib classes with low-level C APIs is often not exactly trivial. For instance, most C++ containers expect their items to be RAII-compatible, and the resulting memory corruption errors will be harder to debug than in plain old C code because the shit hits the fan deep inside some completely unreadable C++ stdlib implementation code.


Huh? What's wrong with using the higher-level languages if they take care of binding to the C parts for you?

(I know these are few and far between for ffmpeg, too many leaky abstractions.)


Usually increased build system complexity, outdated manually maintained bindings, etc etc... - in some languages (like Zig) this is really trivial, in other languages it can be much harder.


Try to find ffmpeg bindings in Go or Rust that are less than 5 years old and complete. How about Intel libva?


D code can call C functions directly.


C + a set of macros is a more powerful programming language.


Also a far worse language though. Macros have their place, but trying to do anything complex with them just turns into a total nightmare. They're hard to reason about, walk through, or modify.

If heavy macro usage is found, it's definitely time to reconsider the approach.


It really depends on how macros are used. If you're just using them to implement "high level language features", it's not a problem; sure, you might have trouble figuring out what STAILQ_INSERT_TAIL does internally, but you're going to have just as much trouble figuring out what the Lisp or Perl or Python "add this item to the end of that list" operations do internally.

Macros can be a nightmare, but when they're used properly they're not.


> when they're used properly they're not

Yeah, they are.

Source: Decades of C programming


You're not the only person with decades of experience.


We all suffer from this, so it's not personal, but one is too often blind to the shortcomings of one's own code.


You must be aware that I'm not the author of BSD sys/queue.h.


If it were me, I'd use inline functions instead of macros.


C is powerful enough.


Lisp?


When it comes to macros, accept no substitute.

Granted, Lisp already has arrays, so not much reason to reinvent the wheel.

  (make-array '(2 6) :initial-element 0 :element-type '(unsigned-byte 32))
  ;; => #2A((0 0 0 0 0 0) (0 0 0 0 0 0))


Yes! :-)


Yeah, that's a great opinion and all, but generally C is used because it HAS to be used these days. No one is going to use this code for building a web application or toss it into legacy code.


awesome! i do the same thing for arrays[1] and maps[2].

stuff like this is great when you are trying to find the performance ceiling of some workload in c/cpp. literally nothing to hide.

1. https://github.com/nathants/bsv/blob/master/util/array.h

2. https://github.com/nathants/bsv/blob/master/util/map.h


The code has integer overflow even on 64-bit CPUs due to using uint32_t for capacity.


I like that the comments mention another two implementations of a similar idea and all are a bit different.

Here's one I wrote a few years ago: https://github.com/kiryk/mlisp/blob/master/vector.c

You may find it dirty, but it can be wrapped using macros, and used both on LHS and RHS like in "string(v, i) = ch"

(the macros BTW) https://github.com/kiryk/mlisp/blob/71d028738d2f9607fa7ce8c1...

(and a usecase) https://github.com/kiryk/mlisp/blob/71d028738d2f9607fa7ce8c1...


Not an entirely uncommon idea. I've written one.

There's also a well-known one here, in klib: https://github.com/attractivechaos/klib/blob/master/kvec.h


Probably many have written a vector header like this. It baffles me why this one reached the front page of HN given that the implementation is not that great. A few problems (some have been mentioned by others as well):

1. not using "do {} while (0)". This may lead to compile errors.

2. using uint32_t for capacity. On a 64-bit machine, this doesn't save memory.

3. DYN_ARR_RESIZE() may have quadratic time complexity if someone calls DYN_ARR_RESIZE(a, 10); DYN_ARR_RESIZE(a, 11); DYN_ARR_RESIZE(a, 12); etc. in a loop. kv_resize() in kvec.h wouldn't have this problem (see the sketch after this list).

4. Segfault if capacity is 0.
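
On point 3, one way to avoid the quadratic pattern (a generic sketch, not the gist's or kvec.h's exact code) is to round the requested size up to a geometric growth step before reallocating:

  #include <stddef.h>

  /* Illustrative helper: grow at least geometrically so a loop of
   * resize(10), resize(11), resize(12), ... triggers O(log n) reallocs
   * instead of one realloc (and potential copy) per call. */
  static size_t round_up_capacity(size_t current, size_t wanted) {
    size_t cap = current ? current : 8;
    while (cap < wanted)
      cap *= 2;
    return cap;
  }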


It’s better that the code isn’t perfect. Reading the code review comments in this thread has been really valuable for me personally. Like, I’m not sure I would’ve learned about the do-while trick had so many people not pointed it out.


Who hasn't? Almost any C program dealing with strings and collections has to have its own implementation or import one.

Part of the reason why C developers "feel" productive, but can't produce anything of meaningful complexity.


>Part of the reason why C developers "feel" productive, but can't produce anything of meaningful complexity.

https://quoteinvestigator.com/2010/05/17/remain-silent/


You hurt me with the truth.

Anyway, I think this HN posting has this "crazy" flavor of using macros to simulate generics, and that's the specific kind of implementation that I meant, which klib also does.


I think sqlite is fairly complex


Have you ever heard of Linux?


til the Linux kernel has no meaningful complexity


Still waiting for a revolutionary OS written in $virtuous_lang to save us from the scourge of unsafe Linux. They've had since Ada-83 to get something off the ground. Guess they aren't so productive either.


Counterpoint to people suggesting size_t for capacity: we recently changed our in-house std::vector replacement from { T* first, last, end; } to { T* first; u32 size, capacity; }.

Not only did this shave 8 bytes off each instance, but in many cases it saved more because instances became 16 bytes, which reduced alignment padding in other objects that use SIMD types requiring 16-byte alignment.

This saved 40 MB of memory for us in https://warframe.fandom.com/wiki/Orb_Vallis which, given the poverty of our mobile minspec, was a terrific savings.

Your needs may vary, but it's 2023 and we're still cramming 4 GB of poop into a 2 GB bag.
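
A quick size check of the two layouts (hypothetical struct names; the numbers assume a typical 64-bit ABI, so exact padding may differ):

  #include <stdint.h>
  #include <stdio.h>

  struct vec_three_ptrs { int *first, *last, *end; };        /* 3 x 8 = 24 bytes */
  struct vec_ptr_u32    { int *first; uint32_t size, cap; }; /* 8 + 4 + 4 = 16 bytes */

  int main(void) {
    printf("%zu %zu\n", sizeof(struct vec_three_ptrs), sizeof(struct vec_ptr_u32));
    return 0;  /* typically prints "24 16" on 64-bit targets */
  }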


In game development, you'll want to do things like this often, but general purpose programming shouldn't adopt this practice. It's literally what the type is for.


yes, i know growth by a factor of 2 has issues under certain usage patterns. those are less likely to be problematic when there is more stuff going on than just growing the buffer - you can use the "hole" for other allocations.


Hard to tease out what you're saying here, but maybe a Hashed Array Tree (HAT) is what you're referring to? I have a slightly extended version here [0][1]

[0] https://rkeene.org/viewer/tmp/hat.c.htm [1] https://rkeene.org/viewer/tmp/hat.png.htm


I think GP meant that his implementation doubles the capacity of the array on resize as opposed to increasing by 1.5 or some other factor.


The biggest problem with this is that you cannot assign, pass, or return an "instance" of the array struct itself. That's because it's declared as an unnamed struct and in C two unnamed structs are not type compatible even if they have identical fields. C23 did improve struct compatibility [1] but unfortunately not for unnamed structs.

[1] https://www.open-std.org/jtc1/sc22/wg14/www/docs/n3003.pdf


You can if you typedef it.


Clever. Seems so obvious in hindsight.


this one does RAII in C and it's most likely faster: https://bit.ly/3ICOplU


Even this one appears to leak memory on failed `realloc()`.


Any sufficiently complicated C program contains an ad hoc, informally-specified, bug-ridden, slow implementation of half of C++.


I dislike this - having the “array” be a struct containing the pointer and size fields makes it easy to “copy” the array such that you get dangling pointers. Similarly there’s no way to track lifetime or ownership of the array.

There are long term ABI benefits to these data structures just being opaque pointers outside of the implementing libraries


no reason for it to ever cross abi boundary.


Seriously, in this day and age, why are we still stuck with C? There are battle-tested container libraries in C++, for example.


Yep, once you find yourself writing containers that are found in C++, it's time to switch to it.

I've seen too many C developers re-write C++ containers in C because they are afraid of C++; it's madness.


C++ is annoying because of name mangling and things like static initialization calling constructors.

The result is that you can easily link C code to almost any language, including C++, with almost any linker. But for C++, you usually have to use the linker that comes with the C++ compiler that compiled your library. You can write code in C++ that is as compatible as C, but you have to go out of your way to achieve that, extern "C" is only the beginning.

As a result, when the overhead of using C instead of a more complete language is not too great, I prefer to write my libraries in ANSI-C, for maximum compatibility.


Indeed, C++ is madness. Unsafe iterators, insane template language compared to cpp


The post demonstrated one thing: don’t share your code.

Not to say I think this particular implementation is my favorite, but why anybody should expect a free online code snippet to be bulletproof is beyond me; it is a damn gist, a demo.


This isn't C, this is preprocessor. Tomato tomato, who cares if it is 60 lines or 600.

https://doc.rust-lang.org/src/alloc/vec/mod.rs.html#400

https://docs.rs/containers/latest/src/containers/collections...


https://www.open-std.org/jtc1/sc22/wg14/www/docs/n1124.pdf

Section 6.10 (Page 145): Preprocessing Directives

Hey, would you look at that! The preprocessor is a mandatory part of the language!


Why use macros instead of inline functions?


Because the API needs to work for random item types. C doesn't have templates (and for stuff like this C11's _Generic isn't helpful because that just maps specialized function names to generic function names, but doesn't help with the generic implementation code).


Inline functions can't be made type safe (or at least, as type safe as C gets) over generic arrays.


  a.capacity <<= 1u;
  decltype(a.data) tmp = (decltype(a.data)) realloc(a.data, sizeof(a.data[0]) * a.capacity);
  assert(tmp != NULL);
That's not ideal. Imagine you are at 32 GB capacity; the next realloc will ask for 64 GB, which is pretty excessive.


This isn't suitable for arrays that large regardless of the resizing multiplier.

realloc() generally allocates a new block of memory of the new size, copies the entire contents over, and then frees the old block. Only sometimes do you get lucky and have enough spare room after the existing data in the virtual address map to avoid that copy.

By the point this copying becomes relevant, a continuous dynamic array like this becomes fundamentally the wrong data structure.


My understanding is that this is the typical behavior of dynamic arrays


Only for bad dynamic array implementations.

Good ones might grow by the golden ratio, and enlarge by a fixed block size once the array is large.
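
For example, with a growth factor below the golden ratio the sum of previously freed blocks eventually exceeds the next request, so the allocator can in principle reuse that space; strict doubling never allows that. A sketch of a 1.5x step:

  #include <stddef.h>

  /* Sketch: grow by ~1.5x rather than 2x so earlier freed blocks can
   * eventually be reused for a later allocation. */
  static size_t next_capacity(size_t cap) {
    return cap < 8 ? 8 : cap + cap / 2;
  }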


Macros are evil. I will stick with Zig for the moment.


data (startptr), endptr & capacity?! Memory might be cheap enough to waste, but more derefs do not help timings.


Three pointers (or two and a size, or one and two sizes) is pretty standard for dynamic arrays. It lets you allocate more than one element at a time when repeatedly appending one element.

No checking malloc or realloc though. The anonymous struct is dubious too, though maybe being unable to pass these things to functions is a feature.


functions can work with just pointer / size / stride. think of it as stl iterators, there's no real point to be passing the container itself around.

however, you can still do that, if you really want.

just `typedef DYN_ARR_OF(widget) widget_array;` and now you have a name-able type, and can even have dynamic-arrays-of-dynamic-arrays (`DYN_ARR_OF(widget_array)`).


Capacity may not be equal to size.



