
Bulk Data Structures C++ - kouh
https://gamasutra.com/blogs/NiklasGray/20190724/347232/Data_Structures_Part_1_Bulk_Data.php
======
doctorpangloss
Game developers like me go through stages of grief when reinventing memory
management.

In this case, what will eventually be reinvented is an arena allocator.

Having just researched this, Cap'n Proto is a good implementation of one that
suits game development needs: (1) flexibility, (2) no separate serialization
representation for networking and AI, (3) mutability of primitives, (4)
garbage collection of stale objects in lists (i.e. removed items) is manual,
(5) constraints to prevent non-performant design, and (6) support for these
performance-sensitive idioms in multiple languages, not just C++.

Migrating to an arena allocator is a completely different can of worms...
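
For the uninitiated, the core of an arena allocator is tiny. A minimal bump-allocator
sketch (illustrative only, not Cap'n Proto's implementation):

    #include <cstddef>
    #include <cstdint>
    #include <cstdlib>

    // Minimal bump/arena allocator: each allocation is an O(1) pointer bump,
    // and everything is freed at once when the arena is reset or destroyed.
    struct Arena {
        uint8_t *base;
        size_t   size;
        size_t   used = 0;

        explicit Arena(size_t bytes)
            : base(static_cast<uint8_t *>(std::malloc(bytes))), size(bytes) {}
        ~Arena() { std::free(base); }

        // align must be a power of two
        void *alloc(size_t bytes, size_t align = alignof(std::max_align_t)) {
            size_t offset = (used + align - 1) & ~(align - 1);
            if (offset + bytes > size) return nullptr; // out of space
            used = offset + bytes;
            return base + offset;
        }

        void reset() { used = 0; } // "frees" every allocation at once
    };

Everything allocated from the arena shares one lifetime, which is what makes it
attractive for per-frame or per-level data.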

~~~
gpderetta
Exactly. Also, using a vector as the underlying storage instead of a set of
fixed-size memory chunks seems not ideal, to say the least.

~~~
krona
Except that std::vector can be specialized with custom arena allocators, which
isn't uncommon in performance sensitive applications.

~~~
gpderetta
std::vector needs to be contiguous, so you cannot use segmented storage.

~~~
krona
Most applications need more than one vector. Management of this with custom
allocators pays dividends fairly quickly.

~~~
gpderetta
Sorry, there is some confusion here. The OP is literally implementing a custom
allocator. I'm saying that using std::vector for the backing store of your
allocator is wrong because resizing will invalidate your already allocated
objects (which is a no-can-do in a general purpose allocator) and the copying
is wasteful. A custom fixed-size, chunk-based allocator would be better.
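
Something along these lines (a rough sketch, not the article's code): a pool that
grows by appending fixed-size chunks never moves objects that were already handed out.

    #include <cstddef>
    #include <vector>

    // Pool that grows by appending fixed-size chunks instead of resizing one
    // big buffer: addresses of already-allocated objects never change.
    // Growing only moves the chunk *pointers*, never the objects themselves.
    template <typename T, std::size_t ChunkSize = 1024>
    struct ChunkPool {
        std::vector<T *> chunks;
        std::size_t count = 0; // objects handed out so far

        T *allocate() {
            if (count == chunks.size() * ChunkSize)
                chunks.push_back(new T[ChunkSize]); // add a chunk, old ones stay put
            T *p = &chunks[count / ChunkSize][count % ChunkSize];
            ++count;
            return p;
        }

        ~ChunkPool() {
            for (T *c : chunks) delete[] c;
        }
    };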

------
slimscsi
> std::vector uses constructors and destructors to create and destroy objects
> which in some cases can be significantly slower than memcpy().

This is precisely what vector::emplace() solves, and std::move should be
faster than swap and pop. Modern C++ has changed a lot; this article ignores
the massive improvements added in C++11/14/17.
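
For example (an illustrative sketch, assuming a non-empty vector):

    #include <cstddef>
    #include <utility>
    #include <vector>

    struct Item { /* ... */ };

    void add(std::vector<Item> &items) {
        items.emplace_back(); // construct in place instead of copying a temporary
    }

    void remove_at(std::vector<Item> &items, std::size_t i) {
        if (i != items.size() - 1)
            items[i] = std::move(items.back()); // move the last element into the hole
        items.pop_back();                       // drop the now-moved-from tail
    }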

~~~
kcbanner
In most C++ game engines the standard library is almost never used, for
performance reasons.

See:
[https://github.com/electronicarts/EASTL](https://github.com/electronicarts/EASTL)

~~~
gpderetta
My understanding is that the primary reason developers use custom libraries
is not so much performance but that a) historically, console compilers and
especially standard libraries have been extremely buggy, and b) it is good to
have a single implementation across platforms instead of having to deal with
quirks and implementation divergence.

~~~
lasagnaphil
From what I've heard, there are two more major reasons not to use the STL for
gamedev.

- Debug build performance. Release builds of C++ code using the STL are generally
pretty fast, but Debug builds suffer a lot (Visual Studio's std::vector
implementation in particular is notoriously horrible for debug builds). Debug
executable speed matters when you are debugging a game; you don't want to test
your first-person shooter at 1 FPS!

- Build speed. Because of heavy use of templates and historical cruft, the STL
slows down your build times a lot. The build-test cycle is very important when
designing games; you don't want to wait a few hours after you've changed a
few lines of code to tweak a new feature. Gigantic distributed build servers
alleviate this problem a bit, but they are pretty cumbersome to set up
nonetheless.

------
daemin
Not necessarily a fan of this sort of re-blogging, so here's the original link:
[https://ourmachinery.com/post/data-structures-part-1-bulk-data/](https://ourmachinery.com/post/data-structures-part-1-bulk-data/)

~~~
sourthyme
Our Machinery has a lot of great resources; I recommend reading them when you
have time.

------
stephc_int13
Reading all the discussion and the visible confusion about best C++
practices, when and where a constructor will be called, etc. seems to be the
perfect illustration of the author's point.

------
zenogais
I wanted to like this article because I've been thinking about this a lot in
the context of game development, but I noticed a few things. One thing I'll say
from briefly playing with this: the code leaves a lot out and looks ostensibly
simpler than it really is. I would very much appreciate tips / pointers on this
or a more fleshed out and working implementation of the code.

For the bulk data with holes code:

First, there's an initialization step that has to happen the first time you
allocate your bulk_data_t. Namely, you need to iterate through every item in
the list and set its next_free field to the item following it, looping the last
item back around to zero. You also need to do this for all the items between
the old size and the new size every time you resize your item list.

Second, safe iteration over all of the bulk data doesn't seem possible without
adding some sort of flag to indicate whether or not an item is free.

Am I missing something here?
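
For concreteness, the initialization step I'm describing would look roughly like
this (field names are guesses, not the article's actual code):

    #include <cstdint>
    #include <vector>

    struct item_t {
        uint32_t next_free; // index of the next free slot
        // ... payload ...
    };

    // Thread slots [old_size, items.size()) onto the free list: each slot points
    // at the one after it, and the last one loops back around to 0.
    void init_free_list(std::vector<item_t> &items, uint32_t old_size) {
        for (uint32_t i = old_size; i < items.size(); ++i)
            items[i].next_free = (i + 1 < items.size()) ? i + 1 : 0;
    }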

~~~
dhruvrrp
I would say it depends on how one wants to handle it; for example, when you
create an item_t you set next_free = -1 as a flag to indicate that it is not
free, and have bulk_data_t's 0th position's next_free be 0.

~~~
zenogais
Update: I was indeed missing something.

I think I've figured out roughly what the author intended, code below [0].

First, it looks like he's relying implicitly on data stored in std::vector.
Namely, vectors have both a capacity and a size. The capacity is the total
number of allocated elements. The size is the number of elements actually
stored.

Second, vector::resize won't reallocate until it runs out of capacity, but it
will give you access to extra elements if you need them. So this is used to
lazily reallocate while bumping up the size of the vector.

Both of these effectively make it "do the right thing" by leaning on the
vector storing both size and capacity.

If you hand manage those values yourself you can get a pretty compact C
implementation without a lot of code.

One last thing: using a union here for item_t is pretty much guaranteed to
get you a segfault. The whole thing should really be a struct. This also
allows setting next to a sentinel value if necessary.

[0]: C code for bulk_data_t example:
[https://pastebin.com/Tfcdt39h](https://pastebin.com/Tfcdt39h)

~~~
dhruvrrp
The author's pseudo-code implies that bd->items is full already, since it is
probably initialized to the initial number of items that are added. This is
probably a memory optimization for games. It also explains why resize increases
the size by 1: just enough memory to add the new item.

This way you don't need to worry about keeping track of size and capacity
either.

The reason I suggested -1 is because when we iterate through bd->items, we
need a way to know if it's a valid value or just "holed".

~~~
zenogais
> This way you don't need to worry about keeping track of size and capacity
> either.

The only reason you don't have to worry about this is because std::vector
handles it for you, at least in the code examples provided by the author. If
you choose to go with a pure C implementation (which is what I'm trying out)
then you will have to keep track of these.

> The reason I suggested -1 is because when we iterate through bd->items, we
> need a way to know if it's a valid value or just "holed".

Yep, I was able to get an example using -1 as a sentinel working and passing
fuzz testing.
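
Roughly the shape of what worked (a simplified sketch, not the exact pastebin code):

    #include <cstdint>
    #include <vector>

    struct item_t {
        int32_t next_free; // -1 = slot holds a live item, otherwise next free slot
        // ... payload ...
    };

    struct bulk_data_t {
        std::vector<item_t> items; // items[0] is a reserved dummy slot
        int32_t freelist = 0;      // head of the free list; 0 means "no holes"
    };

    bulk_data_t bd_init() {
        bulk_data_t bd;
        bd.items.resize(1); // reserve slot 0 so index 0 can mean "end of list"
        return bd;
    }

    int32_t bd_alloc(bulk_data_t &bd) {
        int32_t i;
        if (bd.freelist != 0) {                  // reuse a hole if one exists
            i = bd.freelist;
            bd.freelist = bd.items[i].next_free;
        } else {                                 // otherwise grow by one slot
            i = (int32_t)bd.items.size();
            bd.items.resize(bd.items.size() + 1);
        }
        bd.items[i].next_free = -1;              // mark the slot as live
        return i;
    }

    void bd_remove(bulk_data_t &bd, int32_t i) {
        bd.items[i].next_free = bd.freelist;     // push the slot onto the free list
        bd.freelist = i;
    }

    // Iteration skips holes by checking the sentinel:
    bool is_live(const bulk_data_t &bd, int32_t i) {
        return bd.items[i].next_free == -1;
    }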

------
ball_of_lint
Although it's a fair amount of work, you can make it very simple to switch
between SoA and AoS by writing a child class for a C++ vector<yourclass> that
templates your original class, but returns values of a child class of
yourclass that operates on the SoA data.

With a public-data-heavy class that might run you into a performance problem
with allocating the extra unused memory, but you can always pull out the
interface as a virtual parent of both to avoid that as well.

I would rarely be afraid of using SoA over AoS if it can lead to significant
performance improvements. Done well it can hide all the complexity with some
clever use of interfaces and classes.

~~~
typon
Can you give an example?

~~~
ball_of_lint
This is mainly to illustrate the idea; I don't claim any correctness or good
performance from this code. (If you do inserts after reading a [], you may
invalidate some pointers!)

[https://pastebin.com/aZWTAL2J](https://pastebin.com/aZWTAL2J)

impl_X is your base class with most of your logic. interface is used to pull
out just the parts of the data that you might work with while wanting to have
it in SoA format. Then we specialize the vector template for the interface to
give us a dummy class with the things we need, but that sends our writes back
to the backing array.

If we need to get an individual struct out of it the conversion is automatic.
If we just need to access some member vars it will (hopefully) optimize down
to direct accesses. We do bear some complexity in implementation, but it's all
confined here.

I'm now realizing I was a bit imprecise in my earlier comment: the specialized
vector is not around <yourclass> but around an interface parent of your class.
You could also just specialize the yourclass vector, but then you don't have
the ability to switch.
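
A stripped-down sketch of the shape of the idea (illustrative names only, no
claims about performance; the pastebin has the fuller version):

    #include <cstddef>
    #include <vector>

    // AoS view of one element: what the rest of the code works with.
    struct particle {
        float x, y;
    };

    // SoA storage that hands out proxies referencing the backing arrays.
    struct particle_soa {
        std::vector<float> xs, ys;

        struct ref {
            float &x, &y;
            ref &operator=(const particle &p) { x = p.x; y = p.y; return *this; }
            operator particle() const { return {x, y}; } // AoS copy on demand
        };

        ref operator[](std::size_t i) { return {xs[i], ys[i]}; }

        void push_back(const particle &p) {
            xs.push_back(p.x);
            ys.push_back(p.y);
        }
    };

Reading or writing `soa[i].x` goes straight to the xs array, and converting
`soa[i]` to a particle gives you an AoS copy; as noted above, a push_back can
invalidate outstanding refs.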

~~~
ball_of_lint
After writing that up, I saw below that someone else has done it much like I
had envisioned and ironed out the odd parts.

Better source:
[https://github.com/crosetto/SoAvsAoS](https://github.com/crosetto/SoAvsAoS)

~~~
typon
Thank you

------
std_throwaway
I assume that we want to access these arrays as "array of structs" for most
functions but as "structure of arrays" for some calculation-intensive
functions. The article suggests storing it as an array of structs and making
copies for those calculations, but this seems inefficient to me. Modern C++
should provide a way to efficiently decouple the access model from the memory
layout.

Can we hide the actual memory layout without big overhead using C++
inline/template functions/classes? Would that be the visitor pattern?

~~~
plopz
Doesn't the memory layout actually matter for cache locality? So you would
still need to be able to have both memory layouts for performance.

~~~
std_throwaway
Cache locality is kind of the whole point of it. Some algorithms benefit
hugely if you choose a specific layout.

Other algorithms do some kind of random access to a few fields only and they
don't benefit at all. Those algorithms can make up 90% of your code but only
account for 10% of the computation. Therefore it would be easier to have your
data look like an AoS in 90% of your code but actually be stored as an SoA to
gain the speed in 90% of the computation.

------
degski
There is already a good solution:
[https://www.plflib.org/colony.htm](https://www.plflib.org/colony.htm), that
will [eventually] end up in the std
[[https://github.com/WG21-SG14/SG14/tree/master/SG14](https://github.com/WG21-SG14/SG14/tree/master/SG14)].

------
person_of_color
I really need a resource on how to make code cache friendly (or at least, more
aware of computer architecture). Got an interview coming up at an HFT firm.
Please HN, deliver!

~~~
westmeal
Check out Bisqwit's videos on cache locality.

~~~
person_of_color
Who is bisqwit?

Couldn't really get anything on Google.

------
saagarjha
> Also, without some additional measures, neither plain arrays or vectors
> support referencing individual objects.

Uh, isn't this just subscripting?

> But, as stated above, we don’t care about the order.

Maybe std::unordered_set is what you want?

~~~
einpoklum
Remember `std::unordered_set` is typically rather slow.

~~~
B4TMAN
Can you elaborate on why it is slow? Shouldn't it be faster than `std::set`,
which uses a red-black tree as the underlying data structure, thus providing
O(log n) time complexity? `std::unordered_set`, on the other hand, uses hash
functions to `index` into an array and retrieve, which is essentially O(1)
time complexity.

~~~
kllrnohj
This is where we get into O(1) != fast territory. Algorithmic complexity has a
weak relationship to CPU performance, not a strong one.

If you want to find something in a set, storing it as an array and doing a
linear scan will beat a std::unordered_set up to a shockingly large number of
items due to how CPUs work.

In particular it's the pointer chasing aspect of std::unordered_set that
becomes a problem (an issue shared with _most_ hash set implementations).
Remember an unordered_set is not an array of items, it's an array of _buckets_
of items (this is how hash collision is handled). Worse still, those buckets
are usually linked lists. It typically can't be speculated effectively and it
can't be prefetched effectively, so you become memory latency bound during an
un-cached lookup. And memory latency is just shy of absolutely terrible. If
you're expecting L1/L2/L3 cache hits on lookups then you're probably not
dealing with very large sizes, and you're going to get much better cache
density with the flat array than with the array-of-buckets.

There are alternative hash sets that are flat and avoid this, but they are less
common and, as far as I know, no standard implementation in any language uses
such a hash set. There's a good talk about such a dense, flat hash set here:
[https://www.youtube.com/watch?v=ncHmEUmJZf4](https://www.youtube.com/watch?v=ncHmEUmJZf4)
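
To make the comparison concrete, here's the shape of the two lookups (a toy
sketch; the crossover point depends on element size and hardware, so measure
rather than assume):

    #include <algorithm>
    #include <unordered_set>
    #include <vector>

    // Linear scan over a flat array: one contiguous, prefetch-friendly pass.
    bool contains_flat(const std::vector<int> &v, int key) {
        return std::find(v.begin(), v.end(), key) != v.end();
    }

    // Hash lookup: hash, jump to a bucket, then chase node pointers on collisions.
    bool contains_hashed(const std::unordered_set<int> &s, int key) {
        return s.find(key) != s.end();
    }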

~~~
yxhuvud
Some languages have hash tables (and hence sets, as they tend to be
implemented with them) that use open addressing, in a way that doesn't end up
being bad cachewise unless there are unreasonably many collisions for the same
hash code.

It is also not uncommon for mature implementations to optimize the cases with
few elements in the hash to use linear lookup. In some cases that optimization
is also used while storing data in small hash tables.

Pointer chasing hash tables was good when cache was nonexistant or small (ie
the 90s), but nowadays, open addressing is just superior.
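
For contrast with the node-based layout, a bare-bones open-addressing (linear
probing) set is just one flat array (fixed capacity, no deletion, purely
illustrative):

    #include <cstddef>
    #include <functional>
    #include <optional>
    #include <vector>

    // Open addressing with linear probing: all slots live in one flat array,
    // so a lookup walks consecutive cache lines instead of chasing pointers.
    struct open_set {
        std::vector<std::optional<int>> slots;

        explicit open_set(std::size_t capacity) : slots(capacity) {}

        bool insert(int key) {
            std::size_t i = std::hash<int>{}(key) % slots.size();
            for (std::size_t probes = 0; probes < slots.size(); ++probes) {
                if (!slots[i]) { slots[i] = key; return true; } // empty slot: claim it
                if (*slots[i] == key) return false;             // already present
                i = (i + 1) % slots.size();                     // probe the next slot
            }
            return false; // table full
        }

        bool contains(int key) const {
            std::size_t i = std::hash<int>{}(key) % slots.size();
            for (std::size_t probes = 0; probes < slots.size(); ++probes) {
                if (!slots[i]) return false;
                if (*slots[i] == key) return true;
                i = (i + 1) % slots.size();
            }
            return false;
        }
    };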

