
Data Structures Part 3: Arrays of Arrays - dmit
https://ourmachinery.com/post/data-structures-part-3-arrays-of-arrays/
======
lpghatguy
A lot of the trade-offs in picking a fixed-size array versus a heap-allocated
array can be solved with a container that supports the small vector optimization!

You see this in C++'s std::string type, which leads some people to use it for
storing binary data. [1] I'm not sure what common STL-ish implementations of
the idea exist.

In Rust, there's SmallVec, which has configurable storage backing and spills
onto the heap when there are too many elements. [2]

[1]
[https://stackoverflow.com/a/21710033/802794](https://stackoverflow.com/a/21710033/802794)

[2]
[https://docs.rs/smallvec/0.6.10/smallvec/struct.SmallVec.htm...](https://docs.rs/smallvec/0.6.10/smallvec/struct.SmallVec.html)

~~~
bullen
What is the difference between this "small" vector and a vector?

~~~
PixelOfDeath
The small vector class contains an inline, statically sized buffer of a few
bytes, and as long as your data fits in it, no heap allocation is needed. A
plain vector always heap-allocates its storage, even if you only use a few
bytes.
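As a rough illustration, a minimal small-vector for trivially copyable element types might look like the sketch below. This is hypothetical code, not any real library's implementation; SmallVec and boost::container::small_vector handle construction, alignment, and exception safety much more carefully.

```cpp
#include <cstddef>
#include <cstring>

// Sketch of the small vector optimization: elements live in an inline
// buffer until capacity N is exceeded, then spill to the heap.
// Only valid for trivially copyable T (we move bytes with memcpy).
template <typename T, std::size_t N>
class SmallVector {
    T inline_buf_[N];              // inline storage, lives inside the object
    T* data_ = inline_buf_;
    std::size_t size_ = 0;
    std::size_t cap_ = N;

public:
    SmallVector() = default;
    SmallVector(const SmallVector&) = delete;             // keep the sketch simple
    SmallVector& operator=(const SmallVector&) = delete;
    ~SmallVector() { if (data_ != inline_buf_) delete[] data_; }

    void push_back(const T& v) {
        if (size_ == cap_) {       // inline buffer exhausted: spill to the heap
            T* heap = new T[cap_ * 2];
            std::memcpy(heap, data_, size_ * sizeof(T));
            if (data_ != inline_buf_) delete[] data_;
            data_ = heap;
            cap_ *= 2;
        }
        data_[size_++] = v;
    }

    T& operator[](std::size_t i) { return data_[i]; }
    std::size_t size() const { return size_; }
    bool on_heap() const { return data_ != inline_buf_; } // for illustration
};
```

As long as at most N elements are pushed, the container never touches the allocator; the first push beyond that pays for one allocation and a copy.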

~~~
bullen
Ok, thx!

If I use the trick Niklas calls “array with holes” in part 1, will each hole
larger than the cache frame result in a cache miss?

------
zawerf
There was a really interesting post on this topic recently where you have a
circular array of circular arrays, each of size sqrt(N). [1]

The result is you can do O(1) access, O(sqrt(N)) insert and delete at
arbitrary indices, and O(1) insert at head and tail.

In terms of big O this is strictly better than:

- arrays: O(1) access, O(N) insert/delete in the middle, O(1) insert/delete at
the tail.

- circular arrays: O(1) access, O(N) insert/delete in the middle, O(1)
insert/delete at head and tail.

- fixed-page-size chunked circular arrays, such as the C++ implementation of
std::deque, which is still O(N) for insert and delete in the middle. [2]
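A minimal sketch of the sqrt-decomposition idea follows. This is hypothetical and not the implementation from the linked post: block size B is fixed rather than rebalanced as N grows, and erase is omitted, but it shows why circular blocks make the cascade cheap.

```cpp
#include <cstddef>
#include <vector>

// Array of circular blocks of capacity B. Access is O(1); insert shifts at
// most B elements inside the target block, then moves one overflow element
// into each later block in O(1) apiece (circular push_front/pop_back), so
// insert is O(B + N/B) = O(sqrt(N)) when B ~ sqrt(N).
template <typename T, std::size_t B>
class TieredVector {
    struct Block {
        T buf[B];
        std::size_t head = 0, count = 0;
        T& at(std::size_t i) { return buf[(head + i) % B]; }
        void push_front(const T& v) { head = (head + B - 1) % B; buf[head] = v; ++count; }
        T pop_back() { --count; return buf[(head + count) % B]; }
        void insert_at(std::size_t i, const T& v) {  // O(B) shift, target block only
            for (std::size_t j = count; j > i; --j) at(j) = at(j - 1);
            at(i) = v;
            ++count;
        }
    };
    std::vector<Block> blocks_;
    std::size_t size_ = 0;

public:
    std::size_t size() const { return size_; }
    T& operator[](std::size_t i) { return blocks_[i / B].at(i % B); }  // O(1)

    void insert(std::size_t index, const T& v) {
        // Invariant: every block is full except the last one.
        if (blocks_.empty() || blocks_.back().count == B) blocks_.emplace_back();
        std::size_t b = index / B;
        T carry = v;
        Block& first = blocks_[b];
        if (first.count == B) {            // target block full: pop its tail first
            T ov = first.pop_back();
            first.insert_at(index % B, carry);
            carry = ov;
        } else {
            first.insert_at(index % B, carry);
            ++size_;
            return;
        }
        for (std::size_t j = b + 1; ; ++j) {  // cascade overflow, O(1) per block
            Block& blk = blocks_[j];
            if (blk.count < B) { blk.push_front(carry); break; }
            T ov = blk.pop_back();
            blk.push_front(carry);
            carry = ov;
        }
        ++size_;
    }
};
```

The key trick is that blocks after the target one never shift their contents: each just rotates its circular buffer by one (pop its tail, push the incoming element at its head), which is why the cascade costs O(N/B) rather than O(N).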

[1]
[https://news.ycombinator.com/item?id=20872696](https://news.ycombinator.com/item?id=20872696)

[2] [https://stackoverflow.com/questions/6292332/what-really-is-a...](https://stackoverflow.com/questions/6292332/what-really-is-a-deque-in-stl)

~~~
jbapple
Assuming you ignore the constant. And if you're willing to ignore the
constant, you can get O(c) access, O(N^(1/c)) insert and delete at arbitrary
indices, and O(c) insert at head and tail, for any constant c. The trick is to
make the array into a B-tree of width N^(1/c) and depth c. Note that the
insert/delete time is amortized: the worst-case time is O(cN^(1/c)).

~~~
zawerf
Yeah, someone else mentioned that generalization (called a tiered vector) in
that thread:
[https://news.ycombinator.com/item?id=20873110](https://news.ycombinator.com/item?id=20873110)

------
bullen
I'm implementing cache-friendly iteration on my open-source multiplayer server
even though it is written in Java, because I'm going to need it for my C++
client later.

So I did thread-safe compacting from the end, which works fine, but now I want
to implement 8-neighbor sending on an array, and that can easily be done with
fixed over-sized boxes.

My question now is: if I use the trick Niklas calls “array with holes” in part
1, will each hole larger than the cache frame result in a cache miss?

~~~
corysama
It doesn’t sound like you are mentally modeling cache misses correctly. Any
access to any array (dense or holey) that isn’t already in cache will result
in a miss. A goal is to minimize misses by taking advantage of locality and
prefetching.

Let’s assume you have an array of objects that are small enough that 4 can fit
in a cache line.

You take advantage of locality by densely packing them into an array so that
you can process 4 adjacent (and aligned) items for the price of only 1 cache
miss. The array with holes can reduce your utilization by mixing live and dead
data in a single cache line. Worst case would be 1 live item followed by a
hole, so that the next 3 slots in the cache line are dead bytes; that cache
line only buys you 1 item to process. But if the hole continues for several
more cache lines it shouldn't matter, because you should have a means in place
to skip all of that dead data and go straight to the next live item.
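The utilization arithmetic can be made concrete with a small sketch. This is a hypothetical model, not real profiling: it assumes a 64-byte cache line and a 16-byte item (so 4 items per line) and counts, by element index, how many distinct lines a full pass over an array touches versus how many live items that traffic actually buys.

```cpp
#include <cstddef>
#include <cstdint>
#include <set>
#include <utility>
#include <vector>

constexpr std::size_t kLineBytes = 64;   // assumed cache line size

struct Item {
    std::uint32_t id = 0;
    bool alive = false;
    std::uint8_t pad[11] = {};           // pad to 16 bytes: 4 items per line
};
static_assert(sizeof(Item) == 16, "expected 4 items per 64-byte line");

// Returns {live items processed, distinct cache lines touched}, modeling
// lines by element index (i.e. assuming the array starts on a line boundary).
std::pair<std::size_t, std::size_t> scan(const std::vector<Item>& items) {
    constexpr std::size_t kPerLine = kLineBytes / sizeof(Item);
    std::set<std::size_t> lines;
    std::size_t live = 0;
    for (std::size_t i = 0; i < items.size(); ++i) {
        lines.insert(i / kPerLine);      // every slot read costs line traffic...
        if (items[i].alive) ++live;      // ...but only live slots do useful work
    }
    return {live, lines.size()};
}
```

A dense 16-element array gives 16 live items across 4 lines (4 per line); the worst-case holey pattern described above, 1 live item in every group of 4, touches the same 4 lines but yields only 1 live item per line.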

You take advantage of prefetching by processing your array linearly from start
to finish so that the CPU will notice and start fetching cache lines further
down the array before you even try to access them. Arrays with holes can mess
this up by making your access pattern irregular. If your array is very sparse
you will be accessing cache lines without a repeating pattern and the CPU
won’t be able to figure out how to help. But, the goal is for your array to be
fairly dense (or at least reasonably defragmented) so that you end up
accessing long, contiguous spans of cache lines that make the prefetcher
happy. At a minimum, the goal is to do better than most multilevel/tree-based
data structures.

Check out the writings and presentations of Martin Thompson (“mechanical
sympathy” guy). He has made a career of getting good performance out of Java
by coding it in the style of embedded-device C programs.

~~~
bullen
Ok, thanks, I suspected this would be the case... I actually use an int array
of ids in my iterations, so 16 per cache line. I can only compact them from
the start and end of each box (geographical square) on array-edge enter/leave,
because of thread safety. I will size the boxes at what my client can render,
probably ~900 per box, since C++ skinned-mesh animation won't allow more, and
prune with priority (friends first, same country second, etc.). But now I know
how to implement it, thanks to Niklas and your comment.

------
known
Learning arrays of arrays from Perl is easy
[https://perldoc.perl.org/perldsc.html](https://perldoc.perl.org/perldsc.html)

