
Joint Allocations in C++ - ingve
https://turingtester.wordpress.com/2015/09/13/joint-allocations-in-c/
======
Galanwe
Sadly, this is representative of the new trend among C++ programmers. I will get
downvoted, hated, and laughed at, but I do prefer the original version of the
code. It is infinitely simpler and more readable, and the layers of abstraction
added are of pretty much no value here.

If I stumbled across the latter version, I would just scratch my head. And
I've seen so many projects end up as bloated code just because some
programmers wanted to add tons of "new" features like this that add nothing
and just complicate the code for no reason.

~~~
quicknir
Post author here. Sorry you feel that way. I'm surprised you don't think that
moving out the messy code into make_contiguous is a win. Isn't it good to
factor out clearly reusable code into functions, so you can reuse it? Would
you really prefer writing out code like that in a dozen places for a dozen
classes that needed contiguous storage, to writing one function, testing it,
and then calling it from all those other places?

~~~
yassim
Hi, Another Game dev here. Firstly, neat article, thanks for writing it.

I'm curious as to the problem you're exploring/solving here?

"Concatenated structs in a single chunk" is a neat trick. The use case seems a
bit broken[1], but I'm assuming that it's from the original video, and you're
C++-ifying it? Or at least trying to find a nicer way to write this style of
code?

If you're going this route, I'd be tempted not to use pointers at all, and just
use offsets and accessors.[2]

    
    
      * Lack of pointers means you can load it in 1 read. (endianness needs to be watched, and this assumes you can get the entire size elsewhere)
      * You can also shuffle it around in memory should you be doing something fancy. (defragging heaps might be useful if doing a sandbox)
      * Better encapsulation? Maybe.
    

But the basic point I'd like to make is, there is no nice way to write this
type of code, because C and C++ give us no nice[3] way of expressing it.

[1] Verts and indices are generally uploaded to a GPU and then discarded, but
this chunk also holds the counts, which are needed CPU-side for draw calls, so
they would need to be copied out before freeing.

[2]

    
    
      #include <cstdint>

      // Vector2 and Vector3 are assumed to be defined elsewhere.
      struct Mesh
      {
          int32_t num_indices,
                  num_verts;

          // The arrays live inline, directly after this struct.
          // Assumes Vector3 has 4-byte alignment
          // and that Mesh has been allocated with the same alignment.
          //
          // Vector3 positions[num_verts];
          // int32_t indices[num_indices];
          // Vector2 uvs[num_verts];

          // `this + 1` is the first byte past the header.
          const Vector3* PositionsBegin() const { return reinterpret_cast<const Vector3*>(this + 1); }
          const Vector3* PositionsEnd() const { return PositionsBegin() + num_verts; }
          // Each array starts where the previous one ends.
          const int32_t* IndicesBegin() const { return reinterpret_cast<const int32_t*>(PositionsEnd()); }
          const int32_t* IndicesEnd() const { return IndicesBegin() + num_indices; }
          // etc
      };
    

[3] Maintainable, fast at runtime, low mental friction to people other than
the author, etc
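
A minimal sketch of how the offsets-and-accessors layout from footnote [2] could be allocated in one call and then copied byte-for-byte, which is what the lack of pointers buys. The names `MeshBlob`, `CreateMeshBlob`, and `CloneMeshBlob`, and the float-based `Vector2`/`Vector3`, are hypothetical and not taken from the article or the video; the layout assumes every element type has 4-byte alignment.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <cstring>

// Assumed plain-data vector types; the thread does not define them.
struct Vector2 { float x, y; };
struct Vector3 { float x, y, z; };

// Header of one contiguous blob laid out as:
// [MeshBlob][Vector3 * num_verts][int32_t * num_indices][Vector2 * num_verts]
struct MeshBlob {
    std::int32_t num_indices;
    std::int32_t num_verts;

    // Byte offsets measured from the start of the header.
    std::size_t PositionsOffset() const { return sizeof(MeshBlob); }
    std::size_t IndicesOffset() const {
        return PositionsOffset() + sizeof(Vector3) * static_cast<std::size_t>(num_verts);
    }
    std::size_t UvsOffset() const {
        return IndicesOffset() + sizeof(std::int32_t) * static_cast<std::size_t>(num_indices);
    }
    std::size_t TotalBytes() const {
        return UvsOffset() + sizeof(Vector2) * static_cast<std::size_t>(num_verts);
    }
};

// The arrays pack without padding only because everything shares one alignment.
static_assert(alignof(MeshBlob) == 4 && alignof(Vector3) == 4 &&
              alignof(Vector2) == 4 && alignof(std::int32_t) == 4,
              "layout assumes 4-byte alignment throughout");

inline unsigned char* Bytes(MeshBlob* m) { return reinterpret_cast<unsigned char*>(m); }
inline Vector3* Positions(MeshBlob* m) {
    return reinterpret_cast<Vector3*>(Bytes(m) + m->PositionsOffset());
}
inline std::int32_t* Indices(MeshBlob* m) {
    return reinterpret_cast<std::int32_t*>(Bytes(m) + m->IndicesOffset());
}
inline Vector2* Uvs(MeshBlob* m) {
    return reinterpret_cast<Vector2*>(Bytes(m) + m->UvsOffset());
}

// One malloc for header plus all three arrays, zero-filled.
// The caller releases the blob with std::free.
inline MeshBlob* CreateMeshBlob(std::int32_t num_verts, std::int32_t num_indices) {
    MeshBlob header{num_indices, num_verts};
    void* raw = std::malloc(header.TotalBytes());
    if (raw == nullptr) return nullptr;
    std::memset(raw, 0, header.TotalBytes());
    std::memcpy(raw, &header, sizeof header);
    return static_cast<MeshBlob*>(raw);
}

// The blob stores no pointers, so a byte-wise copy is a complete, valid mesh.
inline MeshBlob* CloneMeshBlob(const MeshBlob* source) {
    void* raw = std::malloc(source->TotalBytes());
    if (raw == nullptr) return nullptr;
    std::memcpy(raw, source, source->TotalBytes());
    return static_cast<MeshBlob*>(raw);
}
```

The same `memcpy` in `CloneMeshBlob` is what would let the blob be read from disk or moved during heap defragmentation, provided the total size is known up front and endianness matches.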

~~~
quicknir
Hi, thanks yassim. Yes, the use case is from the original article. Yes, that
is certainly another way to go: you could just provide the unique_ptr and a
bunch of integer offsets. In this post, my data structure is trying to balance
performance and code cleanliness, and it's advantageous to provide a standalone
ArrayView, which can't just be two integers. In the follow-up post I'll be
trying to make the data structure itself generic, so it will more easily allow
for space optimizations.

I guess we'll agree to disagree. I think that C++ gives us some nice ways of
writing it, and I think that's what the post shows :-)

------
jzwinck
I watched the linked original video to understand the motivation, and read the
article, and I think this is a misguided solution to a real problem.

These "joint allocations" seem to be "a small pool of heterogeneous types." The
motivation is that heap allocations are expensive, so we want to coalesce
them. The only way this would matter at all is if we are creating lots of
these Mesh objects.

The given solutions make one heap allocation per Mesh object, as opposed to
three. Create six million Meshes and you get six million allocations.

Instead, one pool should be created for each of the three array types:
Vector2, int, and Vector3. Now, create six million Meshes and you can do
somewhere between three and a few dozen allocations, depending on your pool's
initial size and growth strategy.
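
One way to read this proposal, as a rough sketch: each element type gets a growable pool, and a mesh records offsets into the pools instead of owning memory. `Pool`, `MeshPools`, `PooledMesh`, and `MakeMesh` are hypothetical names, and the initial capacity of 1024 is an arbitrary assumption. Offsets are stored rather than pointers because growing a pool may relocate its storage.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Assumed plain-data vector types; the thread does not define them.
struct Vector2 { float x, y; };
struct Vector3 { float x, y, z; };

// A growable pool of T. Ranges are identified by offset, not by pointer,
// because growth can move the underlying buffer.
template <typename T>
class Pool {
public:
    explicit Pool(std::size_t initial_capacity) { storage_.reserve(initial_capacity); }

    // Appends `count` value-initialized elements; returns the offset of the first.
    std::size_t Allocate(std::size_t count) {
        const std::size_t offset = storage_.size();
        storage_.resize(offset + count);
        return offset;
    }

    // Valid only until the next Allocate call on this pool.
    T* At(std::size_t offset) { return storage_.data() + offset; }

private:
    std::vector<T> storage_;
};

// One pool per array type, shared by every mesh.
struct MeshPools {
    Pool<Vector3> positions{1024};
    Pool<std::int32_t> indices{1024};
    Pool<Vector2> uvs{1024};
};

// A mesh is just a set of ranges into the shared pools.
struct PooledMesh {
    std::size_t positions_offset;
    std::size_t indices_offset;
    std::size_t uvs_offset;
    std::size_t num_verts;
    std::size_t num_indices;
};

inline PooledMesh MakeMesh(MeshPools& pools, std::size_t num_verts, std::size_t num_indices) {
    PooledMesh mesh;
    mesh.num_verts = num_verts;
    mesh.num_indices = num_indices;
    mesh.positions_offset = pools.positions.Allocate(num_verts);
    mesh.indices_offset = pools.indices.Allocate(num_indices);
    mesh.uvs_offset = pools.uvs.Allocate(num_verts);
    return mesh;
}
```

Heap allocations happen only when a pool outgrows its capacity, and typical `std::vector` implementations grow geometrically, so the count stays small however many meshes are created. The trade-off in this simplified version is that an individual mesh cannot give its ranges back.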

~~~
jasode
_> Now, create six million Meshes_

I didn't watch the presentation so I may be wrong, but it's possible you're
misinterpreting how Mesh was intended to be used.

The "million" objects is meant to be _inside_ one Mesh. Mesh is singular. The
vertices are plural (millions).

If there's more than one Mesh, it would be dozens or hundreds of Meshes (one
per video game character).

~~~
jzwinck
But if you only ever create a single Mesh, then you wouldn't care to coalesce
the allocations, because three allocations instead of one is inconsequential
in the course of an entire program lifetime. I agree that there are many
vertices and edges within each Mesh, but there had better also be tons of
Meshes within one program, or there is no problem to begin with.

~~~
svalorzen
I believe that the main problem is not allocation time (games already have
loading times, that's not a problem), but fast access and not breaking the
cache.

~~~
jzwinck
The title of the first slide in the video is "Heap allocations are expensive!"
And using a pool the way I'm proposing does not fragment the heap. It might be
slightly worse if you serially touch a lot of Meshes which have only a few
vertices and edges each, but this is a corner case and a trade-off anyway.

------
octo_t
Or you could just use C++11/14 allocators instead which fix this in a much
more transparent way...

~~~
humanrebar
Exactly. Stateful allocators were added to the standard library especially for
this sort of problem. Before, allocators were assumed to have no internal
state, which made it problematic to manage memory from an arena or pool.

Perhaps there's a reason allocators wouldn't work here, but if so, it deserves
a little discussion.

~~~
quicknir
I did discuss stateful allocators. They're nice, I'm a big fan. But as I said,
they would impose pointless space overhead of one extra pointer per array.
Also, you now have more complicated issues: each of the arrays is now an
owner. What if you try to make a copy of it? Move construction? What if the
container tries to deallocate, how does our allocator handle that? When I copy
Mesh, would I need to change the allocator's state in the copies? How would
that even work? I think that once you think about trying to accomplish the
very specific and relatively simple thing being done here with allocators,
you'll see that it would be quite a bit more complexity to no benefit.

I think I did give it a "little" discussion :-). Guess it depends on your
definition of a little. I didn't want to talk more about it because I didn't
want to get sidetracked, and this is ultimately the way I chose to do it.

Hope that sheds some light on the post and the choices I made with it.
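
For readers unfamiliar with stateful allocators, here is a minimal C++11-style sketch of one backed by a fixed buffer. `Arena`, `ArenaAllocator`, and `ExampleUsage` are hypothetical names, not from the article, and the arena never reclaims memory. Each container must carry the allocator's `Arena*`, which is the per-array space overhead discussed above.

```cpp
#include <cassert>
#include <cstddef>
#include <new>
#include <vector>

// Bump allocator over a caller-provided buffer. Assumes the buffer itself
// is aligned at least as strictly as any type allocated from it.
class Arena {
public:
    Arena(void* buffer, std::size_t size)
        : base_(static_cast<unsigned char*>(buffer)), size_(size), used_(0) {}

    void* Allocate(std::size_t bytes, std::size_t alignment) {
        // Round the current position up to the requested alignment.
        const std::size_t start = (used_ + alignment - 1) / alignment * alignment;
        if (start + bytes > size_) throw std::bad_alloc();
        used_ = start + bytes;
        return base_ + start;
    }

    std::size_t Used() const { return used_; }

private:
    unsigned char* base_;
    std::size_t size_;
    std::size_t used_;
};

// Minimal allocator; std::allocator_traits fills in the rest.
template <typename T>
struct ArenaAllocator {
    using value_type = T;

    explicit ArenaAllocator(Arena& source) noexcept : arena(&source) {}

    // Converting constructor, needed when a container rebinds the allocator.
    template <typename U>
    ArenaAllocator(const ArenaAllocator<U>& other) noexcept : arena(other.arena) {}

    T* allocate(std::size_t n) {
        return static_cast<T*>(arena->Allocate(n * sizeof(T), alignof(T)));
    }
    void deallocate(T*, std::size_t) noexcept {}  // memory is released with the arena

    Arena* arena;  // the state every container has to store
};

template <typename T, typename U>
bool operator==(const ArenaAllocator<T>& a, const ArenaAllocator<U>& b) noexcept {
    return a.arena == b.arena;
}
template <typename T, typename U>
bool operator!=(const ArenaAllocator<T>& a, const ArenaAllocator<U>& b) noexcept {
    return !(a == b);
}

// Two vectors drawing from the same stack buffer; returns the bytes consumed.
inline std::size_t ExampleUsage() {
    alignas(std::max_align_t) unsigned char buffer[1024];
    Arena arena(buffer, sizeof buffer);
    ArenaAllocator<int> int_alloc(arena);
    ArenaAllocator<float> float_alloc(arena);
    std::vector<int, ArenaAllocator<int>> indices(int_alloc);
    std::vector<float, ArenaAllocator<float>> weights(float_alloc);
    indices.reserve(8);
    weights.reserve(8);
    return arena.Used();
}
```

On common standard library implementations, `sizeof(std::vector<int, ArenaAllocator<int>>)` is one pointer larger than `sizeof(std::vector<int>)`. Copying such a vector also copies the `Arena*`, so the copy allocates from the same arena unless the allocator says otherwise, which is the kind of ownership question raised above.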

------
uxcn
I was recently bitten using a similar technique to the original code for an
intrusive type in C. Using manual, non-typesafe raw offsets for things can
definitely lead to nasty bugs.

Automating the calculation of offsets and assigning pointers definitely
eliminates a lot of potential bugs, but I do wonder if this isn't a bigger
deficiency in C++. Why force storing an extra pointer per _array_? I am still
not sure why C++ doesn't allow FAMs [1].

There's probably a question of which would be more efficient, storing pointers
in the object and calculating the offset from the pointer, or dynamically
calculating the offset within the object. Still, it doesn't seem like a
problem developers should need to solve.

[1] https://en.wikipedia.org/wiki/Flexible_array_member

~~~
quicknir
It's not really an extra pointer, unless you assume there are no alignment
issues. If you assume that there's zero padding between members, then yes, you
can do 1 pointer for storage + 1 per view. In the follow-up post, I plan to
either do one pointer per view, or one integer per view, haven't decided
which.
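
To make the alignment point concrete, here is a small sketch of computing where a second array can start inside a shared buffer once padding is accounted for. `AlignUp`, `Layout`, and `ComputeLayout` are hypothetical helpers, not part of the article's code, and `AlignUp` assumes the alignment is a power of two.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Rounds `offset` up to the next multiple of `alignment`.
// Assumes `alignment` is a power of two, as all C++ alignments are.
constexpr std::size_t AlignUp(std::size_t offset, std::size_t alignment) {
    return (offset + alignment - 1) & ~(alignment - 1);
}

// Byte offsets of two arrays sharing one buffer, plus the buffer size.
struct Layout {
    std::size_t first_offset;
    std::size_t second_offset;
    std::size_t total_bytes;
};

// Places `count_a` objects of A at the start of the buffer and `count_b`
// objects of B after them, padded so that B is correctly aligned.
// The buffer itself must be aligned for both A and B.
template <typename A, typename B>
Layout ComputeLayout(std::size_t count_a, std::size_t count_b) {
    Layout layout;
    layout.first_offset = 0;
    const std::size_t end_of_first = sizeof(A) * count_a;
    layout.second_offset = AlignUp(end_of_first, alignof(B));
    layout.total_bytes = layout.second_offset + sizeof(B) * count_b;
    return layout;
}
```

When `second_offset` differs from `sizeof(A) * count_a`, the start of the second array cannot be derived from the element counts alone without repeating this calculation, which is why a stored pointer or offset per view is not necessarily redundant.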

The thing is, who should solve it? The different approaches you listed have
different advantages, there isn't one right answer. If the language itself
solves it for you, you are stuck with whatever solution the language picked.
This would be fine in a higher level language, but not in C++.

You're right though that individual devs shouldn't be solving it, it should be
in a library. If there's enough interest in these posts, I'm happy to put up
my work (fully fleshed out and documented) on a github for people to use.

~~~
uxcn
I suspect the compiler will be better able to optimize offsets than pointers
just because of the semantics required by the language, but I think you're
right that it isn't necessarily one size fits all. Another possible solution
might even be storing static bounds at specific intervals. Without numbers, I
can only guess which would give optimal performance though.

I think the library approach is right. It would be nice to see the language
transparently support contiguous array members, but supporting it in a library
will allow more people to use it regardless of compiler. It avoids the
standards approval process and implementation as well, which take non-trivial
amounts of time. The C++0x/1y features definitely make it a lot more feasible.

It would probably give people a better chance to play with the source if you
could post a link to it somewhere. I'm not sure what you plan to license it
under.

I'll keep an eye out for the next post.

------
makecheck
If you ever start from something like this, you should be asking a lot of
more-basic questions first, such as:

1. _Why_ is the goal to put everything in one block with a single allocation?
Could everything still work if they were separated?

2. What do Vector2 and Vector3 look like? What if they say "int a, b" and
"int a, b, c" respectively? If a perfectly-aligned "new int[total]" would have
fixed the problem, it should have been tried from the start and the code
refactored accordingly to not necessarily use structure types to look up the
data.

3. Conversely, are Vector2 and Vector3 complex classes with special
constructors (or equally important, _could they someday become that way_ )? If
so, even a "fixed" memory-allocation solution will be equally fragile to
maintain because there is a responsibility to ensure that constructors are
called correctly.

4. What is the memory profile of the rest of the application, e.g. how many
Mesh objects themselves are created and _is the entire approach to managing
Mesh objects wrong_? Maybe the focus on optimizing one piece has missed an
entire problem somewhere else that is more fundamental.

Clearly the original code has bugs, but it is also only 14 lines and the fixes
for the bugs are straightforward. The right solution (after other analysis) may
well have been to remove the code entirely instead of doing the same thing in
a different way. Beware of the tendency to "fix" things without looking more
deeply at the actual problem.
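
Point 3 can be illustrated with a short sketch of the bookkeeping that becomes necessary once the element types have non-trivial constructors or destructors. `ConstructArray`, `DestroyArray`, and the `Tracked` type are hypothetical and exist only to show the pattern; the raw storage is assumed to be suitably sized and aligned for `T`.

```cpp
#include <cassert>
#include <cstddef>
#include <new>

// Constructs `count` value-initialized objects of T in raw storage.
// If a constructor throws, the objects already built are destroyed first.
template <typename T>
T* ConstructArray(void* storage, std::size_t count) {
    T* first = static_cast<T*>(storage);
    std::size_t built = 0;
    try {
        for (; built < count; ++built) {
            ::new (static_cast<void*>(first + built)) T();
        }
    } catch (...) {
        while (built > 0) first[--built].~T();
        throw;
    }
    return first;
}

// Destroys the objects in reverse order; the storage itself is not freed.
template <typename T>
void DestroyArray(T* first, std::size_t count) {
    while (count > 0) first[--count].~T();
}

// Illustrative type that counts live instances.
struct Tracked {
    static int live;
    Tracked() { ++live; }
    ~Tracked() { --live; }
};
int Tracked::live = 0;
```

With plain `malloc` and pointer casts, none of these constructor or destructor calls happen, so code that works for trivially constructible vectors can break silently if `Vector2` or `Vector3` later gain real constructors.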

------
tubs
Vertex attributes should probably be interleaved anyway.

~~~
ygra
The whole point is that they shouldn't, because there are multiple loops over
the different attributes and for best performance they should be in individual
contiguous memory regions. Remember that you only have a few milliseconds per
frame in a game and those things can matter here.

~~~
tubs
The two attributes are pos and uv, which are almost always tied to each other
(and will therefore be updated together). The GPU will access attributes by
index, hence they should be interleaved; otherwise you are just forcing cache
misses.

Indices are _not_ vertex attributes and thus should not be interleaved with
the other data.
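
The two layouts being argued over can be sketched as follows. `InterleavedVertex`, `SeparateAttributes`, `Split`, and `SumY` are hypothetical names, and the float-based vector types are an assumption. Which layout is faster depends on whether a given pass reads one attribute or all of them per vertex, and the sketch does not measure that.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Assumed plain-data vector types; the thread does not define them.
struct Vector2 { float x, y; };
struct Vector3 { float x, y, z; };

// Interleaved: all attributes of one vertex sit next to each other.
struct InterleavedVertex {
    Vector3 position;
    Vector2 uv;
};

// Separate: each attribute is contiguous on its own.
struct SeparateAttributes {
    std::vector<Vector3> positions;
    std::vector<Vector2> uvs;
};

// Converts an interleaved vertex list into separate attribute arrays.
inline SeparateAttributes Split(const std::vector<InterleavedVertex>& vertices) {
    SeparateAttributes result;
    result.positions.reserve(vertices.size());
    result.uvs.reserve(vertices.size());
    for (const InterleavedVertex& v : vertices) {
        result.positions.push_back(v.position);
        result.uvs.push_back(v.uv);
    }
    return result;
}

// Position-only pass over interleaved data: each step also strides
// over the uv bytes even though they are never read.
inline float SumY(const std::vector<InterleavedVertex>& vertices) {
    float total = 0.0f;
    for (const InterleavedVertex& v : vertices) total += v.position.y;
    return total;
}

// The same pass over separate data touches only position bytes.
inline float SumY(const SeparateAttributes& attributes) {
    float total = 0.0f;
    for (const Vector3& p : attributes.positions) total += p.y;
    return total;
}
```

A pass that needs position and uv for the same index favors the interleaved form, since both arrive together; a pass over positions alone favors the separate form.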

