
C++: Vector of Objects vs. Vector of Pointers - joebaf
http://www.bfilipek.com/2014/05/vector-of-objects-vs-vector-of-pointers.html
======
geophile
I was wondering about this very issue the other day, and wrote a little C
program to try something similar. I tried three things:

1. Create an array of random ints (sort them, not that it matters), and then
time how long it takes to compute the sum.

2. Create an array of ints (data) and an array of int* (ptr). Set ptr[i] to
point to data[i]. Scan the data through the ptr array and compute the sum.
Obviously there is very good locality of access to both arrays.

3. Same as #2, but first sort the ptr array so that *ptr[i] <= *ptr[i+1].
There will be good locality on ptr, not so much on data.

For arrays of size 10M, #1 and #2 cost about the same, and #3 was 5x slower.
(This was on my MacBook Pro, using XCode.)
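
Here's a minimal C++ sketch of those three experiments (my reconstruction,
not the original program; timing code omitted for brevity):

    #include <algorithm>
    #include <cstddef>
    #include <numeric>
    #include <random>
    #include <vector>

    int main() {
        const std::size_t n = 10000000;
        std::mt19937 rng(42);

        // #1: array of random ints, summed directly (sequential access).
        std::vector<int> a(n);
        for (int& v : a) v = static_cast<int>(rng());
        std::sort(a.begin(), a.end());  // "not that it matters"
        long long sum1 = std::accumulate(a.begin(), a.end(), 0LL);

        // #2: random data plus a pointer array in the same order; both
        // arrays are scanned sequentially, so locality is good for both.
        std::vector<int> data(n);
        for (int& v : data) v = static_cast<int>(rng());
        std::vector<const int*> ptr(n);
        for (std::size_t i = 0; i < n; ++i) ptr[i] = &data[i];
        long long sum2 = 0;
        for (const int* p : ptr) sum2 += *p;

        // #3: sort ptr by pointee value. ptr is still scanned
        // sequentially, but data is now read in a scattered order.
        std::sort(ptr.begin(), ptr.end(),
                  [](const int* x, const int* y) { return *x < *y; });
        long long sum3 = 0;
        for (const int* p : ptr) sum3 += *p;

        return static_cast<int>((sum1 + sum2 + sum3) & 1);  // keep sums live
    }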

But my main point here is about Java, which has been my main language for many
years. I don't care (much) about its verbosity, its lack of closures, etc. I
can get my work done. But working with large heaps has proven very
problematic, both in avoiding GC overhead and in the effects of non-locality.

While you can have both true int arrays and Integer arrays, the same is not
true of non-primitive types. There is no way to have anything like an array of
structs, with the same compact representation, as you can have in C. There are
benchmarks showing that Java and C are competitive in performance. But C/C++
can be used to preserve locality for arrays of structs and objects in ways
that Java cannot. Java _has to be_ slower in these situations.

There are well-known tricks to address this problem, e.g. using two arrays of
ints instead of an array of Objects that each encapsulate two ints.

For these reasons, I would be very happy to see an idea like Value Types added
to Java:
[http://cr.openjdk.java.net/~jrose/values/values-0.html](http://cr.openjdk.java.net/~jrose/values/values-0.html)

~~~
DannyBee
"While you can have both true int arrays and Integer arrays, the same is not
true of non-primitive types. There is no way to have anything like an array of
structs, with the same compact representation, as you can have in C. There are
benchmarks showing that Java and C are competitive in performance. But C/C++
can be used to preserve locality for arrays of structs and objects in ways
that Java cannot. Java has to be slower in these situations."

Java does not have to be slower in these situations. The fact that something
does not exist in the source language is irrelevant.

You can rearrange, restructure, and generally do whatever you want in a
compiler as long as you preserve the original semantics or can prove they
don't matter to the program.

For the things you are talking about, non-primitive accesses, the
non-primitive types are often split up locally in a function, each piece
treated as a separate value, and then put back together where necessary (if
necessary, in fact).
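
Roughly, that is scalar replacement of aggregates. Illustrated as C++ source
(compilers actually do this on their IR, not on your code):

    // Before: a small aggregate used locally.
    struct Point { double x, y; };

    double len2_before() {
        Point p{3.0, 4.0};               // conceptually one object in memory
        return p.x * p.x + p.y * p.y;
    }

    // After scalar replacement: the aggregate never exists as a unit; each
    // field lives in its own register or stack slot.
    double len2_after() {
        double p_x = 3.0;
        double p_y = 4.0;
        return p_x * p_x + p_y * p_y;
    }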

Given large parts of the program, you can also do this interprocedurally
(it's trickier, though, to know when it is safe).

For arrays of whatever, it may be internally transformed into the same
contiguous representation you see in C.

(i.e. an array of structs transformed into a struct of arrays, which has been
done for a very long time. It became famous because it sped up SPECcpu's
179.art by a very, very large factor).

Besides that, data layout optimizations are pretty common in compilers (as is
loop rearrangement of various sorts; for heavy locality optimization,
algorithms like
[http://pluto-compiler.sourceforge.net/](http://pluto-compiler.sourceforge.net/)
exist).

So in short, it does not _have_ to be slower. It often _is_ for other reasons,
but it does not _have_ to be. It all depends on how much time you are willing
to spend optimizing code vs. running it.

~~~
epistasis
Implicit in the term "Java" is also the JVM, and the JVM has significant
restrictions with regard to the memory layout of arrays. I'm not sure any
compiler can get around that.

~~~
DannyBee
Like what?

I'm honestly curious; the only docs I see say the opposite:

"To implement the Java Virtual Machine correctly, you need only be able to
read the class file format and correctly perform the operations specified
therein. Implementation details that are not part of the Java Virtual
Machine's specification would unnecessarily constrain the creativity of
implementors. For example, the memory layout of run-time data areas, the
garbage-collection algorithm used, and any internal optimization of the Java
Virtual Machine instructions (for example, translating them into machine code)
are left to the discretion of the implementor."

From the JVM spec:
[http://docs.oracle.com/javase/specs/jvms/se7/html/jvms-2.htm...](http://docs.oracle.com/javase/specs/jvms/se7/html/jvms-2.html)

I also know of static java compilers that do mess around with this.

~~~
epistasis
> correctly perform the operations specified therein

Not sure how you're reading the JVM docs to the contrary, as the actual JVM
instruction set differs quite a bit from a normal CPU's.

The JVM has limited operations for calculating memory addresses for arrays.
Typically, this is a great thing, as it means that all memory accesses are
bounds checked and an entire class of memory errors are eliminated.

However, if you're trying to do what geophile is trying to do with memory
layout, you want a calculation like

    
    
    i * sizeof(struct) + base_pointer
    

But with Java, the element size in that calculation can only be the size of
one of the primitive types. And then you have to pop memory addresses and
treat them as different things, etc. And even if you store things as byte
arrays and perform the extra multiply and add on your own, you then have to
collate bytes and cast, which involves lots and lots of JVM operations.
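
For contrast, in C++ the flat layout looks like this (Record is a
hypothetical struct), and the compiler emits exactly that address
computation:

    #include <cstddef>
    #include <vector>

    struct Record { int id; float score; };  // hypothetical two-field struct

    float sum_scores(const std::vector<Record>& recs) {
        float total = 0.0f;
        for (std::size_t i = 0; i < recs.size(); ++i) {
            // The address of recs[i] is base_pointer + i * sizeof(Record):
            // one indexed load, no indirection and no byte collation.
            total += recs[i].score;
        }
        return total;
    }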

With special JVM support it may be possible, but it's not something that is
the least bit natural. And you'd want some sort of standardized support, so
that you could ensure your JVM would translate those 10 instructions into a
single machine instruction every time; basically the Value Types proposal, I
guess.

~~~
DannyBee
I think you may be confused. At the point at which you compile this, you are
no longer dealing with JVM bytecode, but your own IR (translated from JVM
bytecode), and can do whatever you want. Thus, you are not limited by what the
JVM has in terms of instructions.

(and in general, you never are. You are only limited by correctness of program
execution)

If you want to support it at the _bytecode level_, yes, what you are
suggesting is correct, but it doesn't have to be done this way at all, and
rarely is in any language. First, having to muck up source code to get
optimal data layouts is a non-starter. Second, the actual optimal data
layouts are rarely what programmers think they are, due to cache/other
complications (it's usually some very complicated nested struct layout with
explicit packing and unions in various places).

In any case, I believe my base point stands: there is nothing that stops Java
from being fast in these cases, because you don't need _any_ support from
bytecode to do it.

------
grundprinzip
I think there is something missing in the discussion of the article: shared
pointers are used for shared ownership, whereas in the example the author is
assuming unique ownership.

In the benchmark this requires the compiler to generate code that makes sure
the object was not changed, by issuing an exclusive request for the data,
plus the additional pointer dereference.

To check what is really happening I recommend a few things:

1) Look at the generated code and see what the compiler issues.

2) Compare object* vs shared_ptr<object>.

3) Compare unique_ptr vs object vs object*.

My assumption is that unique_ptr will be faster than shared_ptr and have the
same speed as object*, and that both will be a bit slower than plain object.
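
A rough harness for 2) and 3) might look like this (a sketch only: Particle
and the sizes are made up, and a real benchmark still needs warmup,
repetition, and a look at the generated code, per 1)):

    #include <chrono>
    #include <cstdio>
    #include <memory>
    #include <vector>

    struct Particle { float x, y, z, w; };  // hypothetical stand-in payload

    template <typename F>
    long long time_ms(F f) {
        auto t0 = std::chrono::steady_clock::now();
        f();
        auto t1 = std::chrono::steady_clock::now();
        return std::chrono::duration_cast<std::chrono::milliseconds>(t1 - t0)
            .count();
    }

    int main() {
        const std::size_t n = 1000000;
        std::vector<Particle> by_value(n);
        std::vector<std::unique_ptr<Particle>> by_unique;
        std::vector<std::shared_ptr<Particle>> by_shared;
        for (std::size_t i = 0; i < n; ++i) {
            by_unique.push_back(std::make_unique<Particle>());
            by_shared.push_back(std::make_shared<Particle>());
        }
        float sink = 0;
        auto t_val = time_ms([&] { for (auto& p : by_value)  sink += p.x;  });
        auto t_uni = time_ms([&] { for (auto& p : by_unique) sink += p->x; });
        auto t_shr = time_ms([&] { for (auto& p : by_shared) sink += p->x; });
        std::printf("value %lld ms, unique %lld ms, shared %lld ms (sink %f)\n",
                    t_val, t_uni, t_shr, sink);
    }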

~~~
ah-
You're completely right about shared_ptr not being the right tool for this use
case.

However, I haven't tried it, but I just looked up the implementation of
std::shared_ptr in GCC 4.9.0, and accessing it is exactly the same as with a
raw pointer: operator-> and operator* just return the internal pointer and do
nothing else.

There's some slight space overhead associated with shared_ptr, since it has
to keep the two counters for shared and weak references (in a control block
that each shared_ptr points to), so that might influence the runtime.
unique_ptr should behave exactly like a raw pointer.

That said, I'd love to see some benchmarks about this as well and would be
even happier to be proven wrong.
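
The space overhead part is easy to check. On typical implementations (the
standard doesn't mandate exact sizes), unique_ptr with the default deleter is
one raw pointer and shared_ptr is two:

    #include <cstdio>
    #include <memory>

    int main() {
        // shared_ptr typically carries the object pointer plus a pointer to
        // the separately allocated control block holding the two counters.
        std::printf("raw:    %zu\n", sizeof(int*));
        std::printf("unique: %zu\n", sizeof(std::unique_ptr<int>));
        std::printf("shared: %zu\n", sizeof(std::shared_ptr<int>));
    }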

Another aspect I find interesting about this is that as the objects grow, it
makes sense to store them column-wise (one array/chunk of memory just for
member1, one for member2, ...) instead of row-wise.

I don't know any languages besides Q that make this way of storing data
easy/the default.
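
In C++ you can hand-roll the column-wise layout easily enough, though nothing
does it for you (hypothetical example):

    #include <vector>

    // Row-wise (array of structs): one allocation, fields interleaved.
    struct Reading { double time; double value; int sensor_id; };
    std::vector<Reading> rows;

    // Column-wise (struct of arrays): one contiguous array per member, so a
    // scan that only touches `value` never pulls the other fields into cache.
    struct Readings {
        std::vector<double> time;
        std::vector<double> value;
        std::vector<int>    sensor_id;
    };

    double total(const Readings& r) {
        double sum = 0;
        for (double v : r.value) sum += v;   // dense, cache-friendly scan
        return sum;
    }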

------
blt
I feel like we are recovering from a 20-year programming bender of heap
allocations and dynamic dispatch. The "scripting" languages, Java, and crappy
C++ books somehow convinced us that it was OK not to know what anything
actually IS in memory, and to use some clever tricks (vtables, dicts) to figure it
out at the last minute. It's a great solution to certain problems but it's
really NOT necessary for huge swaths of programming.

~~~
astral303
On the contrary, I would apply the 90/10 rule and say that you only need to
know what actually IS in memory when you have to care about it (when it
becomes a performance problem). For the other 90% of situations, I'll take
the development speed/convenience.

Also, as an example, most JITs (including the JVM's) optimize through vtable
lookups (so you can inline a whole series of polymorphic calls that might
boil down to a constant value). That's an example of an ease-of-development
vs. performance tradeoff (polymorphic methods) that's been alleviated in an
automated way by tools.

~~~
blt
That's what I'm arguing, though. I think it's way less than 90/10; 50/50 at
most. It's not just a performance thing. Polymorphism moves the burden of
reasoning about types from compile time to run time, and complicates the
programmer's mental model of the code. Sure, the use of interfaces appears to
separate out the differences between types into clean capsules, but in
practice understanding much of the polymorphic code I've worked on still
requires a mental process of "OK, if it's THIS type then this happens; if
it's THAT type then something else happens." And then look at dynamic
languages, where loads of unit tests do the work of the type system. I'm not
saying it's never useful, but its usefulness is overhyped. Many of these
cases can be replaced by small localized changes at compile time, or a
one-time cost of copying data into a canonical form.

~~~
Dewie
> Polymorphism moves the burden of reasoning about types from compile time to
> run time,

How? Doesn't this depend on the specific kind of polymorphism, or the
implementation of it, rather than being about polymorphism inherently? What
if types are erased at compile time and the compiler generates functions and
such for all the relevant types that you need?

> but in practice understanding much of the polymorphic code I've worked on
> still requires a mental process of "OK, if it's THIS type then this happens;
> if it's THAT type then something else happens."

I think this is impossible with parametric polymorphism - if you say that the
input data can be any type, you can't make any assumptions about what type it
might be. If the incoming type only has the constraint that it can be tested
for equality against other values of the same type, the only function you can
use on it is ==, and so on.
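
In C++ terms, a constrained template makes the same point: inside this
(hypothetical) function, the only thing the body can do with T is compare for
equality (C++20 concepts shown):

    #include <concepts>
    #include <vector>

    // T can be any type at all, as long as == works on it; the function body
    // therefore cannot assume anything else about T.
    template <std::equality_comparable T>
    bool contains(const std::vector<T>& xs, const T& needle) {
        for (const T& x : xs)
            if (x == needle) return true;
        return false;
    }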

------
tfigment
If you have not read Modern C++ Design by Andrei Alexandrescu, you are doing
yourself a disservice if you are interested in performance. While it's a bit
old, I guess, the Loki library is/was a wonderful C++ library, with very
interesting and useful techniques that give the programmer control while
preserving things like type safety.

The design here may also benefit from a custom allocator, so that objects of
the same type are constructed contiguously in memory, especially if you want
more than one vector of these objects in the code. The above book also has a
number of smart pointer constructs that are probably similar to these
pointers, but with customizable behaviors and strategies.
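
As a sketch of the custom-allocator idea (a bare-bones bump arena of my own
devising, not Loki's allocators): objects placed through something like this
end up adjacent in memory, so traversing them through pointers still walks a
contiguous block.

    #include <cstddef>
    #include <new>
    #include <utility>
    #include <vector>

    // Bare-bones bump arena: hands out contiguous storage for objects of one
    // type. Nothing is freed per-object; everything dies with the arena.
    // Assumes alignof(T) <= alignof(std::max_align_t) and, for brevity,
    // skips running destructors.
    template <typename T>
    class Arena {
        std::vector<std::byte> buf_;
        std::size_t used_ = 0;
    public:
        explicit Arena(std::size_t capacity) : buf_(capacity * sizeof(T)) {}

        template <typename... Args>
        T* create(Args&&... args) {
            T* p = new (buf_.data() + used_) T(std::forward<Args>(args)...);
            used_ += sizeof(T);
            return p;
        }
    };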

As someone else mentioned, when code hits the real world things tend to break
down, but if you are not following a good memory design to begin with, you
don't even have the hope of getting the better-performing version.

------
al2o3cr
One clarification: "Just by moving data around, by carefully storing it in one
place, we can get huge performance speed up."

should really be:

Just by moving data around, by carefully storing it in one place, _and by not
doing anything else that messes up caching or consumes CPU cycles_, we can
get huge performance speed up.

A microbenchmark like this will always show results that are unattainable in
practice; the work in real code is likely to take considerably more time than
the fetch.

~~~
taeric
To be fair, if you have your data set up in such a way that you know what is
coming next, prefetching is a very viable strategy.

That is, this might not be as unattainable as you seem to be implying. There
was a great presentation by someone in the gaming community about this. I
can't find it right off, but at a glance this[1] article seems pretty good,
and is basically this idea.

[1] [http://gameprogrammingpatterns.com/data-locality.html](http://gameprogrammingpatterns.com/data-locality.html)
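
For example, when scanning a pointer array you already know the next few
targets, so you can ask for them ahead of time. A sketch using the GCC/Clang
`__builtin_prefetch` builtin (the distance of 8 is a tuning knob, not a magic
number):

    #include <cstddef>

    // Sum through a pointer array, prefetching a few elements ahead.
    long long sum(const int* const* ptr, std::size_t n) {
        long long s = 0;
        for (std::size_t i = 0; i < n; ++i) {
            if (i + 8 < n)
                __builtin_prefetch(ptr[i + 8]);  // hint: needed soon
            s += *ptr[i];
        }
        return s;
    }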

~~~
munificent
> There was a great presentation by someone in the gaming community about
> this.

You're probably thinking of Tony Albrecht's fantastic "Pitfalls of Object-
Oriented Programming" talk which is referenced by (and inspired!) the chapter:

[http://research.scee.net/files/presentations/gcapaustralia09...](http://research.scee.net/files/presentations/gcapaustralia09/Pitfalls_of_Object_Oriented_Programming_GCAP_09.pdf)

~~~
taeric
That is indeed what I saw. Thanks for posting!

------
ComputerGuru
It's really a very poor example of real-world behavior. There _are_ cases
where pointers will perform competitively against objects, and sometimes
shine.

Vectors are awesome because of contiguous memory allocation. Vectors are also
horrible because of contiguous memory allocation. Really, the rule is that you
have to understand what is going on beneath the data structure abstraction.
When your vector starts resizing, when you _don 't_ know beforehand how many
elements will be in it - that's when you start to consider using pointers
instead.

In the very first line of code, he reserves space for `count` objects - not
something you're always able to do. (When you don't reserve the space
beforehand, the vector is created with some initial capacity, and then each
time that is exceeded, all of the vector's elements must be copied to a new,
larger allocation - rinse and repeat as needed.)
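
You can watch that growth happen (the growth factor is implementation-
defined, commonly 1.5x or 2x):

    #include <cstddef>
    #include <cstdio>
    #include <vector>

    int main() {
        std::vector<int> v;
        std::size_t cap = v.capacity();
        for (int i = 0; i < 1000; ++i) {
            v.push_back(i);
            if (v.capacity() != cap) {    // a reallocation just happened:
                cap = v.capacity();       // every element was moved/copied
                std::printf("size %zu -> capacity %zu\n", v.size(), cap);
            }
        }
        // v.reserve(1000) up front would have avoided all of those.
    }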

Obviously there are always data structures better suited to your needs; in
this case a deque might come in handy (it's quite the underrepresented STL
data structure).

------
pkolaczk
I'm surprised the version with pointers was _only_ about 2x slower, not 10x
slower as some people bashing Java often suggest. And this despite the fact
that shared_ptrs in C++ are generally heavier than references in Java/C#.

~~~
ah-
I suspect this is because he only benchmarks small data sets, even the largest
one used easily fits into the CPU cache. It would be interesting to see the
number of cache misses for each benchmark, perf stat should be able to record
that data.

And even the shuffling doesn't really simulate a realistic distribution in
memory; it'd be better to also allocate random memory blocks in between the
objects.

~~~
pkolaczk
And even if shuffling did that, comparisons to Java would still be invalid,
because Java has completely different algorithms for placing objects in
memory, which tend to bring referenced objects close to each other.

------
halayli
Related talk:

From minute 46 onward:
[http://channel9.msdn.com/Events/Build/2014/2-661](http://channel9.msdn.com/Events/Build/2014/2-661)

------
daemin
At least he's using std::make_shared<> rather than allocating twice for each
object.

Though it would be interesting to see what sort of difference there is using
make_shared<>(...) vs shared_ptr(new ...).

And also what difference there would be when the vector needs to be resized
and the objects are noexcept movable or need to be copied.
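
A sketch of both points (Widget is a made-up type): make_shared fuses the
object and control-block allocations, and vector only moves elements during
reallocation when the move constructor is noexcept:

    #include <memory>

    struct Widget {
        int data[16];
        Widget() = default;
        Widget(const Widget&) = default;
        // Without noexcept here, vector reallocation falls back to copying,
        // since a throwing move halfway through would break the strong
        // exception guarantee.
        Widget(Widget&&) noexcept = default;
    };

    int main() {
        // Two allocations: one for the Widget, one for the control block.
        std::shared_ptr<Widget> a(new Widget);

        // One allocation: Widget and control block in a single memory block.
        auto b = std::make_shared<Widget>();
    }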

