C++: Vector of Objects vs. Vector of Pointers (bfilipek.com)
65 points by joebaf on June 20, 2014 | 50 comments



I was wondering about this very issue the other day, and wrote a little C program to try something similar. I tried three things:

1. Create an array of random ints (sort them, not that it matters), and then time how long it takes to compute the sum.

2. Create an array of ints (data) and an array of int* (ptr). Set ptr[i] to point to data[i]. Scan the data through the ptr array and compute the sum. Obviously there is very good locality of access to both arrays.

3. Same as #2, but first sort the ptr array so that *ptr[i] <= *ptr[i+1]. There will be good locality on ptr, not so much on data.

For arrays of size 10M, #1 and #2 cost about the same, and #3 was 5x slower. (This was on my MacBook Pro, using XCode.)
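For anyone curious, a rough sketch of the three access patterns described above (not the poster's actual code; the timing harness and the initial sort are omitted):

    #include <algorithm>
    #include <cstdio>
    #include <cstdlib>
    #include <vector>

    int main() {
        const int N = 10000000;
        std::vector<int> data(N);
        for (int i = 0; i < N; ++i) data[i] = std::rand();

        // #1: sum the ints directly -- sequential access, best locality
        long sum1 = 0;
        for (int i = 0; i < N; ++i) sum1 += data[i];

        // #2: sum through a pointer array that mirrors the data order --
        // both arrays are walked sequentially
        std::vector<const int*> ptr(N);
        for (int i = 0; i < N; ++i) ptr[i] = &data[i];
        long sum2 = 0;
        for (int i = 0; i < N; ++i) sum2 += *ptr[i];

        // #3: sort the pointers by pointee value -- ptr is still walked
        // sequentially, but the loads through *ptr[i] jump around in data
        std::sort(ptr.begin(), ptr.end(),
                  [](const int* a, const int* b) { return *a < *b; });
        long sum3 = 0;
        for (int i = 0; i < N; ++i) sum3 += *ptr[i];

        std::printf("%ld %ld %ld\n", sum1, sum2, sum3);
        return 0;
    }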

But my main point here is about Java, which has been my main language for many years. I don't care (much) about its verbosity, its lack of closures, etc. I can get my work done. But working with large heaps has proven very problematic, both in avoiding GC overhead and in the effects of non-locality.

While you can have both true int arrays and Integer arrays, the same is not true of non-primitive types. There is no way to have anything like an array of structs, with the same compact representation, as you can have in C. There are benchmarks showing that Java and C are competitive in performance. But C/C++ can be used to preserve locality for arrays of structs and objects in ways that Java cannot. Java has to be slower in these situations.

There are well-known tricks to address this problem. E.g. use two arrays of ints instead of an array of Objects which encapsulate two ints.

For these reasons, I would be very happy to see an idea like Value Types added to Java: http://cr.openjdk.java.net/~jrose/values/values-0.html


"While you can have both true int arrays and Integer arrays, the same is not true of non-primitive types. There is no way to have anything like an array of structs, with the same compact representation, as you can have in C. There are benchmarks showing that Java and C are competitive in performance. But C/C++ can be used to preserve locality for arrays of structs and objects in ways that Java cannot. Java has to be slower in these situations. "

Java does not have to be slower in these situations. The fact that something does not exist in the source language is irrelevant.

You can rearrange, restructure, and generally do whatever you want in a compiler as long as you preserve the original semantics or can prove they don't matter to the program.

For the things you are talking about, non-primitive accesses, the non-primitive types are often split up locally in a function, each field treated as a separate value, and then put back together where necessary (if necessary, in fact).

Given large parts of the program, you can do this interprocedurally as well (though it's trickier to know when it is safe).

For arrays of whatever, it may be internally transformed into the same contiguous representation you see in C.

(Data-layout transformations of this kind, e.g. converting between array-of-structs and struct-of-arrays, have been done for a very long time. One became famous because it sped up SPEC CPU's 179.art by a very large factor.)

Besides that, data-layout optimizations are pretty common in compilers, as is loop rearrangement of various sorts. For heavy locality optimization, algorithms like PLUTO (http://pluto-compiler.sourceforge.net/) exist.

So in short, it does not have to be slower. It often is for other reasons, but it does not have to be. It all depends on how much time you are willing to spend optimizing code versus running it.


Implicit with the term "Java" is also the JVM, and the JVM has significant restrictions with regard to memory layout of arrays. I'm not sure any compiler can get around that.


Like what?

I'm honestly curious; the only docs I see say the opposite:

"To implement the Java Virtual Machine correctly, you need only be able to read the class file format and correctly perform the operations specified therein. Implementation details that are not part of the Java Virtual Machine's specification would unnecessarily constrain the creativity of implementors. For example, the memory layout of run-time data areas, the garbage-collection algorithm used, and any internal optimization of the Java Virtual Machine instructions (for example, translating them into machine code) are left to the discretion of the implementor."

From the JVM spec: http://docs.oracle.com/javase/specs/jvms/se7/html/jvms-2.htm...

I also know of static Java compilers that do mess around with this.


>correctly perform the operations specified therein

Not sure how you're reading the JVM docs to the contrary, as the actual JVM instruction set differs quite a bit from that of a normal CPU.

The JVM has limited operations for calculating memory addresses for arrays. Typically, this is a great thing, as it means that all memory accesses are bounds checked and an entire class of memory errors are eliminated.

However, if you're trying to do what geophile is trying to do with memory layout, you want a calculation like

  i * sizeof(struct) + base_pointer
But with Java the sizeof(struct) can only be one of the base types. And then you have to pop memory addresses and treat them as different things, etc. And even if you store things as byte arrays and perform an extra multiplication and add on your own, you then have to collate bytes and cast, which involves lots and lots of JVM operations.

With special JVM support, it may be possible, but it's not something that is the least bit natural. And you'd want some sort of standardized support so that you could ensure that your JVM would support the special translation of 10 instructions into a single machine instruction every time, basically the Value Types proposal, I guess.


I think you may be confused. At the point at which you compile this, you are no longer dealing with JVM bytecode, but your own IR (translated from JVM bytecode), and can do whatever you want. Thus, you are not limited by what the JVM has in terms of instructions.

(and in general, you never are. You are only limited by correctness of program execution)

If you want to support it at the bytecode level, yes, what you are suggesting is correct, but it doesn't have to be done this way at all, and rarely is in any language. In general, having to muck up source code to get optimal data layouts is a non-starter. Second, the actual optimal data layouts are rarely what programmers think they are, due to cache/other complications (it's usually some very complicated nested struct layout with explicit packing and unions in various places).

In any case, I believe my base point stands: there is nothing that stops Java from being fast in these cases, because you don't need any support from bytecode to do it.


More importantly, even if a compiler does optimize that, you can't rely on it: the next version of the compiler could have slightly different conditions that trigger the optimization, and your code could run 5x slower for no obvious reason.


Compiler bugs and regressions are a fact of life, it's why we have regression tests :)


> For arrays of whatever, it may be internally transformed into the same contiguous representation you see in C.

This is a bit tricky because Java and C semantics are not the same, and an optimizer is obliged to prove that the difference does not matter before allowing the transformation. (It could also be given license by some kind of annotation.)

Sharing of mutated fields and Java's == are the major obstacles to doing this transparently, as far as I know. They turn out not to matter if there are no other references to the objects which you want to embed in the array, but proving that (in all the cases in which you would want it to apply) is a bit tough.


I was not aware of PLUTO, thanks for the pointer (no pun intended). If you have used it, could you share more? I am particularly interested in how the speedup compares to accessing polyhedral loop optimization from inside gcc (or g++). GCC has an interface to CLooG and Graphite, which seem to target the same features.

With that out of the way, "can be" is a very weak statement. Almost anything "can be", what matters more is "does it now".

@DannyBee Reply appreciated.


Pluto's basic algorithm was implemented in ISL, and can be used by Graphite. There are patches to do it. The optimizer I last remember in Graphite was very basic, so you'd see much larger speedups if you used Pluto.

Graphite in GCC is, AFAIK, semi-abandoned. The research and future work moved to Polly and LLVM (at least, most of the developers were hired and put to work on it). I don't know who still develops Graphite for GCC, if anyone.

Polly supports both the ISL version of the algorithm and, in fact, linking to Pluto directly, but I don't have any performance numbers handy.

"With that out of the way, "can be" is a very weak statement. Almost anything "can be", what matters more is "does it now". "

IMHO, not really. Sure, in the sense that it's not useful if nobody ever does it. But if this was the performance problem a lot of people were having, it'd be fixed. So in that sense, it matters a lot more what can be.



I'm not sure how you measured your programs; if you measured them as a whole, then the random generation and sorting probably dominated the run time. On my machine #1 is 3x faster than #2 for just the summing part of 10M 32-bit integers.

Code: http://ideone.com/xIV3f9


Sorry I wasn't clear, I just measured the scan time, after loading and sorting the arrays.


Then my results disagree with yours. Do you have the source code to your benchmark?


By the way, I typically see #2 slightly and consistently faster than #1, which doesn't make sense to me. But I wouldn't expect a 3x difference in favor of #1.



The compiler is entirely eliminating your loop because you never do anything with the sum variable. Also, signed integer overflow is UB, so you should use unsigned integers or compile with the non-standard -fwrapv option.


I'm not convinced of this. I modified the code to do something with the sums. #2 is now slightly slower than #1, but the absolute times and the ratios are otherwise about the same.

http://pastebin.com/vbUSdKGc


With your modified code I am now seeing #1 run about 3x faster than #2. I also still strongly suspect you weren't measuring anything at all before your modifications. Benchmarking is tricky and it's hard to say if you are really measuring what you think you are. This is especially true with UB in your loop. It would be best to separate the program into three and then investigate the assembly.


Also make sure you are compiling with optimizations on.


Now working in an Ubuntu VM, since I know the tools better than I do XCode. I reduced TRIALS to 20, to reduce testing time. I also replaced summation with xor. My current code is here: http://pastebin.com/QcNu2A69

Here is what I see with default optimization:

    jao@minizack:/tmp$ gcc -std=c99 x.c
    jao@minizack:/tmp$ ./a.out
    INTERNAL time/scan (usec): 29608.000000		xor: 1722783152
    EXTERNAL UNSORTED time/scan (usec): 26688.105263		xor: 1722783152
    EXTERNAL SORTED time/scan (usec): 279771.105263		xor: 1722783152
    SORTED/UNSORTED: 10.482989
and with -O3:

    jao@minizack:/tmp$ gcc -std=c99 -O3 x.c
    jao@minizack:/tmp$ ./a.out
    INTERNAL time/scan (usec): 3636.421053		xor: 1722783152
    EXTERNAL UNSORTED time/scan (usec): 12114.263158		xor: 1722783152
    EXTERNAL SORTED time/scan (usec): 242141.263158		xor: 1722783152
    SORTED/UNSORTED: 19.988113
The INTERNAL time is 8x faster. EXTERNAL UNSORTED is 2x faster. EXTERNAL SORTED hasn't sped up much at all. My conclusions:

- With -O3, the difference between INTERNAL (#1) and EXTERNAL UNSORTED (#2) is close to what you are seeing.

- Loops cannot be optimized away in this version.

- The fact that -O3 speeds up EXTERNAL UNSORTED by 2x but EXTERNAL SORTED only a little makes sense if the latter is slow due to cache misses.


Well, benchmarking without optimizations enabled isn't very useful :)


You can hack together an "array of structs" object using Unsafe, but the amount of code involved when using Java is unreasonable. It was manageable in Scala, but this is no reason to introduce Scala to a non-Scala codebase.


C# has value types and is Java-like.

I assume you know of it and have your reasons, just couldn't help but point it out.


Even without value types, Java/C# can probably do much better on the pointer-vector case than C++, thanks to allocation from a contiguous region and automatic memory defragmentation. It is very likely all the objects referenced from the vector would be placed in a contiguous memory region which is easy to prefetch for the CPU (sure, still not as fast as inlining them into the array). This is completely different from C++, where subsequent dynamic allocations return pointers scattered around the whole heap, especially after a long run.


It is bad for performance that Java doesn't have structs, but you can work around this somewhat using byte buffers:

http://www.kdgregory.com/?page=java.byteBuffer


In the case of arrays, you can also have parallel arrays. E.g.:

  class Something {
    public int foo;
    public int bar;
  }

  new Something[100];
becomes:

  int[] foos = new int[100];
  int[] bars = new int[100];
And write appropriate getters/setters.


Why couldn't Java do #2? When it sees you allocate the array, it could preallocate or reserve a block of memory for the objects that are about to fill it.


In the general case, you need to be able to put arbitrary objects into the array, regardless of how they were allocated. You also need to be able to keep track of null elements.


I feel like we are recovering from a 20-year programming bender of heap allocations and dynamic dispatch. The "scripting" languages, Java, and crappy C++ books somehow convinced us that it was OK not to know what anything actually IS in memory, and use some clever tricks (vtable, dicts) to figure it out at the last minute. It's a great solution to certain problems but it's really NOT necessary for huge swaths of programming.


On the contrary, I would apply the 90/10 rule and would say that you only need to know what actually IS in memory when you have to care about it (and it becomes a performance problem). And for the other 90% of the situations, I'll take the development speed/convenience.

Also, as an example, most JITs (including the JVM one) optimize through vtable lookups (so you can inline a whole series of polymorphic calls that might boil down to a constant value). That's an example of an ease of dev vs performance tradeoff (polymorphic methods) that's been alleviated in an automated way by tools.


Then again, there are countless websites which fail to outperform a comparable desktop app written in C++ 15 years ago, on today's powerful computers, which is sad.


That's what I'm arguing though. I think it's way less than 90/10. 50/50 at most. It's not just a performance thing. Polymorphism moves the burden of reasoning about types from compile time to run time, and complicates the programmer's mental model of the code. Sure, the use of interfaces appears to separate out the differences between types into clean capsules, but in practice understanding much of the polymorphic code I've worked on still requires a mental process of "OK, if it's THIS type then this happens; if it's THAT type then something else happens." And then look at dynamic languages, where loads of unit tests do the work of the type system. I'm not saying it's never useful, but its usefulness is overhyped. Many of these cases can be replaced by small localized changes at compile time or a one-time cost of copying data into a canonical form.


> Polymorphism moves the burden of reasoning about types from compile time to run time,

How? Doesn't this depend on the specific kind of polymorphism/implementation of which, rather than being about polymorphism inherently? What if types are erased at compile-time and the compiler generates functions and such for all the relevant types that you need?

> but in practice understanding much of the polymorphic code ive worked on still requires a mental process of "OK, if it's THIS type then this happens; if its THAT type then something else happens."

I think this is impossible with parametric polymorphism - if you say that the input data can be any type, you can't make any assumptions about what type it might be. If the incoming type only has the constraint that it can be tested for equality with other values of the same type, the only function you can use on it is ==, and so on.


I think there is something missing in the discussion of the article: shared pointers are used for shared ownership, whereas in the example the author is assuming unique ownership.

In the benchmark this requires the compiler to generate code that makes sure that the object was not changed by issuing an exclusive request for the data plus the additional pointer dereferencing.

To check what is really happening I recommend three things:

1) Look at the generated code and see what the compiler issues.

2) Compare object* vs shared_ptr<object>.

3) Compare unique_ptr vs object vs object*.

My assumption is that unique_ptr will be faster than shared_ptr and have the same speed as object*, and that this will be a bit slower than plain object.
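A rough sketch of the comparison being suggested (Particle is just a hypothetical stand-in for the article's object type; the timing harness is omitted):

    #include <memory>
    #include <vector>

    // Particle is a hypothetical stand-in for the article's object type.
    struct Particle {
        float pos[4], vel[4], acc[4];
        void update(float dt) { for (int i = 0; i < 4; ++i) pos[i] += vel[i] * dt; }
    };

    int main() {
        const int count = 10000;

        // The layouts to time against each other:
        std::vector<Particle> byValue(count);                 // contiguous, no indirection
        std::vector<Particle*> rawPtr;                        // one indirection, no ownership overhead
        std::vector<std::unique_ptr<Particle>> uniquePtr;     // one indirection, no per-access overhead
        std::vector<std::shared_ptr<Particle>> sharedPtr;     // one indirection + ref-count bookkeeping

        for (int i = 0; i < count; ++i) {
            rawPtr.push_back(new Particle());
            uniquePtr.push_back(std::unique_ptr<Particle>(new Particle()));
            sharedPtr.push_back(std::make_shared<Particle>());
        }

        // Time a loop like this for each container:
        for (auto& p : byValue)   p.update(1.0f);
        for (auto* p : rawPtr)    p->update(1.0f);
        for (auto& p : uniquePtr) p->update(1.0f);
        for (auto& p : sharedPtr) p->update(1.0f);

        for (auto* p : rawPtr) delete p;                      // raw pointers need manual cleanup
        return 0;
    }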


You're completely right about shared_ptr not being the right tool for this use case.

However, I haven't tried it, but I just looked up the implementation of std::shared_ptr in GCC 4.9.0, and accessing it is exactly the same as with a raw pointer: operator-> and operator* just return the internal pointer and do nothing else.

There's some slight space overhead associated with shared_ptr, since it has to store the two counters for shared and weak references, so that might influence the runtime. unique_ptr should behave exactly like a raw pointer.

That said, I'd love to see some benchmarks about this as well and would be even happier to be proven wrong.

Another aspect I find interesting about this is that as the objects grow, it makes sense to store them column-wise (one array/chunk of memory just for member1, one for member2, ...) instead of row-wise.

I don't know any languages besides Q that make this way of storing data easy/the default.
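For what it's worth, in C++ the column-wise layout has to be spelled out by hand; a minimal sketch, with made-up member names:

    #include <cstddef>
    #include <vector>

    // Row-wise (array of structs): whole objects stored side by side.
    struct ParticleAoS { float x, y, z, mass; };
    // an AoS container would be: std::vector<ParticleAoS> rows;

    // Column-wise (struct of arrays): one contiguous array per member.
    // A pass that only touches x never pulls y, z or mass into cache.
    struct ParticlesSoA {
        std::vector<float> x, y, z, mass;

        void add(float px, float py, float pz, float m) {
            x.push_back(px); y.push_back(py); z.push_back(pz); mass.push_back(m);
        }
        std::size_t size() const { return x.size(); }
    };

    void advance(ParticlesSoA& p, float dx) {
        for (std::size_t i = 0; i < p.size(); ++i)
            p.x[i] += dx;   // streams through one tightly packed array only
    }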


I thought that the main point of the article was about memory locality. If that is so, then the kind of pointer (auto vs. shared) is a distraction, isn't it? (My experiment was intended to be about locality only.)


If you have not read Modern C++ Design by Andrei Alexandrescu and you are interested in performance, you are doing yourself a disservice. While it's a bit old, I guess, the Loki library is/was a wonderful C++ library with very interesting and useful techniques that give the programmer control while preserving things like type safety.

The design here may also benefit from a custom allocator so that objects of same type are constructed in memory in a contiguous fashion. Especially if you want more than one vector of these objects in the code. The above book also has a number of smart pointer constructs that are probably similar to these pointers but with customizable behaviors and strategies.
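A very rough sketch of that idea, using a simple bump arena rather than a full custom std::allocator (the names and sizes here are invented):

    #include <cstddef>
    #include <new>
    #include <vector>

    // A minimal bump/arena allocator (a sketch, not a drop-in std::allocator):
    // objects created through it land next to each other in one big block,
    // so a vector of pointers into it still walks mostly contiguous memory.
    class Arena {
        std::vector<char> storage_;
        std::size_t used_;
    public:
        explicit Arena(std::size_t bytes) : storage_(bytes), used_(0) {}

        template <typename T>
        T* create() {
            // round up to the alignment T requires
            std::size_t offset = (used_ + alignof(T) - 1) / alignof(T) * alignof(T);
            if (offset + sizeof(T) > storage_.size()) throw std::bad_alloc();
            used_ = offset + sizeof(T);
            return new (storage_.data() + offset) T();
        }
        // Caveat: destructors are never run here; memory is reclaimed only
        // when the Arena itself is destroyed.
    };

    struct Object { float data[16]; };

    int main() {
        Arena arena(1 << 20);                  // 1 MiB pool, size picked arbitrarily
        std::vector<Object*> objects;
        for (int i = 0; i < 1000; ++i)
            objects.push_back(arena.create<Object>());
        return 0;
    }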

As someone else mentioned, when code hits the real world things tend to break down, but if you are not following a good memory design to begin with you don't even have a hope of getting the better-performing version.


One clarification: "Just by moving data around, by carefully storing it in one place, we can get huge performance speed up."

should really be:

Just by moving data around, by carefully storing it in one place, and by not doing anything else that messes up caching or consumes CPU cycles, we can get a huge performance speedup.

A microbenchmark like this will always show results that are unattainable in practice; the work in real code is likely to take considerably more time than the fetch.


To be fair, if you have your data set up in such a way that you know what is coming next, prefetching is a very viable strategy.

That is, this might not be as unattainable as you seem to be implying. There was a great presentation by someone in the gaming community about this. I can't find it right off, but at a glance this[1] article seems pretty good, and is basically this idea.

[1] http://gameprogrammingpatterns.com/data-locality.html


> There was a great presentation by someone in the gaming community about this.

You're probably thinking of Tony Albrecht's fantastic "Pitfalls of Object-Oriented Programming" talk which is referenced by (and inspired!) the chapter:

http://research.scee.net/files/presentations/gcapaustralia09...


That is indeed what I saw. Thanks for posting!


It's really a very poor example of real-world behavior. There are cases where pointers will perform competitively against objects, and sometimes shine.

Vectors are awesome because of contiguous memory allocation. Vectors are also horrible because of contiguous memory allocation. Really, the rule is that you have to understand what is going on beneath the data structure abstraction. When your vector starts resizing, when you don't know beforehand how many elements will be in it - that's when you start to consider using pointers instead.

In the very first line of code, he reserves space for `count` variables - not something you're always able to do. (When you don't reserve the space beforehand, the vector starts with some initial capacity, and each time it is exceeded, the existing elements must be copied to a new, larger allocation - rinse and repeat as needed.)

Obviously there are always better data structures more suited to your needs; in this case a deque might come in handy (it's quite the underrepresented STL data structure).
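A small illustration of the reserve() point (growth factors and reallocation counts vary by standard library implementation):

    #include <cstddef>
    #include <cstdio>
    #include <vector>

    int main() {
        const int n = 1000000;

        std::vector<int> grown;        // no reserve: reallocates and copies as it grows
        std::vector<int> reserved;
        reserved.reserve(n);           // one allocation up front; elements never move

        int reallocations = 0;
        for (int i = 0; i < n; ++i) {
            std::size_t before = grown.capacity();
            grown.push_back(i);
            if (grown.capacity() != before) ++reallocations;
            reserved.push_back(i);     // never reallocates after reserve(n)
        }
        std::printf("reallocations without reserve: %d\n", reallocations);
        return 0;
    }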


I'm surprised the version with pointers was only about 2x slower, not 10x slower as some people bashing Java often suggest. And this despite the fact that shared_ptrs in C++ are generally heavier than references in Java/C#.


I suspect this is because he only benchmarks small data sets; even the largest one used easily fits into the CPU cache. It would be interesting to see the number of cache misses for each benchmark; perf stat should be able to record that data.

And even the shuffling doesn't really simulate a realistic distribution in memory; it'd be better to also allocate random memory blocks in between the objects.


And even if shuffling did that, comparisons to Java would still be invalid, because Java has completely different algorithms for placing objects in memory, which tend to bring referenced objects close to each other.


That was my reaction. I mean, let's think about all the overhead the shared_ptr adds:

1) Layer of indirection

2) Non-contiguous memory

3) shared_ptr overhead itself as you mentioned.

All that and only 2x slower? For many applications performance of this particular operation is a rounding error of a rounding error.

This tells me not to worry about optimizing my iterations across datasets prematurely.



At least he's using std::make_shared<> rather than allocating twice for each object.

Though it would be interesting to see what sort of difference there is using make_shared<>(...) vs shared_ptr(new ...).

And also what difference there would be when the vector needs to be resized and the objects are noexcept movable or need to be copied.
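For reference, a minimal sketch of the two construction forms being compared, with Object standing in for the article's type:

    #include <memory>
    #include <vector>

    struct Object { float data[16]; };   // stand-in for the article's type

    int main() {
        std::vector<std::shared_ptr<Object>> v;

        // Two heap allocations: one for the Object, one for the control block
        // holding the reference counts.
        v.push_back(std::shared_ptr<Object>(new Object()));

        // One heap allocation holding the Object and its control block together,
        // which also keeps the counts adjacent to the object in memory.
        v.push_back(std::make_shared<Object>());
        return 0;
    }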



