Linked lists have more consistent performance than std::vector. And taking half ...

sillysaurus3 · on Jan 24, 2015

Wait a sec. You can't bust out numbers like that without being clear about a few key points:

1. What was std::vector's removal algorithm like? Are you sure that .remove() wasn't shifting all elements in the vector? The "correct" removal algorithm in that case looks like this: Swap the target element to the end, and then decrease the vector's size by 1. But if I remember correctly, std::vector's remove method doesn't work like that, so of course it will have terrible worst-case performance if it was shifting all elements of the array when you remove the first one.

2. What's causing the worst-case slowdown? Are you sure you weren't simply seeing the effects of the L2 CPU cache being invalidated?

A linked list will thrash the CPU cache constantly, whereas vectors thrash the cache much less since the elements are stored in contiguous memory. For a modern game running on a modern CPU, vectors are almost always the better choice for that reason.

EDIT: I'm not sure your numbers are trustworthy. For example, there's no way it rendered 35859663 frames in 100 seconds. That would be 358,596 FPS.

Assuming your FPS was actually something like 358, not 358,596, that's still an extremely high FPS. When your FPS is so high, it becomes difficult to accurately measure the reasons for worst-case performance. A sudden spike within a single frame could be for any of a dozen reasons, such as CPU cache eviction, a GPU pipeline stall, or some other program on your computer taking CPU resources.

ANTSANTS · on Jan 24, 2015

>Assuming your FPS was actually something like 358, not 358,596, that's still an extremely high FPS. When your FPS is so high, it becomes difficult to accurately measure the reasons for worst-case performance. A sudden spike within a single frame could be for any of a dozen reasons, such as CPU cache eviction, a GPU pipeline stall, or some other program on your computer taking CPU resources.

As an aside, programmers need to stop using FPS and other inverse units of measurement when performance tuning. It's much easier to reason about 2.79~ milliseconds than it is to reason about 358 FPS. For instance, assuming a single threaded application, let's say I know my game simulation alone can be updated at 125FPS, while my rendering code can be updated at 30FPS. How many FPS does my game run at when I run both together? You'd be hard pressed to figure that out without converting the two to 8 and 33⅓ milliseconds per tick, respectively, making the combination of the two obviously 41⅓ milliseconds, and thus about 24.2FPS.

Yes, your comment about performance measurements under lights loads not accurately reflecting performance under greater loads is true, but 2.79 milliseconds really aren't so far from the standard target of 16.66 milliseconds per frame than the 358 vs. 60FPS comparison would make one guess.

falcolas · on Jan 24, 2015

> You can't bust out numbers like that

Nobody is preventing you from writing your own benchmark and "busting out [your own] numbers". :)

That said, yes, my deletion algorithm is naive, and I welcome a pull request to fix it.

https://github.com/garrickp/intrusive_benchmark

> Are you sure you weren't simply seeing the effects of the L2 CPU cache being invalidated

Perhaps, but I doubt it's the cause of the worst case scenario, particularly since I'm doing identical loops over the physics and renderable objects, and as you say the linked list should have the worse performance.

EDIT: One more small point - if the programmer is relying on the order of the items at all, the "tail tuck" method of deletion would not work.

EDIT[2]: Addressing your edit - the numbers are only for the action of applying physics and adding/deleting objects - the code should reflect that. There is no actual rendering time or AI or anything else which would prevent us from seeing upwards of 100,000 "frames" per second.

The times are also fairly stable (the standard deviation of max times is around 20%) between runs.

on Jan 24, 2015

[deleted]

sillysaurus3 · on Jan 24, 2015

Well, your comment led with "Linked lists have more consistent performance than std::vector." But the tests just don't support that conclusion, and the truth is the opposite in a real-world environment: lists will kill your performance.

Before you'll see any downsides of list, you'll have to introduce memory fragmentation into your test and ensure that your list's allocations end up all over your address space, like they would be in a real engine. At that point you'll see a huge slowdown compared to vector due to CPU cache thrashing.

sillysaurus3 · on Jan 24, 2015

I apologize that my comment probably came off as a bit too upfront. I need to work on that.

I was only saying that it isn't really a good idea to present numbers like that without having a clear idea of what's causing the worst-case performance. Glancing over the code, and assuming the first bullet fired is always the first one to impact the ground, it looks like the code is always going to run std::vector::erase on the first element of the vector, which will always shift the 499 remaining elements. It's possible that's the reason for the worst-case performance, rather than vector being inherently a worse choice than list.

It was very counterintuitive and weird to realize that vector is almost always a better choice than list due to how powerful the CPU caches are. CPU caching is so good that taking full advantage of caching will usually outweigh algorithmic tweaks, like whether you're using a list or a vector. So the power of vector is that its elements are in contiguous memory, whereas the downfall of list is that they usually aren't.

It's difficult to prove that's true except in real-world scenarios. Unless a test is designed to introduce memory fragmentation, the cache performance penalty of using list won't really show up in the results. All of the list allocations will be pretty much contiguous in a test environment, so the tests will show performance on par with vector. But in the real engine, you'll be left scratching your head why everything's running so slowly, when the answer is because the elements are sprayed all across your memory space and it's causing the CPU cache to stall when they're accessed...

EDIT:

There is no actual rendering time or AI or anything else which would prevent us from seeing upwards of 100,000 "frames" per second.

When your FPS is higher than, say, 1,000, it becomes very difficult to pin down the reasons for performance fluctuations. An engine under load behaves differently than an artificial test, so you'll need to measure the performance impact in a real engine, especially when the framerates in the test are so high.

Your monitor is capped at 70 fps, or some high-end monitors can do ~130. Something like 100,000 fps isn't an accurate reflection of the behavior of the real system. Most of the discrepancy is due to memory fragmentation causing CPU thrashing, but your specific test might be running into some other discrepancies, like:

- clock() might not be a high resolution timer, so the worst-case timing might be due to discrepancies in the clock mechanism itself, especially at 100,000+ FPS. See http://stackoverflow.com/questions/6749621/high-resolution-t...

- since your test is running for 100+ seconds and your timing measurements are taking into account an entire frame, some frames are going to run interrupts. If an interrupt takes a long time to process, that load will show up in your timing measurements. The result is that vector (or list) will be unfairly pinned as the reason for the performance issue, when interrupts were the cause.

I'm not confident about the second one, but the point is that there are all kinds of conflating variables.

falcolas · on Jan 24, 2015

> So the power of vector is that its elements are in contiguous memory

In either case, you'll end jumping all over memory for the actual items in either a std::vec or a linked list, since they are both holding pointers to the actual objects (and based on how the nodes are being created, the linked list containers are closer in memory to the nodes).

That said, my premise was not that linked lists were faster to iterate over, it's that they offer a consistent addition/removal time, which is frequently better for a game engine than access with huge deviations.

There are many games out there which boast 100+ FPS on my gaming computer. However, they still look like crap because they will have intermittent drops to 1 or even .5 fps. I (and many others) would prefer a constant 50 (or 30) FPS than a stuttering 140.

Real time OSes offer the same commitments: constant gaps between program resumes, even if that constant delay is large.

sillysaurus3 · on Jan 24, 2015

Okay, there's some kind of failure to communicate, and the frustration level is rising. The failure is probably on my end, so I apologize.

In either case, you'll end jumping all over memory for the actual items in either a std::vec or a linked list, since they are both holding pointers to the actual objects (and based on how the nodes are being created, the linked list containers are closer in memory to the nodes). That said, my premise was not that linked lists were faster to iterate over, it's that they offer a consistent addition/removal time, which is frequently better for a game engine than access with huge deviations.

CPU caching doesn't work like that. When you access any pointer, the CPU cache will automatically cache 256 bytes from the access address. So if your object is at address "x", everything from [x, x+256) will be put into the CPU cache.

Your Projectile class contains 6 floats, so each Projectile is 24 bytes. When you're using std::vector, all of your projectiles are stored contiguously in memory. That means when you access a projectile, you'll automatically cache (256 bytes / sizeof(Projectile) - 1) = the next 9 projectiles. So you get a bunch of performance for free, simply by using vector.

When you use list, you lose all of that caching. The reason is because list's memory allocations are all over the place. When you use list, the memory region [x, x+256) is solely that one projectile. That means your CPU will end up caching a bunch of irrelevant bytes of memory. That also means the CPU will have to cache each projectile individually as they're accessed. You're trading automatic caching of 10 projectiles for having to cache each individual projectile. Since taking advantage of CPU caching gives you 10x or 100x speedups, that's a very bad tradeoff.

Addition/removal from a container is a tiny, tiny fraction of tininess of what a modern game engine does. It's so tiny that add/removal performance is almost always irrelevant unless you're adding/removing a huge number of elements every frame, which is rare. Iteration performance is almost the only thing that matters for a game engine, because every single frame you'll be iterating over hundreds of containers. If each iteration causes your CPU cache to thrash, then you're in for a world of pain.

The takeaway is that list is an awful choice for any realtime game, and it's dangerous to give the impression that it's better. If someone were to take your tests at face value and start using list in their own game engines, a few months later they're going to be wondering why their performance is so terrible and why their profilers aren't showing the cause. As far as performance characteristics go, memory fragmentation / CPU caching matter so much that discounting them is a recipe for disaster.

There are many games out there which boast 100+ FPS on my gaming computer. However, they still look like crap because they will have intermittent drops to 1 or even .5 fps. I (and many others) would prefer a constant 50 (or 30) FPS than a stuttering 140.

Agreed! But list vs vector isn't the cause. A lot of the stuttering has to do with loading resources from disk, flushing the GPU pipeline, or other I/O intensive actions. On a modern CPU, add/removals from containers aren't the reason for the slowdown.

Either way, good luck, and keep on making cool stuff.

quotemstr · on Jan 24, 2015

> When you're using std::vector, all of your projectiles are stored contiguously in memory

The OP's point is that we're not talking about

  std::vector<projectile>

but about

  std::vector<projectile*>

The latter thrashes the cache, but it's necessary of an object needs to belong to several lists at the same time.

In either case, enlarging the vector also has a significant cost.

sillysaurus3 · on Jan 24, 2015

That's true, but in modern game engines, objects of the same type share an allocation pool. So a bunch of Projectiles should still be in contiguous memory, since they're using the same pool allocator.

At that point, vector is still a better choice because it prevents the CPU from accessing a bunch of intermediate list nodes which will pollute the CPU cache. If you use a vector, there are no intermediate objects, so less stuff winds up in the cache. There's also less memory fragmentation overall.

As long as you're using a vector implementation that grows by a factor of 2 every time it expands, then there shouldn't be any significant cost of enlarging the vector. If it's just storing pointers, it won't invoke any copy constructors. It can even be optimized to a straight-line memcpy, which is blazingly fast on a modern CPU.

quotemstr · on Jan 24, 2015

> That's true, but in modern game engines, objects of the same type share an allocation pool.

I don't think pool allocation guarantees memory contiguity: what if we have a pool with two elements free, those elements being the first and last ones, and we allocate two elements and stick pointers to them in our vector?

> a vector implementation that grows by a factor of 2....shouldn't be any significant cost

It depends on what you mean by "cost". I agree that the enlargement cost is acceptable from an overall CPU budget sense, but each enlargement can still cause a latency spike, since the cost for enlargement is still large (especially considering the cost of taking the heap lock) for _that_ insertion. Sometimes consistency is more important than throughput.

> vector is still a better choice because it prevents the CPU from accessing a bunch of intermediate list nodes which will pollute the CPU cache

Who said anything about intermediate list nodes? The nice thing about an intrusive linked list is that once you pay for accessing an element's cache line, the rest of the accesses are nice and local. There are no intermediate nodes, and all costs are smooth and consistent. Besides: you can issue a memory prefetch instruction for node B while you process node A, further hiding latency.

While a vector is value types might be best in most situations, if you can't use that, an intrusive linked list may or may not be better than a vector of pointers.

sillysaurus3 · on Jan 24, 2015

Good points! I love tech chats like this. There are a few caveats:

A pool should be trying to make new allocations as contiguous as possible. That's accomplished by wrapping the allocations. For example, let's say we have a pool with slots A, B, C, D, E. We allocate A through D, then C is later freed. The pool shouldn't put the next allocation at C. It should be placed at E. That way, the objects are still roughly sorted by creation time: A is younger than B, which is younger than D, and so on.

(The next allocation should go in C after that, because there are no more slots. But by that time D and E might have been freed, so it's still in sorted order.)

That way, if you access any given element, it will cache the next element in memory. Due to the sorted nature of the allocations, that means even if you're iterating over them using some other container like a list or vector of pointers, accessing an element will still likely cache the next iterator's element. Not guaranteed, but likely.

About the vector enlargement: Usually the cost is nil, because as long as the vector stays under 1MB, you can simply preallocate large blocks of memory at startup for each type of vector. Therefore the cost of expansion is just a few arithmetic instructions for bookkeeping purposes, not anything expensive like a heap lock.

Good point about the intrusive list. I personally wouldn't like it because it makes the code more complicated, but as long as you're using a pool allocator, the memory characteristics of an intrusive list shouldn't be too different from the scheme I described above.

flohofwoe · on Jan 24, 2015

Dynamic arrays like std::vector must be tweaked for a special use case as well, for instance it is quite essential to pre-allocate the array to a capacity of the expected number of entries using std::vector::reserve() so that no excessive re-allocation takes place while the array grows. Also special care must be taken if the objects have an expensive copy operation (that's where the new move-semantic in C++11 may help a lot).

sillysaurus3 · on Jan 24, 2015

It's not so much that dynamic arrays are the problem as std::vector specifically is the problem. Many game engines implement their own version of vector which is designed to grow by power-of-2. That way if you add 100 elements, the vector has already reserved 128 slots. If you add 129, it'll automatically reserve 256. Etc. Automatic power-of-2 growth turns out to have very nice performance and memory characteristics.

std::vector itself might work that way, but since the implementation can be different on different platforms, it's usually a better idea to write your own when performance matters.

barrkel · on Jan 24, 2015

I doubt there are any vector implementations that don't use exponential growth. Without exponential growth, push_back doesn't have constant time, and adding n elements doesn't take O(n) time. Since push_back is required to take amortized constant time, an implementation without exponential growth (or some kind of magic) wouldn't be standard conforming.

Whether the exponential growth is in the form of doubling or not is just a constant factor. There's nothing magical about using powers of two. Depending on the accounting used by the allocator you're using, there may be synergies, or there might not be - some allocators require a few bytes more than the allocation request, and this requirement might inflate the underlying memory request more than necessary.

DCoder · on Jan 25, 2015

> I doubt there are any vector implementations that don't use exponential growth.

Trivia fact time: Westwood Studios wrote their own vector implementation for the original Command&Conquer, and reused it for later games in the same series, until Renegade/Generals. That implementation uses linear growth, capacity grows by 10 elements each time.

barrkel · on Jan 25, 2015

Indeed; however that's not conforming to the standard, so it is not an implementation of std::vector ;)

sebleblanc · on Jan 25, 2015

But it is an implementation, albeit a non-standard one.

sillysaurus3 · on Jan 24, 2015

Whether the exponential growth is in the form of doubling or not is just a constant factor. There's nothing magical about using powers of two.

Actually, powers-of-two reduce memory fragmentation for all objects of the same size. Since memory fragmentation is one of the central causes of performance problems in modern large projects, it's really important. Firefox spent a bunch of time realizing their performance problems were due to an incredible amount of fragmentation.

Exponential growth gives you most of the benefit, but specifically growing by a factor of 2 gives you the rest.

barrkel · on Jan 25, 2015

You're not saying anything I don't already know.

If the overhead of an allocation is h bytes, and your allocator stores it in e.g. the prefix of the memory allocation, then all your allocations will be for 2x+h. And this will not have nice effects for your memory fragmentation. In this case, you'd be better off allocating 2x-h each time.

Of course, modern allocators tend to store accounting information using separately reserved arrays of records to avoid exactly this problem. But if you use a language like C# or Java, you come back to around to this problem. Without using a double indirection for array data, object header and array length are stored contiguously with array data. So allocating an array with a power of two will again not be a power of two for the underlying allocator. Much less of a problem with precise GC, but not all GC'd languages use precise moving GC, and some languages let you pin arrays for interop with C. So it's something to be aware of.

flohofwoe · on Jan 24, 2015

I agree! It's just that C++ people from outside the gaming industry always give me the evil look when I start ranting about how useless the std containers actually are ;)

jokoon · on Jan 24, 2015

already read that article about ILL, did not make much sense to me.

Where do you explain why the vector is slower ?

falcolas · on Jan 24, 2015

I'm not saying they're slower, I'm saying the performance is more consistent.

99.9% of the time the std::vec implementation will out preform the linked list. However, that 0.1% seems to be wildly worse than the linked list. This is not a problem for most programs, however in gaming engines that hiccup shows up as a visible hitch in the framerate.

jokoon · on Jan 24, 2015

So to you, that 0.1% is specific to gaming engines, how ? I don't really believe it.

I think "that 0.1%" demonstrates that linked list are really irrelevant and that vectors should be used all the time.

The only useful advantage of linked lists is to be able to insert an element in the middle of the list. I don't understand why anyone would want to insert data in the middle of a list, you rarely want to do that in O(1), and even if you do, you still need to know where exactly you want to insert it.

If you want to have data sorted, use map<>, if you want a faster map, use a hash map.

Maybe lists are relevant for vertex geometry, because it might be interesting to quickly edit and add vertices, but it's such a specific case I don't think it matters. Nobody really edits vertex geometry. You could even just build an index vector instead.

falcolas · on Jan 24, 2015

I don't believe it's problem endemic only to game engines, but game engines are particularly vulnerable to such latency. Google reveals many examples of how hitching causes problems for players.

I'm not going to claim that memory is the primary, or even a remotely major cause of such hitching, but it's certainly not going to help when you have to spend one of the 16 milliseconds given to render a frame on memory allocation and re-ordering.