
Tough Times on the Road to StarCraft - striking
http://www.codeofhonor.com/blog/tough-times-on-the-road-to-starcraft
======
jokoon
What's the point of using linked lists? Low memory? I'll never understand
why the usage of linked lists is so common.

I mean, you don't have fast random access with a linked list; aren't hash maps
just better for fast insertion/deletion? Object pools are pretty great too.

I'll never forget the day a classmate in some game programming school argued
that linked lists were like vectors (std::vector).

Linked lists only seem viable for insertions; they're not viable for fast
deletion since you need to search for the pointer first. Maybe you need to
index linked lists instead, but since the heap already allows itself to be
fragmented the way a linked list can, wouldn't it be better to just use
vectors as often as possible so things are more cache friendly?

~~~
falcolas
Linked lists have more consistent performance than std::vector. And taking
half a second to render one frame when you're averaging 100 FPS is way worse
than a consistent 50 FPS.

There was recently an article about intrusive linked lists, and this same
question came up. I took the Unreal linked list library, wrote a similar
std::vector implementation, and tested it. The test was simple: fire 50,000
bullets at a rate of 500 rounds per second, and remove them when they fall 2
meters. They existed in 3 separate lists: physics objects, renderable objects,
and all objects. When they hit the ground, they were removed from all lists.
Capture the clock ticks to generate each "frame", and do a few calculations on
them at the end.

The average and median operation times for std::vector were indeed a fair bit
faster (at -O3; with no optimizations the ILL implementation was faster).
However, the worst case for std::vector was... much worse. More than 1000x
worse in some runs. One test run showed the worst case taking over two
milliseconds for a single frame of just manipulating vectors!

    
    
        $ ./benchmark_ill
        Minimum: 0.000000
        Maximum: 0.000094
        Average: 0.000005
        Median: 0.000005
        Total Frames Rendered: 18968118
        Total Run Time: 100.663895
    
        $ ./benchmark_v
        Minimum: 0.000000
        Maximum: 0.004813
        Average: 0.000002
        Median: 0.000002
        Total Frames Rendered: 35859663
        Total Run Time: 100.664009
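
For anyone unfamiliar with the term, "intrusive" means the links live inside
the object itself rather than in separately allocated nodes. A simplified
sketch of the idea (not the Unreal container or my actual benchmark code):

    struct Projectile {
        float px, py, pz, vx, vy, vz;
        Projectile* prev = nullptr;   // links embedded in the object itself
        Projectile* next = nullptr;   // (one pair per list it belongs to)
    };

    // Unlink in O(1): no search and no separate node allocation to free.
    void unlink(Projectile*& head, Projectile* p) {
        if (p->prev) p->prev->next = p->next;
        else         head = p->next;
        if (p->next) p->next->prev = p->prev;
        p->prev = p->next = nullptr;
    }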

~~~
sillysaurus3
Wait a sec. You can't bust out numbers like that without being clear about a
few key points:

1. What was std::vector's removal algorithm like? Are you sure that erase()
wasn't shifting all of the remaining elements in the vector? The "correct"
removal algorithm in that case looks like this: swap the target element to the
end, and then decrease the vector's size by 1 (see the sketch below). But
std::vector's erase doesn't work like that, so of course it will have terrible
worst-case performance if it was shifting all elements of the array when you
remove the first one.

2. What's causing the worst-case slowdown? Are you sure you weren't simply
seeing the effects of the L2 CPU cache being invalidated?
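
For clarity, the swap-and-pop removal I mean in point 1 looks roughly like
this (a sketch with a made-up Projectile type, not your benchmark code):

    #include <cstddef>
    #include <utility>
    #include <vector>

    struct Projectile { float px, py, pz, vx, vy, vz; };

    // O(1) removal: overwrite the dying element with the last one, then
    // shrink the vector. Element order is not preserved.
    void swap_and_pop(std::vector<Projectile>& v, std::size_t i) {
        v[i] = std::move(v.back());
        v.pop_back();
    }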

A linked list will thrash the CPU cache constantly, whereas vectors thrash the
cache much less since the elements are stored in contiguous memory. For a
modern game running on a modern CPU, vectors are almost always the better
choice for that reason.

EDIT: I'm not sure your numbers are trustworthy. For example, there's no way
it rendered 35859663 frames in 100 seconds. That would be 358,596 FPS.

Assuming your FPS was actually something like 358, not 358,596, that's still
an extremely high FPS. When your FPS is so high, it becomes difficult to
accurately measure the reasons for worst-case performance. A sudden spike
within a single frame could be for any of a dozen reasons, such as CPU cache
eviction, a GPU pipeline stall, or some other program on your computer taking
CPU resources.

~~~
falcolas
> You can't bust out numbers like that

Nobody is preventing you from writing your own benchmark and "busting out
[your own] numbers". :)

That said, yes, my deletion algorithm is naive, and I welcome a pull request
to fix it.

[https://github.com/garrickp/intrusive_benchmark](https://github.com/garrickp/intrusive_benchmark)

> Are you sure you weren't simply seeing the effects of the L2 CPU cache being
> invalidated

Perhaps, but I doubt it's the cause of the worst-case scenario, particularly
since I'm doing identical loops over the physics and renderable objects, and
as you say the linked list should have the worse performance there.

EDIT: One more small point - if the programmer is relying on the order of the
items at all, the "tail tuck" method of deletion would not work.

EDIT[2]: Addressing your edit - the numbers are only for the action of
applying physics and adding/deleting objects - the code should reflect that.
There is no actual rendering time or AI or anything else which would prevent
us from seeing upwards of 100,000 "frames" per second.

The times are also fairly stable (the standard deviation of max times is
around 20%) between runs.

~~~
sillysaurus3
I apologize that my comment probably came off as a bit too blunt. I need to
work on that.

I was only saying that it isn't really a good idea to present numbers like
that without having a clear idea of what's causing the worst-case performance.
Glancing over the code, and assuming the first bullet fired is always the
first one to impact the ground, it looks like the code is always going to run
std::vector::erase on the first element of the vector, which will always shift
the 499 remaining elements. It's possible that's the reason for the worst-case
performance, rather than vector being inherently a worse choice than list.

It was very counterintuitive and weird to realize that vector is almost always
a better choice than list due to how powerful the CPU caches are. CPU caching
is so good that taking full advantage of caching will usually outweigh
algorithmic tweaks, like whether you're using a list or a vector. So the power
of vector is that its elements are in contiguous memory, whereas the downfall
of list is that they usually aren't.

It's difficult to prove that's true except in real-world scenarios. Unless a
test is designed to introduce memory fragmentation, the cache performance
penalty of using list won't really show up in the results. All of the list
allocations will be pretty much contiguous in a test environment, so the tests
will show performance on par with vector. But in the real engine, you'll be
left scratching your head wondering why everything's running so slowly, when
the answer is that the elements are sprayed all across your memory space and
it's causing the CPU cache to stall when they're accessed...

EDIT:

 _There is no actual rendering time or AI or anything else which would prevent
us from seeing upwards of 100,000 "frames" per second._

When your FPS is higher than, say, 1,000, it becomes very difficult to pin
down the reasons for performance fluctuations. An engine under load behaves
differently than an artificial test, so you'll need to measure the performance
impact in a real engine, especially when the framerates in the test are so
high.

Your monitor is capped at around 70 fps, and even high-end monitors only do
~130. Something like 100,000 fps isn't an accurate reflection of the behavior
of the real system. Most of the discrepancy is due to memory fragmentation
causing CPU thrashing, but your specific test might also be running into other
confounding factors, like:

\- clock() might not be a high-resolution timer, so the worst-case timing
might be due to quirks in the clock mechanism itself, especially at 100,000+
FPS (see the timing sketch at the end of this comment). See
[http://stackoverflow.com/questions/6749621/high-resolution-timer-in-linux](http://stackoverflow.com/questions/6749621/high-resolution-timer-in-linux)

\- since your test runs for 100+ seconds and your timing measurements cover an
entire frame, some frames are going to be hit by interrupts. If an interrupt
takes a long time to process, that load will show up in your timing
measurements. The result is that vector (or list) will be unfairly pinned as
the reason for the performance issue, when interrupts were the cause.

I'm not confident about the second one, but the point is that there are all
kinds of conflating variables.
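
For what it's worth, here's a sketch of per-frame timing with a monotonic,
high-resolution clock instead of clock() (illustrative only; simulate_frame
stands in for the per-frame work and isn't from your code):

    #include <chrono>

    // steady_clock is monotonic and usually has far better resolution than
    // clock(), which can be as coarse as several milliseconds.
    using Clock = std::chrono::steady_clock;

    double time_one_frame(void (*simulate_frame)()) {
        const auto start = Clock::now();
        simulate_frame();                                // per-frame work
        const std::chrono::duration<double> dt = Clock::now() - start;
        return dt.count();                               // elapsed seconds
    }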

~~~
falcolas
> So the power of vector is that its elements are in contiguous memory

In either case, you'll end up jumping all over memory for the actual items
with either a std::vector or a linked list, since they are both holding
pointers to the actual objects (and based on how the nodes are being created,
the linked list containers are closer in memory to the nodes).

That said, my premise was not that linked lists were faster to iterate over;
it's that they offer consistent addition/removal times, which is frequently
better for a game engine than access with huge deviations.

There are many games out there which boast 100+ FPS on my gaming computer.
However, they still look like crap because they have intermittent drops to 1
or even 0.5 fps. I (and many others) would prefer a constant 50 (or 30) FPS to
a stuttering 140.

Real-time OSes offer the same kind of commitment: consistent gaps between
program resumptions, even if that constant delay is large.

~~~
sillysaurus3
Okay, there's some kind of failure to communicate, and the frustration level
is rising. The failure is probably on my end, so I apologize.

 _In either case, you'll end up jumping all over memory for the actual items
with either a std::vector or a linked list, since they are both holding
pointers to the actual objects (and based on how the nodes are being created,
the linked list containers are closer in memory to the nodes). That said, my
premise was not that linked lists were faster to iterate over; it's that they
offer consistent addition/removal times, which is frequently better for a game
engine than access with huge deviations._

CPU caching doesn't work like that. When you access any pointer, the CPU cache
will automatically cache 256 bytes from the access address. So if your object
is at address "x", everything from [x, x+256) will be put into the CPU cache.

Your Projectile class contains 6 floats, so each Projectile is 24 bytes. When
you're using std::vector, all of your projectiles are stored contiguously in
memory. That means when you access a projectile, you'll automatically cache
(256 bytes / sizeof(Projectile) - 1) = the next 9 projectiles. So you get a
bunch of performance for free, simply by using vector.

When you use list, you lose all of that caching. The reason is that list's
memory allocations are all over the place. When you use list, the memory
region [x, x+256) is solely that _one_ projectile. That means your CPU will
end up caching a bunch of irrelevant bytes of memory. It also means the CPU
has to cache _each projectile individually_ as it's accessed. You're trading
automatic caching of 10 projectiles for having to cache each individual
projectile. Since taking advantage of CPU caching gives you 10x or 100x
speedups, that's a very bad tradeoff.
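
To make the layout difference concrete, here's a sketch (same 6-float
Projectile; the real engine types will obviously differ):

    #include <list>
    #include <vector>

    struct Projectile { float px, py, pz, vx, vy, vz; };   // 6 floats = 24 bytes

    // Contiguous: element i+1 sits right after element i in memory, so one
    // cached block covers several projectiles at once.
    std::vector<Projectile> projectiles_vec;

    // Node-based: every element is its own heap allocation, so each step of
    // the iteration can land on a cold, far-away address.
    std::list<Projectile> projectiles_list;

    float sum_heights(const std::vector<Projectile>& v) {
        float sum = 0.0f;
        for (const Projectile& p : v) sum += p.py;          // cache-friendly sweep
        return sum;
    }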

Addition/removal from a container is a tiny, tiny fraction of what a modern
game engine does. It's so tiny that add/removal performance is almost always
irrelevant unless you're adding/removing a huge number of elements every
frame, which is rare. Iteration performance is almost the only thing that
matters for a game engine, because every single frame you'll be iterating over
hundreds of containers. If each iteration causes your CPU cache to thrash,
then you're in for a world of pain.

The takeaway is that list is an _awful_ choice for any realtime game, and it's
dangerous to give the impression that it's better. If someone were to take
your tests at face value and start using list in their own game engines, a few
months later they're going to be wondering why their performance is so
terrible and why their profilers aren't showing the cause. As far as
performance characteristics go, memory fragmentation / CPU caching matter so
much that discounting them is a recipe for disaster.

 _There are many games out there which boast 100+ FPS on my gaming computer.
However, they still look like crap because they have intermittent drops to 1
or even 0.5 fps. I (and many others) would prefer a constant 50 (or 30) FPS to
a stuttering 140._

Agreed! But list vs vector isn't the cause. A lot of the stuttering has to do
with loading resources from disk, flushing the GPU pipeline, or other I/O
intensive actions. On a modern CPU, add/removals from containers aren't the
reason for the slowdown.

Either way, good luck, and keep on making cool stuff.

~~~
quotemstr
> When you're using std::vector, all of your projectiles are stored
> contiguously in memory

The OP's point is that we're not talking about

    
    
      std::vector<projectile>
    

but about

    
    
      std::vector<projectile*>
    

The latter thrashes the cache, but it's necessary if an object needs to belong
to several lists at the same time.

In either case, enlarging the vector also has a significant cost.

~~~
sillysaurus3
That's true, but in modern game engines, objects of the same type share an
allocation pool. So a bunch of Projectiles should still be in contiguous
memory, since they're using the same pool allocator.

At that point, vector is still a better choice because it prevents the CPU
from accessing a bunch of intermediate list nodes which will pollute the CPU
cache. If you use a vector, there are no intermediate objects, so less stuff
winds up in the cache. There's also less memory fragmentation overall.

As long as you're using a vector implementation that grows by a factor of 2
every time it expands, there shouldn't be any significant cost to enlarging
the vector. If it's just storing pointers, it won't invoke any copy
constructors. It can even be optimized to a straight-line memcpy, which is
blazingly fast on a modern CPU.

~~~
quotemstr
> That's true, but in modern game engines, objects of the same type share an
> allocation pool.

I don't think pool allocation guarantees memory contiguity: what if we have a
pool with two elements free, those elements being the first and last ones, and
we allocate two elements and stick pointers to them in our vector?

> a vector implementation that grows by a factor of 2....shouldn't be any
> significant cost

It depends on what you mean by "cost". I agree that the enlargement cost is
acceptable from an overall CPU budget sense, but each enlargement can still
cause a latency spike, since the cost for enlargement is still large
(especially considering the cost of taking the heap lock) for _that_
insertion. Sometimes consistency is more important than throughput.

> vector is still a better choice because it prevents the CPU from accessing a
> bunch of intermediate list nodes which will pollute the CPU cache

Who said anything about intermediate list nodes? The nice thing about an
intrusive linked list is that once you pay for accessing an element's cache
line, the rest of the accesses are nice and local. There are no intermediate
nodes, and all costs are smooth and consistent. Besides: you can issue a
memory prefetch instruction for node B while you process node A, further
hiding latency.
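
Something like this, as a sketch (__builtin_prefetch is the GCC/Clang
intrinsic; Node is a made-up type):

    struct Node {
        Node* next;
        float payload;
    };

    float sum_list(const Node* head) {
        float sum = 0.0f;
        for (const Node* n = head; n != nullptr; n = n->next) {
            if (n->next)
                __builtin_prefetch(n->next);   // start fetching node B...
            sum += n->payload;                 // ...while we process node A
        }
        return sum;
    }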

While a vector of value types might be best in most situations, if you can't
use that, an intrusive linked list may or may not be better than a vector of
pointers.

~~~
sillysaurus3
Good points! I love tech chats like this. There are a few caveats:

A pool should be trying to make new allocations as contiguous as possible.
That's accomplished by wrapping the allocations. For example, let's say we
have a pool with slots A, B, C, D, E. We allocate A through D, then C is later
freed. The pool shouldn't put the next allocation at C. It should be placed at
E. That way, the objects are still roughly sorted by creation time: A is older
than B, which is older than D, and so on.

(The next allocation should go in C after that, because there are no more
slots. But by that time D and E might have been freed, so it's still in sorted
order.)

That way, if you access any given element, it will cache the next element in
memory. Due to the sorted nature of the allocations, that means even if you're
iterating over them using some other container like a list or vector of
pointers, accessing an element will still _likely_ cache the next iterator's
element. Not guaranteed, but likely.
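
A tiny sketch of that allocation policy (illustrative only; it ignores object
construction, growth, and error handling):

    #include <cstddef>

    // Fixed-size pool that always allocates at a forward-moving cursor and
    // wraps around, instead of immediately reusing the most recently freed
    // slot, so live objects stay roughly sorted by creation time.
    template <typename T, std::size_t N>
    class RollingPool {
        alignas(T) unsigned char storage_[N * sizeof(T)];
        bool used_[N] = {};
        std::size_t cursor_ = 0;

    public:
        T* allocate() {                  // returns raw storage for placement-new
            for (std::size_t i = 0; i < N; ++i) {
                const std::size_t slot = (cursor_ + i) % N;
                if (!used_[slot]) {
                    used_[slot] = true;
                    cursor_ = (slot + 1) % N;
                    return reinterpret_cast<T*>(storage_ + slot * sizeof(T));
                }
            }
            return nullptr;              // pool exhausted
        }

        void release(T* p) {
            const std::size_t slot =
                (reinterpret_cast<unsigned char*>(p) - storage_) / sizeof(T);
            used_[slot] = false;         // note: the cursor does not move back
        }
    };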

About the vector enlargement: Usually the cost is nil, because as long as the
vector stays under 1MB, you can simply preallocate large blocks of memory at
startup for each type of vector. Therefore the cost of expansion is just a few
arithmetic instructions for bookkeeping purposes, not anything expensive like
a heap lock.
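
In code that can be as simple as one reserve() call at startup (a sketch; the
50,000 figure is just the bullet count from the benchmark above):

    #include <vector>

    struct Projectile { float px, py, pz, vx, vy, vz; };

    std::vector<Projectile*> physics_objects;

    void init_containers() {
        // One up-front allocation; after this, push_back never reallocates,
        // so there's no heap lock or element copy during gameplay.
        physics_objects.reserve(50000);
    }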

Good point about the intrusive list. I personally wouldn't like it because it
makes the code more complicated, but as long as you're using a pool allocator,
the memory characteristics of an intrusive list shouldn't be too different
from the scheme I described above.

------
AdeptusAquinas
Very interesting article, if horrifying :)

As someone whose only experience is in corporate service IT development, from
the outside the game dev industry looks incredibly cowboy-ish; it would be
nice to read a story where a team built a game, met their deadlines, had no
crunch time, followed excellent coding standards (CI, unit tests, peer dev,
etc.), and maybe even used a modern language (perhaps, heretically, a non-
native language like C#!)

~~~
starmole
Interesting perspective! You bring up a lot of different points:

\- There are tons of positive "post mortem" write-ups on Gamasutra
[[http://www.gamasutra.com/features/postmortem/](http://www.gamasutra.com/features/postmortem/)],
but the ones with problems are usually more valuable. One recent example of
"doing it right" I remember was the XCOM: Enemy Unknown remake. Check out
their GDC talks!

\- One thing that makes games special is pushing the envelope. Yes, you can
make a game on a very solid stack. But it won't stand out.

\- Game coders also think service people are mad: how can you use a database
that you do not fully understand? You don't even have the source? Using a
framework or library or cloud that you do not 100% understand seems "cowboy"
to them.

\- Games are "fire and forget". That means technical debt is very different.
It is bad at the start of a 3 year cycle, but at the end it is totally ok.

\- Not mentioned by you but important to point out: while working on games is
great for learning and unlike anything else, it is also a terrible industry
for coders. Insane crunch time. Low pay.

~~~
ido

        - Games are "fire and forget". That means technical debt
        is very different. It is bad at the start of a 3 year
        cycle, but at the end it is totally ok.
    

There is also a big difference between AAA, online, mobile, social, casual,
etc. F2P games, for example, are expected to live and be actively
supported/updated for many years if successful.

Clash of Clans was released 2.5 years ago and will likely continue to get
active support for many more years. FarmVille was released almost 5 years ago.
WoW more than 10, for a different kind of game that also isn't fire and
forget.

They get a lot of flak for other reasons, but engineering-wise, companies like
Zynga get a lot of stuff right.

~~~
starmole
You are absolutely right. I was intentionally polarizing. But thinking about
it more, one thing still stands out: even long-running MMOs are planned "for
the next 10 years". There is an end in sight. Facebook, Google, Uber, Bitcoin,
most startups - it's open-ended! I don't think it's easy to say one approach
is better, but it is useful to understand that they are fundamentally
different.

Personally as an engineer I like the coding approach of game dev a lot more.

------
EvanAnderson
Prior discussion:
[https://news.ycombinator.com/item?id=4491216](https://news.ycombinator.com/item?id=4491216)

------
gaius
_The team was incredibly invested in the project, and put in unheard of
efforts to complete the project while sacrificing personal health and family
life._

I wonder if this is "invested" as in, had a substantial equity stake and it
was worth the sacrifices of health and relationships because they and their
families were set up for life afterwards and retired to a beach in Malibu, or
some other meaning...

~~~
wingerlang
> .. because they and their families were setup for life afterwards and
> retired to a beach in Malibu ..

And what if the game had flopped? Even a success doesn't guarantee the above,
does it?

~~~
gaius
Sure, but they weren't taking that chance were they? They were just being
exploited.

------
soup10
Ironically, the horrible pathfinding gave the game much of its charm and
interesting micro mechanics. The default attack-move would cause the units to
converge into a single-file march... which is suicide against a grouped enemy
or a defended position.

------
zzleeper
Not that it's not interesting, but a [2012] next to the title would have been
helpful

~~~
angersock
Not really, as the article is covering a project near 20 years old.

Who gives a shit when it was written?

------
loteck
Previous discussions of this entertaining series from 2012 here:

[https://news.ycombinator.com/item?id=4491216](https://news.ycombinator.com/item?id=4491216)

[https://news.ycombinator.com/item?id=4582123](https://news.ycombinator.com/item?id=4582123)

[https://news.ycombinator.com/item?id=5252003](https://news.ycombinator.com/item?id=5252003)

------
hybridtupel
Does anybody know what happened with part three? Was it ever published?

------
realrocker
Really like the storytelling style of the article. I was wondering where I
could find a compilation of similar "programming tales"? Apart from
[http://www.folklore.org/](http://www.folklore.org/), of course.

------
brainburn
He says the maximum number of units is 1600.

If 6 players mind-control a Drone and an SCV and build additional armies, the
total number of units could far exceed 1600. Or what am I missing?

~~~
ygra
You'd get the wonderful error »Cannot create more units«. There was also a
limit on certain sprites that counted towards that. It was not an uncommon
error to see in certain UMS maps that had way too many units running around.

~~~
plorkyeran
You'd also occasionally hit it in Zerg-heavy 8-player melee games, because
larvae and overlords count against the limit without consuming any supply. It
was always pretty hilarious to see a small pack of mutalisks kill a carrier
doom-fleet which couldn't launch any interceptors due to the unit limit.

