
A Journey Through the CPU Pipeline (2013) - nkurz
http://www.gamedev.net/page/resources/_/technical/general-programming/a-journey-through-the-cpu-pipeline-r3115
======
nkurz
I understand some aspects of modern processors fairly well, but there are
still some simple aspects of branch prediction and speculative execution that
leave me bewildered. Perhaps someone here can explain to me what actually
happens in a specific case for Haswell or some other recent processor?

Assume I have a list of ids, and each id is an index into an array of strings.
The strings are 0-terminated, with uniform random lengths of 1-32 bytes. I
want to create an output buffer that consists of the concatenated strings
corresponding to the ids in the list.

The pseudo C code would look like this:

    
    
      char *stringList[NUM_STRINGS];     // strings of length 1-32
      int idList[LIST_LEN];              // initialized to rand() % NUM_STRINGS
      char outputBuffer[LIST_LEN * 33];  // worst case: 32 bytes + 0 per string
      char *output = outputBuffer;       // copy destination
      for (int i = 0; i < LIST_LEN; i++) {
        int id = idList[i];              // id from existing list
        char *string = stringList[id];   // use id to get string
        while ((*output++ = *string++))  // copy string to output,
          ;                              //   including the terminating 0
      }
    

Assume that stringList is large enough that the strings are not in cache, and
thus each string load will incur a ~300 cycle latency. Presumably, the "copy"
loop will be speculatively executed during this latency. But since the string
lengths are random, the termination of this loop can't be well predicted, and
since the data for the copy isn't there yet, I'm not sure exactly what
happens.

Does the processor predict that the string copy loop will never terminate, and
keep buffering string copy instructions until the buffers are full? Which
buffers? Does it stop speculating at some maximum number of branches? If so,
what is this depth? Or does it manage to guess that the inner string copy loop
terminates after some reasonable number of iterations so it can issue the
outer loop load for the next stringList[id]? If so, how many such loads will
it manage to issue?

~~~
brigade
Completely random lengths from 1-32 -> a first-order branch predictor
determines that the branch is usually taken, and generally predicts it as
such. There is likely a second-level predictor that notices that after 16
taken branches it becomes more likely to not be taken, so on average it'll
predict loops of length 16.

The most relevant buffers are (sizes for Haswell):

- 192 µop reorder buffer (ROB)

- 42 entry store buffer

- 72 entry load buffer

So assuming an average length of 16, the processor should issue far enough
ahead to keep ~3 outstanding strings being loaded from memory simultaneously
(3 strings × 16 one-byte stores = 48, more than the 42-entry store buffer.)
Then when a branch mispredict occurs, the recovery is quick, since the needed
cachelines are likely already in L1 (unless the string is longer than average
and crosses into another cacheline.)

If you wanted to speed this up, you could prefetch several (~19) strings ahead
in the list. (300 cycles / 16 stores per string at 1 store per cycle = 18.75.)
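
A minimal sketch of what that might look like, assuming GCC/Clang's
__builtin_prefetch and the setup from the question (PREFETCH_DIST is my name
for the distance):

      #define PREFETCH_DIST 19   // from the 300-cycle / 16-store arithmetic
      for (int i = 0; i < LIST_LEN; i++) {
        // Start the cache miss for a string ~19 iterations before we
        // actually copy it, so the data arrives roughly in time.
        if (i + PREFETCH_DIST < LIST_LEN)
          __builtin_prefetch(stringList[idList[i + PREFETCH_DIST]]);
        char *string = stringList[idList[i]];
        while ((*output++ = *string++))
          ;
      }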

~~~
nkurz
Thanks! Would this mean that if the strings were a constant length longer than
42 (or possibly longer than the longest predictable loop, which Agner says is
"32 or a little more"), the 42 entry store buffer would be the limiting factor
and the next outer load would only be issued after the string begins to be
copied? Or does some limit on the number of outstanding speculative branches
kick in first?

Also, can you point to any further documentation on the behavior of the second
level predictor? Most things I've seen suggest that it matches only exact
patterns, and thus can only predict patterns that happen at least twice in a
row. I haven't been able to find anything about averaging.

~~~
brigade
Hm, I assume that if the strings were long enough that the branch is always
predicted taken, then yeah, it wouldn't speculate to the next string until the
mispredict is found, which won't happen until the data gets into registers.
If, on the other hand, you had Pascal-style strings with a known length up
front instead of null-terminated strings, then the mispredict could be
resolved independently of loading every byte of the string, and it could have
outstanding loads for two strings simultaneously. But only for strings that
span more than one cacheline.
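
A sketch of that Pascal-style variant (pstring and pstringList are
hypothetical names; the point is that the copy length depends only on the
length field, not on the bytes being copied):

      #include <string.h>

      struct pstring {
        unsigned char len;   // length stored ahead of the bytes
        char data[32];
      };

      // Per list entry: the copy's trip count is known as soon as len
      // arrives, so control flow doesn't wait on every byte of data.
      struct pstring *s = pstringList[idList[i]];
      memcpy(output, s->data, s->len);
      output += s->len;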

The only limit for speculation is the OoOE buffers. I abbreviated it above,
but the inner loop is a minimum of 5 µops, so you actually end up hitting the
ROB limit at ~39 iterations, before the store buffer is full. Though some
architectures could reduce it to 4 with a compare+branch instruction. (x86
fuses the compare+branch sequence into one µop, but I'm not sure if it reduces
the ROB entries.)
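
For concreteness, here is one plausible way those 5 µops could break down (my
assumption about typical codegen, not a claim about any specific compiler or
CPU):

      // Hand-expanded copy loop with an assumed µop mapping:
      char c;
      do {
        c = *string;     // µop 1: load byte
        string++;        // µop 2: pointer increment
        *output = c;     // µop 3: store (store-address + store-data,
                         //         micro-fused into one ROB entry)
        output++;        // µop 4: pointer increment
      } while (c != 0);  // µop 5: test + conditional branch, macro-fused
      // 192 ROB entries / 5 µops per iteration ≈ 38-39 iterations in flight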

If you don't hit the ROB limit, then even after you hit the store buffer
limit, the load buffer can take some loads. So assuming 4 µops per iteration,
42 stores and 48 loads would be actively buffered while their cache misses are
resolved.

There's no public documentation on what modern CPUs actually do for branch
prediction; that's why I said "likely" - I'm not claiming that any specific
CPU does precisely what I described, just that it's likely to be better than
"predict the branch as always taken." The more general "hash PC history as
input to branch prediction" is a better way to look at it. How it's hashed and
used is a trade secret.

Actually, thinking about it again, for truly random data the only prediction
improvement over "always taken" is predicting the 32nd consecutive inner
branch as not taken. Predicting not taken before that increases the mispredict
rate...
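
A toy simulation (my own sketch, using its own branch-counting convention
rather than modeling any real predictor) makes that concrete: switching to
not-taken any earlier than the maximum length only adds mispredicts.

      #include <stdio.h>
      #include <stdlib.h>

      /* Strings have uniform random length 1-32; the loop branch is
       * taken once per byte and not taken once at the terminator.
       * cutoff == 0 means "always predict taken"; otherwise predict
       * taken for the first `cutoff` branches and not-taken after. */
      static long simulate(int cutoff, int trials) {
        long mispredicts = 0;
        for (int t = 0; t < trials; t++) {
          int len = rand() % 32 + 1;
          for (int i = 1; i <= len + 1; i++) {
            int actual    = (i <= len);
            int predicted = (cutoff == 0) ? 1 : (i <= cutoff);
            if (actual != predicted) mispredicts++;
          }
        }
        return mispredicts;
      }

      int main(void) {
        int trials = 1000000;
        printf("always taken:       %.3f mispredicts/string\n",
               simulate(0, trials) / (double)trials);   // ~1.000
        printf("not-taken after 16: %.3f mispredicts/string\n",
               simulate(16, trials) / (double)trials);  // ~4.7, much worse
        printf("not-taken after 32: %.3f mispredicts/string\n",
               simulate(32, trials) / (double)trials);  // ~0.97, tiny win
        return 0;
      }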

------
yoklov
Great article.

A similar, but much more in-depth, look at the GPU pipeline is available here
[0] for those interested.

[0]: https://fgiesen.wordpress.com/2011/07/09/a-trip-through-the-graphics-pipeline-2011-index/

------
halayli
Memory ordering can get complicated. Here's further discussion in Intel's
software developer manual, section 8.2:

http://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-software-developer-system-programming-manual-325384.pdf
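
The classic example from that section, sketched with C11 relaxed atomics: both
loads can observe 0, because each core's store may still be sitting in its
store buffer when the following load executes.

      #include <stdatomic.h>

      atomic_int x, y;   // both initially 0
      int r1, r2;

      void cpu0(void) {  // runs on one core
        atomic_store_explicit(&x, 1, memory_order_relaxed);
        r1 = atomic_load_explicit(&y, memory_order_relaxed);
      }

      void cpu1(void) {  // runs concurrently on another core
        atomic_store_explicit(&y, 1, memory_order_relaxed);
        r2 = atomic_load_explicit(&x, memory_order_relaxed);
      }
      // On x86, r1 == 0 && r2 == 0 is an allowed outcome.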

~~~
renox
Cache management is also quite complicated; I was really surprised to read
these:

https://www.reddit.com/r/programming/comments/3p9idq/false_sharing_curiosities/

http://stackoverflow.com/questions/29199779/false-sharing-and-128-byte-alignment-padding

Apparently a 'false sharing'-like effect can happen with the cache line next
to the one you're using.. Uh??
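
The usual mitigation those threads arrive at is padding to two cache lines; a
minimal sketch (NUM_THREADS is a placeholder), assuming 64-byte lines and a
prefetcher that treats lines in 128-byte pairs:

      #include <stdalign.h>

      #define NUM_THREADS 8   // placeholder

      // Pad each thread's hot counter out to 128 bytes (two 64-byte
      // lines) so neighboring counters never share a line pair.
      struct padded_counter {
        alignas(128) long value;
      };

      struct padded_counter counters[NUM_THREADS];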

------
witty_username
> Those workflows are relatively rare; Hyper-Threading usually provides
> consumers with approximately double the speed they would see for their
> everyday computer tasks.

Not really, IIRC it's more like 30-50%.

------
joseraul
Section "An Example" in the middle of the page gives an original description
of the pipeline from the point of view of an instruction.

