# Building a branch predicted machine

06 Jan 2014Last time I signed off with a question: Given a fetch stage which runs many cycles ahead of the processor’s fully evaluated steady state, how can we ensure that the fetch stage is filling the pipeline with useful instructions and not just wasting time filling the processor with crap that will have to be thrown away?

In order for such a thing to be possible, we must somehow be able to predict the state which the processor will reach after performing computations, without performing the requisite computation. We must, quite literally, predict the future. The accuracy with which we can do this will dramatically impact the performance of our processors by ensuring that we waste as few cycles as possible.

What makes us waste cycles in a pipelined processor design? We have
value hazards which we already partially dealt with, and we have
control hazards. Control hazard is the term for instructions fetched
*after* a control flow (also known as a conditional or branch)
instruction is fetched which will impact whether the instructions
should be executed or not. Say we define the “if*” set of instructions
to execute the N+1 instruction if and only if the predicate is true,
and to otherwise execute the N+2 instruction, skipping the N+1.

Given these branching semantics, the four level pipeline which we described above will have fetched the N+2 instruction and decoded the N+1 instruction when the control instruction at N is computed. However, it is not until the next cycle that the outcome of the Nth instruction is visible to the rest of the processor, by which time the N+1 instruction will have completed execution and the N+2 instruction will have completed decode.

In the case of this processor, it makes sense to allow the N+1 and N+2 instructions to be executed and decoded because adding extra control logic to remove (or squash) whichever instruction the outcome of the branch instruction indicates is not correct is trivial. However, in the general case of a deeper pipeline wherein it is possible that an entire loop iteration or loop exit sequence has been fetched and partially decoded/executed it makes no sense to add the complicated instruction removal logic to every stage. Instead the decode stage will detect that the instruction falls into the set of branching instructions and will disable the fetch stage until the conditional instruction has cleared writeback. Clearly this would be painful for our simple pipeline, for a 16 or 32 cycle pipeline this branching cost is positively evil.

So we need to guess two things: what the outcome of conditional instructions will be, and what the next PC will be for every instruction. The first of these turns out to be relatively easy. Most processor time is spent inside of loops, and loops tend to terminate with a “jump to the top” statement of some sort. Ergo, most of the branches which a processor will encounter will be “jump to the top” branch instructions which are far more often than not taken. In fact this bias towards taking “backwards” jumps is so strong that simply predicting taken on every branch can yield a prediction success rate of 80%!

Okay great. Now we can make at least a crude estimation of whether a given branch will be taken or not. However, for the branch instructions described this information isn’t very valuable. Because the “true” case is to execute the N+1 instruction and the “false” case is to skip N+1 and go for N+2, most code on this machine will have the form

1
2
3

[:if.. * * * * ]
[:add pc pc imm * ] ;; jump to true case
[:add pc pc imm * ] ;; jump to false case

or alternatively

1
2
3

[:if.. * * * * ]
[:add pc pc imm * ] ;; jump to the true case
.................... ;; inline false case

Note that this means we more than likely have two back to back control hazards! Oh joy. So we can somewhat correctly predict the outcome of a branch, but that doesn’t help the fetch stage fill our pipeline with useful instructions. So we need some more general structure for predicting the PC of the next instruction. Normally this is simple, the PC of the next instruction is simply the PC+4. However, as demonstrated with the instructions following conditionals, a highly meaningful set of control hazards are not presented by conditional instructions. This means that we need some more general structure for mapping the PC of one instruction to the PC of the next instruction. We can accomplish this by simply having a table which maps the entire valid PC address range [0, INT_MAX] to the PC which followed that address last time an instruction at that address was executed.

This however is problematic. Clearly we cannot build such a table, because it would be exactly the size of main DRAM memory. Thus the access time to our predictive data store would be as poor as the access time to our main memory! Entirely unacceptable! We want our branch prediction structure to be so fast that it can run alongside fetch! So now we start playing tricks to reduce the table size and improve the predictor performance. Clearly, most of a processor’s memory is used for storing data not for storing code so right off the bat we know that our worst case size is overkill. Also we know that as programs tend to spend most of their time in tight inner loops the address range for which we must be able to issue next address predictions is rather small. Also we know that most of the space in that table of next instruction addresses would be wasted because the overwhelming majority of addresses would simply indicate N+1. This suggests that we really want to be making two predictions: 1) is the next PC N+1, and if not 2) what is the next pc?

To make this first prediction, I propose that we adapt the classic bimodal branch predictor from McFarling. The bimodal predictor is bloody simple and can achieve a prediction accuracy of 94% or better. We will define a bimodal predictor table of 1024 entries, and on every fetch we will consult the table to see whether it predicts taken or not taken for the current index, computed by taking the bottom 9 bits of the current PC. If the prediction is not taken, then the next PC is PC+1.

By mapping PCs to two bit counters with values `[0,1]`

representing
branch not taken and `[2,3]`

representing branch taken. Using this
representation, given a single counter we can predict whether the next
PC of an instruction is PC+4 or whether it is some other value.

</a>

Not to hard to build, but now we need to build a predictive structure
out of these cells. Remember that we are using a bit vector (integer)
addressed table of cells to map from what are essentially hashed
`(history, PC)`

pairs to a counter, so to train a specific counter one
way or another we have to compute the table index and then delegate to
`two-bit-counter`

.

</a>

Great. So we can represent a table of our counters, but we need one
more piece, a way to make a decision from the counter table. Remember
that we defined `[0,1]`

to be not taken and `[2,3]`

to be taken, so we
can implement this as follows:

</a>

Okay, now we need to fit this predictor structure into our existing
processor. In order to do this, I’m simply going to create the
`:predictor`

substructure in the processor state, which will feature
the keys `#{:hst, :pred :jump-map}`

. `:hst`

will be a sequence of
booleans representing the branch history, `:pred`

will be our two bit
counter table, and `:jump-map`

will map PCs to their last jump target.

With this definition in hand we can write two training helpers functions to let us update the two bit counter tables in a given processor instance…

</a>

now we need to be able to update the processor branch history…

</a>

Finally, we need to be able to predict the next PC value using these various helpers and the predictive structure we’ve defined.

</a>

And that’s all the code we need to be able to interact with the processor’s predictor structures. Now we need to implement a fetch stage which will query the branch predictor structure to determine what the address of the next instruction is if it isn’t PC+4.

</a>

Note that this implementation has to preserve the original pipelined
processor’s behavior of respecting the stalled state of the pipeline,
in addition to providing next PC lookup from our branch predictor
structure. Also note that I have introduced the `:npc`

key to the
fetch value. This key obviously represents the predicted address of
the next instruction as it is no longer strictly equal to PC+4. This
is important because when we get to writeback we will need to be able
to determine whether or not a branch really needs to generate a flush
and a jump.

Now lets kick out the last interesting piece of this simulator, the
writeback stage. Writeback for branch predicted processors takes
responsibility for training, and needs to deal with a logical case
that our previous writeback stages have ignored. Previously, we’ve
assumed that jumps will never target the fetch PC of the next
instruction in the pipeline. However, if branch prediction is working
perfectly, the next instruction in the pipeline will *always* be the
target of the preceding branch. If we correctly branch to the
predicted branch target, then we want to positively train our
predictive structure because it made the right guess. If we branch to
somewhere else, then we have to incur a full pipeline stall (ouch!)
and train the predictive structure not to make the same mistake
again. However in the general case of “normal” instructions which do
not cause branching behavior we don’t need to update the predictor at
all because the predictor assumes that any instruction for which it
does not have prior information the next PC is PC+4, which is correct
for all non branch instructions. As we have already dealt with
training branches, we don’t need to do anything more training wise.

However, for every instruction processed we must update our branching
history, which means that when we process a branch instruction we have
to add a `1`

to the history sequence, and otherwise we add an `0`

respectively using `update-taken`

and `update-not-taken`

.

That’s it, so lets build a new writeback…

</a>

And that’s it! We only need a `step`

function and a `-main`

function,
neither of which is different at all from the previous pipeline
implementation(s) save for making reference to this new `fetch`

and
`writeback`

.

Now we can start to play more tricks with using more complicated branch predictors. In fact building better branch predictors is still an active research area. We can also play tricks with tracking more processor state. Real world branch predictors play all kinds of crazy tricks. One example of this is loop trip count prediction.

Branch prediction works well for frequently repeated loops, however the branch predictor presented here will always be wrong for the last iteration of a loop. We could imagine a branch predictor which determines the two values used in the loop back edge test and predicts how many more times the program will execute the loop exactly thus escaping the exit misprediction.

Now that we can fill the pipeline with mostly useful instructions, we need to find another avenue for doing more work per cycle. Next time we’ll look at a technique for exploiting instruction level parallelism: superscalar architectures!

^d