Building a branch predicted machine

Last time I signed off with a question: Given a fetch stage which runs many cycles ahead of the processor's fully evaluated steady state, how can we ensure that the fetch stage is filling the pipeline with useful instructions and not just wasting time filling the processor with crap that will have to be thrown away?

In order for such a thing to be possible, we must somehow be able to predict the state which the processor will reach after performing computations, without performing the requisite computation. We must, quite literally, predict the future. The accuracy with which we can do this will dramatically impact the performance of our processors by ensuring that we waste as few cycles as possible.

What makes us waste cycles in a pipelined processor design? We have value hazards which we already partially dealt with, and we have control hazards. Control hazard is the term for instructions fetched after a control flow (also known as a conditional or branch) instruction is fetched which will impact whether the instructions should be executed or not. Say we define the "if*" set of instructions to execute the N+1 instruction if and only if the predicate is true, and to otherwise execute the N+2 instruction, skipping the N+1.

Given these branching semantics, the four level pipeline which we described above will have fetched the N+2 instruction and decoded the N+1 instruction when the control instruction at N is computed. However, it is not until the next cycle that the outcome of the Nth instruction is visible to the rest of the processor, by which time the N+1 instruction will have completed execution and the N+2 instruction will have completed decode.

In the case of this processor, it makes sense to allow the N+1 and N+2 instructions to be executed and decoded because adding extra control logic to remove (or squash) whichever instruction the outcome of the branch instruction indicates is not correct is trivial. However, in the general case of a deeper pipeline wherein it is possible that an entire loop iteration or loop exit sequence has been fetched and partially decoded/executed it makes no sense to add the complicated instruction removal logic to every stage. Instead the decode stage will detect that the instruction falls into the set of branching instructions and will disable the fetch stage until the conditional instruction has cleared writeback. Clearly this would be painful for our simple pipeline, for a 16 or 32 cycle pipeline this branching cost is positively evil.

So we need to guess two things: what the outcome of conditional instructions will be, and what the next PC will be for every instruction. The first of these turns out to be relatively easy. Most processor time is spent inside of loops, and loops tend to terminate with a "jump to the top" statement of some sort. Ergo, most of the branches which a processor will encounter will be "jump to the top" branch instructions which are far more often than not taken. In fact this bias towards taking "backwards" jumps is so strong that simply predicting taken on every branch can yield a prediction success rate of 80%!

Okay great. Now we can make at least a crude estimation of whether a given branch will be taken or not. However, for the branch instructions described this information isn't very valuable. Because the "true" case is to execute the N+1 instruction and the "false" case is to skip N+1 and go for N+2, most code on this machine will have the form

1
2
3
[:if.. *  *  *   * ]
[:add  pc pc imm * ] ;; jump to true case
[:add  pc pc imm * ] ;; jump to false case

or alternatively

1
2
3
[:if.. *  *  *   * ]
[:add  pc pc imm * ] ;; jump to the true case
.................... ;; inline false case

Note that this means we more than likely have two back to back control hazards! Oh joy. So we can somewhat correctly predict the outcome of a branch, but that doesn't help the fetch stage fill our pipeline with useful instructions. So we need some more general structure for predicting the PC of the next instruction. Normally this is simple, the PC of the next instruction is simply the PC+4. However, as demonstrated with the instructions following conditionals, a highly meaningful set of control hazards are not presented by conditional instructions. This means that we need some more general structure for mapping the PC of one instruction to the PC of the next instruction. We can accomplish this by simply having a table which maps the entire valid PC address range [0, INT_MAX] to the PC which followed that address last time an instruction at that address was executed.

This however is problematic. Clearly we cannot build such a table, because it would be exactly the size of main DRAM memory. Thus the access time to our predictive data store would be as poor as the access time to our main memory! Entirely unacceptable! We want our branch prediction structure to be so fast that it can run alongside fetch! So now we start playing tricks to reduce the table size and improve the predictor performance. Clearly, most of a processor's memory is used for storing data not for storing code so right off the bat we know that our worst case size is overkill. Also we know that as programs tend to spend most of their time in tight inner loops the address range for which we must be able to issue next address predictions is rather small. Also we know that most of the space in that table of next instruction addresses would be wasted because the overwhelming majority of addresses would simply indicate N+1. This suggests that we really want to be making two predictions: 1) is the next PC N+1, and if not 2) what is the next pc?

To make this first prediction, I propose that we adapt the classic bimodal branch predictor from McFarling. The bimodal predictor is bloody simple and can achieve a prediction accuracy of 94% or better. We will define a bimodal predictor table of 1024 entries, and on every fetch we will consult the table to see whether it predicts taken or not taken for the current index, computed by taking the bottom 9 bits of the current PC. If the prediction is not taken, then the next PC is PC+1.

By mapping PCs to two bit counters with values [0,1] representing branch not taken and [2,3] representing branch taken. Using this representation, given a single counter we can predict whether the next PC of an instruction is PC+4 or whether it is some other value.

(defn two-bit-counter [v op]
  (case op
    (:inc) (min (inc (or v 2)) 3)
    (:dec) (max (dec (or v 2)) 0)))

Not to hard to build, but now we need to build a predictive structure out of these cells. Remember that we are using a bit vector (integer) addressed table of cells to map from what are essentially hashed (history, PC) pairs to a counter, so to train a specific counter one way or another we have to compute the table index and then delegate to two-bit-counter.

(defn train-pred [p addr hst res]
  (let [a [(bit-xor (bit-and 0x1FF addr)
                    (vec->bitv hst))]]
    (case res
      (:taken)     (update-in p a two-bit-counter :inc)
      (:not-taken) (update-in p a two-bit-counter :dec))))

Great. So we can represent a table of our counters, but we need one more piece, a way to make a decision from the counter table. Remember that we defined [0,1] to be not taken and [2,3] to be taken, so we can implement this as follows:

(defn predict-pred [p addr hst]
  (>= (get p (bit-xor addr (vec->bitv hst)) 2) 2))

Okay, now we need to fit this predictor structure into our existing processor. In order to do this, I'm simply going to create the :predictor substructure in the processor state, which will feature the keys #{:hst, :pred :jump-map}. :hst will be a sequence of booleans representing the branch history, :pred will be our two bit counter table, and :jump-map will map PCs to their last jump target.

With this definition in hand we can write two training helpers functions to let us update the two bit counter tables in a given processor instance...

(defn train-jump
  [processor this next]
  (let [{:keys [hst]} (:predictor processor)]
    (-> processor
        (update-in [:predictor :pred] train-pred this hst :taken)
        (update-in [:predictor :jump-map] assoc this next))))


(defn train-step
  [processor]
  (let [{:keys [hst]} (:predictor processor)
        this          (common/get-register processor 31)]
    (-> processor
        (update-in [:predictor :pred] train-pred this hst :not-taken))))

now we need to be able to update the processor branch history...

(defn update-history
  [processor outcome]
  (update-in processor [:predictor :hst]
             (fn [hist]
               (into (ring-buffer 9)
                     (conj hist outcome)))))


(defn update-taken
  [processor]
  (update-history processor true))


(defn update-not-taken
  [processor]
  (update-history processor false))

Finally, we need to be able to predict the next PC value using these various helpers and the predictive structure we've defined.

(defn next-pc
  "Examines a processor state, determining the next value for the
  PC. Note that this relies on support from writeback to record PC
  value transitions when jumps are executed."

  [processor]
  (let [{:keys [jump-map pred hst]} (:predictor processor)
        pc                          (common/get-register processor 31)]
    (if (and (contains? jump-map pc)
             (predict-pred pred pc hst))
      (get jump-map pc)
      (+ 4 pc))))

And that's all the code we need to be able to interact with the processor's predictor structures. Now we need to implement a fetch stage which will query the branch predictor structure to determine what the address of the next instruction is if it isn't PC+4.

(defn fetch
  "Checks the stall counter, invoking ss/fetch if zero otherwise
  returning the input state unmodified due to a pipeline stall."

  [processor]
  (if (p/stalled? processor)
    processor
    (let [pc    (common/get-register processor 31)
          icode (common/get-memory processor pc)
          npc   (next-pc processor)]
      (info "[fetch    ]" pc "->" icode " npc:" npc)
      (-> processor
          (assoc-in [:registers 31] npc)
          (assoc :fetch {:icode icode
                         :npc   npc
                         :pc    (+ pc 4)})))))

Note that this implementation has to preserve the original pipelined processor's behavior of respecting the stalled state of the pipeline, in addition to providing next PC lookup from our branch predictor structure. Also note that I have introduced the :npc key to the fetch value. This key obviously represents the predicted address of the next instruction as it is no longer strictly equal to PC+4. This is important because when we get to writeback we will need to be able to determine whether or not a branch really needs to generate a flush and a jump.

Now lets kick out the last interesting piece of this simulator, the writeback stage. Writeback for branch predicted processors takes responsibility for training, and needs to deal with a logical case that our previous writeback stages have ignored. Previously, we've assumed that jumps will never target the fetch PC of the next instruction in the pipeline. However, if branch prediction is working perfectly, the next instruction in the pipeline will always be the target of the preceding branch. If we correctly branch to the predicted branch target, then we want to positively train our predictive structure because it made the right guess. If we branch to somewhere else, then we have to incur a full pipeline stall (ouch!) and train the predictive structure not to make the same mistake again. However in the general case of "normal" instructions which do not cause branching behavior we don't need to update the predictor at all because the predictor assumes that any instruction for which it does not have prior information the next PC is PC+4, which is correct for all non branch instructions. As we have already dealt with training branches, we don't need to do anything more training wise.

However, for every instruction processed we must update our branching history, which means that when we process a branch instruction we have to add a 1 to the history sequence, and otherwise we add an 0 respectively using update-taken and update-not-taken.

That's it, so lets build a new writeback...

(defn writeback
  "Pulls a writeback directive out of the processor state, and
  performs the indicated update on the processor state. Update command
  have been restructured and are now maps
  {:dst (U :registers :halt :memory) :addr Int :val Int}."

  [processor]
  (let [directive (get processor :execute [:registers 30 0])
        {:keys [dst addr val pc npc]} directive]
    (info "[writeback]" directive)
    (cond ;; special case to stop the show
          (= :halt dst)
            (assoc processor :halted true)

          ;; special case for hex code printing
          (and (= :registers dst)
               (= 29 addr))
            (do (when-not (zero? val)
                  (print (format "0x%X" (char val))))
                processor)

          ;; special case for printing
          (and (= :registers dst)
               (= 30 addr))
            (do (when-not (zero? val)
                  (print (char val)))
                processor)

          ;; special case for branching as we must flush the pipeline
          (and (= :registers dst)
               (= 31 addr)
               (not (= val npc))) ;; don't flush if we aren't changing
                                  ;; the next PC value. This means to
                                  ;; jumping to PC+4 does exactly
                                  ;; nothing as it should.
            (do (warn "[writeback] flushing pipeline!")
                (-> processor
                    (update-taken)
                    (train-jump (- pc 4) val)
                    (dissoc :fetch)
                    (dissoc :decode)
                    (dissoc :execute)
                    (assoc-in [:registers addr] val)))

          ;; special case for correctly predicted branching
          (and (= :registers dst)
               (= 31 addr)
               (= val npc)) ;; In this case we need to reinforce the
                            ;; branch prediction.
            (-> processor
                (update-taken)
                (train-jump (- pc 4) val))

          true
            (-> processor
                (update-not-taken)
                (assoc-in [dst addr] val)))))

And that's it! We only need a step function and a -main function, neither of which is different at all from the previous pipeline implementation(s) save for making reference to this new fetch and writeback.

Now we can start to play more tricks with using more complicated branch predictors. In fact building better branch predictors is still an active research area. We can also play tricks with tracking more processor state. Real world branch predictors play all kinds of crazy tricks. One example of this is loop trip count prediction.

Branch prediction works well for frequently repeated loops, however the branch predictor presented here will always be wrong for the last iteration of a loop. We could imagine a branch predictor which determines the two values used in the loop back edge test and predicts how many more times the program will execute the loop exactly thus escaping the exit misprediction.

Now that we can fill the pipeline with mostly useful instructions, we need to find another avenue for doing more work per cycle. Next time we'll look at a technique for exploiting instruction level parallelism: superscalar architectures!

^d