A pipelined machine

05 Jan 2014

The last processor which we invented was a register machine. Now, we are going to make another drastic revision to our processor in the name of speed, although this time we will succeed in retaining the same programming model, a feat which we did not accomplish when we discarded the stack machine.

Inside of a “single cycle” register machine as we described previously and in a stack machine a single operation (for instance adding two operands or writing a value to memory) is defined by the instruction set to take exactly one cycle. Now, because we have instructions which can access memory “in one cycle” for instance our load and store operations, a single cycle for our processor has a lower bound in time equal to the memory read/write time.

Recall the definition of step from last time

(defn step [state]
  (-> state
      fetch
      decode
      execute
      writeback))

</a>

With this step function, each “clock cycle” we sequentially evaluate the result of fetching an instruction, decoding an instruction, executing the instruction and then updating the logical state of the processor with the result of that computation. This is great for a simulator, but think about the implications if you were to build a chip that operates this way. The hardware clock ticks, a signal starts propagating through fetch which reads an instruction, then that instruction value flows through decode, then the decoded value flows through execution, then the executed result flows into writeback, and finally the clock ticks again. At each point in this process, all the hardware logic before the “step” through which signals are currently flowing has yielded a result and is sitting idle waiting for inputs again. What if we could always have a fresh signal for each element of this process so that assembly line style no component ever has to stop and wait for something farther up the data stream? In hardware, we can do this by inserting latches at the borders of each “stage” and then reducing the cycle time to the worst case signal propagation time of all of the stages.

Naively implementing this in our simulator is embarrassingly easy

1
2
3
4
5
6
(defn step [state]
  (-> state
      writeback
	  execute
	  decode
	  fetch))

Why does this do the trick? Because our processor model we previously is designed around storing the “return” value of each “stage” into the state object. Here, by running the processor in reverse, we use the state object to simulate the small memories which I mentioned we needed to buffer values between one stage and the next in a hardware pipeline.

So what does executing a simple program on this machine look like? Well let’s take a simple program:

1
2
3
4
5
[:add 1 1 :r_IMM 15]
[:add 2 2 :r_IMM 15]
[:add 3 3 :r_IMM 15]
[:add 4 4 :r_IMM 15]
[:halt 0 0 0 0]

And what does it do? With each line representing a machine cycle, and the instruction [:add 30 30 30 0] (add the zero register to the zero register and write the sum to the zero register) being the default “do nothing” instruction which each stage processes in lieu of having “real” work to do a chart of the processor state would look like this:

1
2
3
4
5
6
7
8
9
10
11
; fetch              decode               execute              writeback
;; machine start
[:add 1 1 :r_IMM 15] [:add 30 30 30 0]    [:add 30 30 30 0]    [:add 30 30 30 0]
[:add 2 2 :r_IMM 15] [:add 1 1 :r_IMM 15] [:add 30 30 30 0]    [:add 30 30 30 0]
[:add 3 3 :r_IMM 15] [:add 2 2 :r_IMM 15] [:add 1 1 :r_IMM 15] [:add 30 30 30 0]
[:add 4 4 :r_IMM 15] [:add 3 3 :r_IMM 15] [:add 2 2 :r_IMM 15] [:add 1 1 :r_IMM 15]
[:hlt 0 0 0 0]       [:add 4 4 :r_IMM 15] [:add 3 3 :r_IMM 15] [:add 2 2 :r_IMM 15]
[:add 30 30 30 0]    [:hlt 0 0 0 0]       [:add 4 4 :r_IMM 15] [:add 3 3 :r_IMM 15]
[:add 30 30 30 0]    [:add 30 30 30 0]    [:hlt 0 0 0 0]       [:add 4 4 :r_IMM 15]
[:add 30 30 30 0]    [:add 30 30 30 0]    [:add 30 30 30 0]    [:hlt 0 0 0 0]
;; machine halted

So on the first cycle, we fetch our first instruction and the rest of the pipeline being empty does nothing. On the second cycle we fetch the second instruction and decode the first instruction with the rest of the pipeline doing nothing. On the third cycle we execute the first instruction we fetched, decode the second instruction we fetched and fetch a third instruction. The writeback stage still has nothing useful to do. This stepping pattern continues until all instructions are executed and the machine halts.

Now there is an issue lurking here, what if we have an instruction stream that looks like this….

1
2
3
4
[:add 0 30 29 1]
[:add 0 0 29 1]
[:add 0 0 29 1]
[:add 0 0 29 1]

Just a bunch of +1 operations on a single register. Wasteful to be sure, but the point is that each operation relies upon the final value produced by the previous operation. Unfortunately, this is a problem for our pipelined processor design because while a value may be computed in the execute stage, and the register address to which it will be stored is fully known at that time the value has not yet become part of the processor’s steady state and cannot be accessed by the next cycle of execute. So we have to pause part of the processor and wait until the value becomes visible.

For the pipeline described here, this is implemented by providing extra logic in the decode stage which checks if the destination of the execute stage is an argument to the instruction it just decoded. If this is the case, then the decode stage resets itself and the fetch stage by one cycle so that in the next cycle the instruction will be repeated and will succeed as the depended value will have been written back.

For the most part these changes are trivial to add to our processor, consisting simply of predicate checks to make sure that the processor is not in a stalled state. However, the decode implementation is interesting…

(defn decode
  "Checks the stall counter, invoking ss/decode and checking for a
  pipeline hazzard in the result if zero otherwise returning the
  input state unmodified due to a pipeline stall."

  [processor]
  (if (stalled? processor)
    processor
    (let [next-processor (ss/decode processor)
          {:keys [decode execute]} next-processor
          {:keys [dst addr]} execute]
      (if (and (= :registers dst)
               (contains? (-> #{}
                              (into [(:a decode) (:b decode)])
                              (disj 30 29))
                          addr))
        (do (println "[decode ] stalling the pipeline!")
            (-> processor
                (assoc :stall 1)
                (dissoc :decode)))
        next-processor))))

</a>

By augmenting the fetch stage with a check which causes it to do no work when the stall count is nonzero, we can achieve the desired behavior of attempting to issue the same instruction two cycles in a row. We just need a simple change to our step operation which decreases the stall counter by one every iteration. Our new step is simply

(defn stall-dec
  "A dec which deals with a nil argument case, and has a floor value
  of 0."

  [nillable-value]
  (let [v (or nillable-value 0)]
    (max 0 (dec v))))


(defn step
  "Sequentially applies each phase of execution to a single processor
  state, returning an updated state representing a single cycle of
  computation's effects on the processor. Because this is a pipelined
  design, we simply take the in-order steps and run them in the
  reverse order using the processor state as a set of latches for
  storing the state between clock 'cycles'."

  [state]
  (-> state
      writeback
      ss/execute
      decode
      fetch
      (update-in [:stall] stall-dec)))

</a>

It is worth noting that stalls are not totally unavoidable. In some cases stalls also be resolved by providing data forwarding, where the execute stage “stores” the depended on value for a cycle without making it part of the documented processor state in a register, thus escaping a single stall cycle however in general this is not possible. Here we are considering only a single cycle required for instruction execution where real processors have 8 and 16 cycle execution sub-pipelines which demand stalling.

That’s all for now, but I’ll leave you with this. Again, full code is on Github for your convinience. Since execute now lags fetch by many cycles, what can we do to the fetch stage to ensure that the instructions fetched into the pipeline are the instructions which conditional and control instructions not yet evaluated will indicate? next time we will discuss one technique for achieving this end, branch prediction!