A pipelined machine

05 Jan 2014

The last processor we built was a register machine. Now we are going to make another drastic revision to our processor in the name of speed, although this time we will succeed in retaining the same programming model, a feat we did not accomplish when we discarded the stack machine.
Inside a “single cycle” register machine, as we described previously, and in a stack machine, a single operation (for instance adding two operands or writing a value to memory) is defined by the instruction set to take exactly one cycle. Because we have instructions which can access memory “in one cycle”, for instance our load and store operations, a single cycle for our processor has a lower bound in time equal to the memory read/write time.
Recall the definition of step from last time.
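As a minimal sketch, that single-cycle step threads the state through each stage in program order, assuming the fetch, decode, execute, and writeback stage functions from the previous post:

```clojure
;; Sketch of the single-cycle step: every stage runs to completion,
;; in program order, within one "cycle". The stage functions are
;; assumed from the previous post.
(defn step [state]
  (-> state
      fetch
      decode
      execute
      writeback))
```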
With this step function, each “clock cycle” we sequentially evaluate the result of fetching an instruction, decoding it, executing it, and then updating the logical state of the processor with the result of that computation. This is great for a simulator, but think about the implications if you were to build a chip that operates this way. The hardware clock ticks, a signal starts propagating through fetch which reads an instruction, then that instruction value flows through decode, then the decoded value flows through execute, then the executed result flows into writeback, and finally the clock ticks again. At each point in this process, all the hardware logic upstream of the stage through which signals are currently flowing has already yielded its result and is sitting idle waiting for inputs.

What if we could always have a fresh signal for each element of this process, so that, assembly-line style, no component ever has to stop and wait for something farther up the data stream? In hardware, we can do this by inserting latches at the borders of each “stage” and then reducing the cycle time to the worst-case signal propagation time of all of the stages.
Naively implementing this in our simulator is embarrassingly easy:

(defn step [state]
  (-> state
      writeback
      execute
      decode
      fetch))
Why does this do the trick? Because the processor model we built previously is designed around storing the “return” value of each “stage” into the state object. Here, by running the stages in reverse, we use the state object to simulate the small memories which I mentioned we needed to buffer values between one stage and the next in a hardware pipeline.
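To see why the ordering alone buys us latches, here is a toy two-stage example (all names here are illustrative, not the real simulator): each stage writes its result into the state map, and because the later stage runs first, it consumes the value its predecessor latched on the previous cycle.

```clojure
;; Toy illustration of the latch idea. Running the later stage first
;; means it reads the value stage-a wrote on the *previous* tick,
;; exactly like a hardware latch between the two stages.
(defn stage-b [state]
  (assoc state :b-out (str "B(" (:a-out state) ")")))

(defn stage-a [state]
  (assoc state :a-out (str "A(" (:in state) ")")))

(defn tick [state]
  (-> state stage-b stage-a))   ;; later stages first, as in the pipelined step

(-> {:in "x" :a-out nil :b-out nil}
    tick     ;; cycle 1: stage-a produces A(x); stage-b saw an empty latch
    tick)    ;; cycle 2: stage-b consumes the A(x) latched last cycle
;; => {:in "x", :a-out "A(x)", :b-out "B(A(x))"}
```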
So what does executing a simple program on this machine look like? Well, let’s take a simple program:

[:add 1 1 :r_IMM 15]
[:add 2 2 :r_IMM 15]
[:add 3 3 :r_IMM 15]
[:add 4 4 :r_IMM 15]
[:hlt 0 0 0 0]
And what does it do? With each line representing a machine cycle, and the instruction [:add 30 30 30 0] (add the zero register to the zero register and write the sum to the zero register) being the default “do nothing” instruction which each stage processes in lieu of having “real” work to do, a chart of the processor state would look like this:
; fetch decode execute writeback
;; machine start
[:add 1 1 :r_IMM 15] [:add 30 30 30 0] [:add 30 30 30 0] [:add 30 30 30 0]
[:add 2 2 :r_IMM 15] [:add 1 1 :r_IMM 15] [:add 30 30 30 0] [:add 30 30 30 0]
[:add 3 3 :r_IMM 15] [:add 2 2 :r_IMM 15] [:add 1 1 :r_IMM 15] [:add 30 30 30 0]
[:add 4 4 :r_IMM 15] [:add 3 3 :r_IMM 15] [:add 2 2 :r_IMM 15] [:add 1 1 :r_IMM 15]
[:hlt 0 0 0 0] [:add 4 4 :r_IMM 15] [:add 3 3 :r_IMM 15] [:add 2 2 :r_IMM 15]
[:add 30 30 30 0] [:hlt 0 0 0 0] [:add 4 4 :r_IMM 15] [:add 3 3 :r_IMM 15]
[:add 30 30 30 0] [:add 30 30 30 0] [:hlt 0 0 0 0] [:add 4 4 :r_IMM 15]
[:add 30 30 30 0] [:add 30 30 30 0] [:add 30 30 30 0] [:hlt 0 0 0 0]
;; machine halted
So on the first cycle, we fetch our first instruction and the rest of the pipeline, being empty, does nothing. On the second cycle, we fetch the second instruction and decode the first, with the rest of the pipeline doing nothing. On the third cycle, we execute the first instruction we fetched, decode the second, and fetch a third. The writeback stage still has nothing useful to do. This stepping pattern continues until all instructions are executed and the machine halts.
Now there is an issue lurking here. What if we have an instruction stream that looks like this?

[:add 0 30 29 1]
[:add 0 0 29 1]
[:add 0 0 29 1]
[:add 0 0 29 1]
Just a bunch of +1 operations on a single register. Wasteful to be sure, but the point is that each operation relies upon the final value produced by the previous operation. Unfortunately, this is a problem for our pipelined processor design: while a value may be computed in the execute stage, and the register address to which it will be stored is fully known at that time, the value has not yet become part of the processor’s steady state and cannot be accessed by the next cycle of execute. So we have to pause part of the processor and wait until the value becomes visible.
For the pipeline described here, this is implemented by providing extra logic in the decode stage which checks whether the destination of the instruction in the execute stage is an argument to the instruction it just decoded. If so, the decode stage resets itself and the fetch stage by one cycle, so that on the next cycle the instruction will be repeated and will succeed, the value it depends on having been written back.
For the most part these changes are trivial to add to our processor, consisting simply of predicate checks to make sure that the processor is not in a stalled state. However, the decode implementation is interesting…
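A sketch of what such a decode could look like — the state keys, the stall counter, the accessor names, and the nop encoding are all assumptions, not the post’s real listing:

```clojure
;; Sketch only. Assumed: the fetched instruction sits in the
;; :fetch->decode latch, decode's output goes to :decode->execute,
;; and a nonzero :stall count makes fetch idle so the latched
;; instruction is retried.
(def nop [:add 30 30 30 0])   ;; the "do nothing" instruction from above

(defn hazard? [decoded executing]
  ;; True when the instruction in execute will write a register that
  ;; the freshly decoded instruction reads. The destination/sources
  ;; accessors are assumed helpers.
  (contains? (set (sources decoded)) (destination executing)))

(defn decode [state]
  (let [icode (:fetch->decode state)]
    (if (hazard? icode (:decode->execute state))
      (-> state
          (assoc :stall 1)                 ;; fetch idles; icode stays latched
          (assoc :decode->execute nop))    ;; send a bubble down the pipe
      (assoc state :decode->execute icode))))
```

The key design point is that on a hazard, decode issues a bubble downstream while leaving the fetch latch untouched, so the same instruction is decoded again on the next cycle.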
By augmenting the fetch stage with a check which causes it to do no work when the stall count is nonzero, we can achieve the desired behavior of attempting to issue the same instruction two cycles in a row. We just need a simple change to our step operation which decreases the stall counter by one every iteration. Our new step is simply…
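A sketch of that revised step (the :stall key name is an assumption): run the stages in reverse as before, then tick any pending stall count down toward zero.

```clojure
;; Pipelined step with stall countdown. The stage functions are
;; assumed from above; (fnil f 0) treats a missing :stall as zero.
(defn step [state]
  (-> state
      writeback
      execute
      decode
      fetch
      (update-in [:stall] (fnil #(max 0 (dec %)) 0))))
```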
It is worth noting that stalls are not always unavoidable. In some cases stalls can be resolved by providing data forwarding, where the execute stage “stores” the depended-on value for a cycle without making it part of the documented processor state in a register, thus escaping a single stall cycle; in general, however, this is not possible. Here we are considering only a single cycle for instruction execution, where real processors may have 8 or 16 cycle execution sub-pipelines which demand stalling.
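As a sketch of the forwarding idea (the latch layout and key names are assumptions), a register-read helper used by execute could check the writeback latch before falling back to the architectural register file:

```clojure
;; Forwarding sketch: if the value sitting in the writeback latch is
;; destined for the register we want, use it directly instead of
;; waiting a cycle for it to land in the register file. The
;; :execute->writeback latch shape is assumed.
(defn read-reg [state r]
  (let [{:keys [dst val]} (:execute->writeback state)]
    (if (= dst r)
      val                                   ;; forwarded, not yet architectural
      (get-in state [:registers r]))))
```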
That’s all for now, but I’ll leave you with this. As always, the full code is on GitHub for your convenience. Since execute now lags fetch by several cycles, what can we do to the fetch stage to ensure that the instructions fetched into the pipeline are the ones which conditional and control instructions not yet evaluated will indicate? Next time we will discuss one technique for achieving this end: branch prediction!
^d