Building an out of order machine

08 Jan 2014

Okay. Previously we tried to get more parallelism and work per cycle out of our machines by running two streams of instructions more or less in parallel by having two parallel synchronized pipelines. But we hit a snag with this: data dependencies in between pipelines forced us to stall, thus loosing us whatever benefit we had hoped to reap from instruction level parallelism.

We can make some performance gains by informing compiler writers of the internal workings of our pipeline(s) and giving them hints about how best to arrange instructions so as to minimize stalls incurred by data dependencies, but there are limits to how insane compilers and compiler writers are willing to go in optimizing for a specific architecture. Besides, what if we could detect instruction level parallelism in hardware and take advantage of it in hardware rather than relying on smarter compilers to play those tricks on our behalf.

Well, how did we do data dependency detection? We retained a small data structure representing the registers to which changes were being computed and we stalled the pipeline until the requisite changes had been fully realized. So what happens when we have an instruction N which suffers from a data dependency, and an N+1 that is not so bound? N+1 gets hosed and has to wait in line until N clears the pipeline. We can do better than this! By creating some sort of structure representing instructions which are waiting to be executed, we can fill the execution pipeline every cycle by simply selecting the next instruction which is not data bound.

But what about our programming model? As long as we’ve been designing hardware the assumption is that if the Nth instruction makes a change all instructions after that Nth instruction see the steady state left by the Nth instruction. Therefor if we allow the N+2nd instruction to make visible changes before the Nth instruction, we have violated our programming model and programs may break! To resolve this we need to be able to reorder results. We will add a 5th stage to our pipeline, in between execution and writeback, which buffers writeback commands and only allows them to be written back in the order of nominal single cycle instruction execution.

What happens then if we have a data dependency between an instruction which has already been executed out of order but has not escaped reorder and another instruction in the issue buffer which would be good to go if the value stalled in reorder were visible? Can we somehow resolve this unneeded stall? In order to resolve this, we need some way to make buffered reorder values visible to the out of order instruction stream in addition to the values in registers. Know what the reorder stage is starting to sound like? Another register array! What if we extended our register set to include a large number of programmer-invisible registers, and then achieved out of order behavior by translating the operands of instructions in the issue buffer to aliases in this range of hidden registers. So an addition operation which nominally reads registers 0 and 1, storing its result to register 2 may in fact read from two arbitrary hidden registers containing the out of order future values of registers 0 and 1, and then write the future value of register 2 to another alias register storing a mapping from this alias for the future value of 2 to the register 2 as the buffered writeback command in the reorder stage.

Now get this: as long as this mapping from architectural register 2 to the renamed future value of register 2 is preserved, the issue stage is free to take the next instruction which reads the value of architectural register 2, rewrite it to read from the alias for the future value of register 2 and execute it, writing back to a new alias register. So long as the out of order processor does not run out of space in its alias table and continues to write back renamings from the reorder stage this processor can follow individual data dependency streams arbitrarily far into the future!

Now this comes with a caveat: really awesome branch prediction now becomes even more crucial, and we need some way to uniquely and monotonically identify instructions in issue and reorder with respect to some previous state of the PC. This is all because if a jump is misspredicted, then correcting the missprediction involves dumping all processor the internal state representing staged future values which are now incorrect with respect to the confirmed steady state of the processor.

And that’s all I’ve got. There are more funky architectures out there, but every major desktop and server processor in the last 20 years or so can be constructed from these principles. There are some things I’ve left out: vector processors and data flow processors, but vector processors tend to be special purpose computers (graphics cards) and data flow processors don’t exist yet so I’m content to call it a wrap here. I hope you enjoyed the ride and learned something!