Building a register machine

Stack machines are really nice and simple to describe and build. It's also really easy to build a compiler for a stack machine because as shown above all a stack machine compiler has to do is rewrite the AST of the input program into data pushes and operation invocations.

Unfortunately, that's also all that stack machines have going for them. Examination of programs written for a stack machines shows that the majority of instructions are used to manage the state of the stack. Also, and here's the kicker, stack machines are bloody slow and can't get any faster. Why? Because every single primitive operation of the stack machine manipulates the stack.

Why is that a problem? Well consider the classical implementation of a stack in ANSI C

typed struct {
    int* base;
    int  size;
    int  used;
} stack;
       
int push(stack* s, int a) {
  if(s->used < size) {
    s->base[++s->used] = a;
        return 0;
    } else {
        return -1
    }
}
       
int pop(stack* s, int* dst) {
  if(s->used >= 1) {
    *dst = s->base[--s->used];
        s->used--;
        return 0;
  } else {
    return -1;
  }
}

Now forget about the potential type errors and so forth here. What is the memory behavior if this snippet? Every operation on the stack must effect both the element at the stack pointer, and the stack size counter! This means that every fundamental instruction of the stack machine must suffer the read and write delays of whatever memory is used to store the stack data at least four times!

Further examination of stack machine code shows that a stack machine will only operate upon the top few elements of the stack at a time. Remember why we had to put DRAM off chip? Because it was big! Here we clearly only need a small memory, which suggests that one can optimize out most of these painfully slow memory accesses by maintaining the top N (say 16 or 32) elements of the stack in extremely fast storage on the processor. Let's call this storage registers. We'll also store the stack counter and the program counter in registers so that the processor doesn't have to chase back and forth to memory for those either.

Okay great. Now rather than accessing memory all the time we can cache the top 32 or so elements of the stack in registers which are comparatively blazing fast, and we can describe some hardware which will automagically moves elements back and forth from the register stack to the DRAM stack every few cycles to ensure that the register stack stays at about 50% utilization.

Why do we need this special piece of hardware to keep the stack cache full? So that we (hopefully) never have to endure the insufferable DRAM read/write latency as part of executing a single instruction because both operands already existed on the comparatively extremely fast register "stack".

Now we can make this more efficient.. remember that we have to play all sorts of tricks with duplicating values on the stack to get the top elements of the stack to be the values we want? Well there is very little reason for that now when we have this extremely fast cache of the top 16 or more values on the stack, in fact we could just make our operations parametric on which of the cache registers we read from and write to thus saving ourselves and the compiler a lot of leg work in managing the data stack.

That's it! Now we've invented the register machine, or at least a simple one. Previously, we had a "zero operand machine", in that none of the operations had parameters encoded in the instruction. Pop for instance clearly manipulates the top element of the stack. Add was defined to use the top two elements of the stack. The only funky one was push, which read a value from the instruction stream and then wrote to the stack. With a register machine however, there is no clear order to the registers. This is a benefit, because it means that we can trivially express operations that store to anywhere on the "stack" or a load from anywhere on the stack.

This also means that we have to qualify all our operations with where they fetch data from and where they store their results to. To model this, I will describe what is called a four operand machine. The first operand is the ID of the register to which the result will be stored, the second and third operands are the sources of the arguments to be computed over and the fourth argument is a small inline constant in the instruction. This means that rather than being simply an instruction constant our instructions are now a sequence of five values, being the instruction dispatch constant followed by our four operands.

Now, we need a new simulator armed with some extra instruction execution logic and a system to write back to more than just the stack. Unfortunately we can't just run out and design an instruction set. Well.. we could but we'd wind up with something gnarly like the DCPU-16 which is complicated and difficult both to describe and implement.

The universe of instruction set architectures are typically divided into two categories: RISC and CISC. CISC or "Complex Instruction Set Architecture" is a catch all terms for older architectures such as VAX which tried to offer significant amounts of programmer utility out of the box. One famous example of a CISC instruction is the VAX "compute polynomial" instruction, which could evaluate a quadratic of the form Ax^2 + Bx + C, for A, B, C and X specified as register arguments. CISC instructions rarely finish in a single machine execution cycle.

RISC instructions on the other hand are the epitome of simplicity. Add. Subtract. Multiply. Division which takes more than one cycle to compute. Bitshifting. This style of design tends to have a handful of very simple instruction encodings which can be assembled into a standard libraries of CISC-like operations rather than baking the complexity of such libraries into the hardware. In the late 80s and 90s RISC machines took over from CISC machines for design reasons which will be discussed at length later in this series.

For the rest of this series, I shall be using an invented instruction set which I hereby name Batbridge. Batbridge is a RISC style, 32 bit, word addressed little endian machine architecture with a standard four operand instruction encoding:

[<op> <destination register> <src a> <src b> <11 bit signed int>]

This encodes to a single 32 bit word by (ab)use of C bit structs as

1
2
3
4
5
6
7
struct {
  int op:5;
  int dst:6;
  int a:6;
  int b:6;
  int i:11
} opcode;

So what will a simulator for this architecture look like? Well first off, we need an instruction set. the Batbridge spec, will serve this purpose well as it is minimal and simple to encode and decode.

Architecturally, our register machine simulator will share a fair bit with the stack machine similar of my previous post in that we enter the machine from a run function which operates by naively applying a step function until the machine stops. However, this step is gonna look a little different.

(defn step
  "Sequentially applies each phase of execution to a single processor
  state, returning an updated state representing a single cycle of
  computation's effects on the processor."

  [state]
  (-> state
      fetch
      decode
      execute
      writeback))

Why do I break a single cycle up into a bunch of steps? Because now we have a whole lot more book keeping to do. Every cycle we need to get or fetch an instruction from memory. Then we need to decode it from its binary representation to an internal representation not too unlike the vector instruction which I show above because that's easier for us to reason about and for the program to work with. Obviously we have to execute the instruction, which will handle all the overhead of figuring out what values are supposed to be used and performing a computation. And finally we need to change the state of the processor, which I shall do by "writing back" to memory and the register array in writeback.

fetch is pretty cute... it just reads an instruction from memory and stores it in the processor state where decode can find it.

(defn fetch
  "Accesses the value of the PC register, fetches the instruction
  there from memory and sets the :fetch value key in the
  state. Increments the PC by the instruction width. Returns an
  updated state."

  [processor]
  (let [pc (get-in processor [:registers 31])
        icode (get-in processor [:memory pc])
        npc (+ pc 4)]
    (println "[fetch ]" pc "->" icode)
    (-> processor
        (assoc-in [:registers 31] npc)
        (assoc :fetch {:icode icode :pc npc}))))

Decode is actually fairly complicated, because we have to do a bunch of legwork to translate a single byte encoded instruction into something reasonable. Most of the leg work will be done by a decoder helper which lets us cheat and write a really simple stage:

(defn decode
  "Decodes the instruction fetched by the fetch stage and makes sure
  that it is formatted for the benefit of the execute stage."

  [processor]
  (let [{:keys [icode pc] :as fetch}
            (get processor :fetch {:icode isa/vec-no-op
                                   :pc -1})]
    (println "[decode ]" icode)
    (println "[decode ]" (-> icode
                               isa/decode-instr
                               common/fmt-instr))
    (as-> icode v
          (isa/decode-instr v)
          (assoc v :pc pc)
          (assoc processor :decode v))))

Okay that's fetch and decode down... now we just need execute and writeback and we'll have a working processor more or less. Execute really only has to do three jobs. Since decode has already figured out what the instruction is, execute needs to get the data to which the operation will be applied and invoke the appropriate op. Remember that back in the stack machine we used a simple map to translate opcodes to executable functions? We'll pull that trick again...

(defn execute
  "Indexes into the opcode->fn map to fetch the implementation of the
  last decoded instruction. Decodes the parameter values, and applies
  the implementation function to the parameters. Sets the :execute key
  with a state update directive which can use. Returns the updated
  processor state."

  [processor]
  (let [{:keys [icode a b d i pc] :as decode}
        (get processor :decode
             isa/map-no-op)
        srca (common/register->val processor a pc i)
        srcb (common/register->val processor b pc i)]
    (println "[execute ]" decode)
    (as-> icode v
          (get isa/opcode->fn v) ;; maps instructions to functions
          (v srca srcb processor d)
          (common/upgrade-writeback-command v)
          (assoc v :pc (get-in processor [:decode :pc]))
          (assoc processor :execute v))))

Okay, so ultimately execute emits a command for the writeback stage which may indicate some change is required in the processor state. Since this machine can only function as intended if these commands go through we need to handle them with writeback.

(defn writeback
  "Pulls a writeback directive out of the processor state, and
  performs the indicated update on the processor state. Update command
  have been restructured and are now maps
  {:dst (U :registers :halt :memory) :addr Int :val Int}."

  [processor]
  (let [directive              (get processor :execute isa/writeback-no-op)
        {:keys [dst addr val]} directive]
    (println "[writeback]" directive)
    (cond ;; special case to stop the show
          (= :halt dst)
            (assoc processor :halted true)
          
          ;; special case for hex code printing
          (and (= :registers dst)
               (= 29 addr))
            (do (when-not (zero? val)
                  (print (format "0x%X" (char val))))
                processor)

          ;; special case for printing
          (and (= :registers dst)
               (= 30 addr))
            (do (when-not (zero? val)
                  (print (char val)))
                processor)
            
          ;; special case for branching as we must flush the pipeline
          (and (= :registers dst)
               (= 31 addr))
            (do (println "[writeback] flushing pipeline!")
                (-> processor
                    (assoc-in [:registers addr] val)))

          true
            (assoc-in processor [dst addr] val))))

And that's all we need! There is a little bit of plumbing which I have hidden, but all the high points have been presented here. The full simulator which I have just walked through is available here on github with some minor alterations if you actually want to build and run it.

Next time we will resume our unending pursuit of faster clock speeds by examining caching strategies and their impact on the performance of high memory use processors.

^d