A single cycle processor

This time last year, I had the pleasure of taking a class in computer architecture with Ahmed Geith of IBM Research, ATX. Unlike my previous exposures to computer architecture, CS352H didn’t just present Intel x86 as the ideal architecture handed down from on high. Instead we approached computer architecture from first principles and built a modern processor from the ground up. In my hobby work developing the toothpick assembler, I use the BatBridge architecture which we invented for CS352H as my testbed, because it has a simple MIPS-like instruction set and I’m familiar with its binary layout. In this series of posts I’m going to walk through computer architecture and the BatBridge ISA which we developed in class, showing several Clojure simulator implementations which will grow from the single cycle design most programmers are familiar with to the out-of-order, superscalar reality which many of us unknowingly use on a day-to-day basis.

This series serves a quintuple purpose. First of all, it is an exercise for myself in producing a reasonably written article series, a personal first. Second of all, it is designed to make use of the toothpick assembler, thus forcing me to develop that library through use. Third, to force myself to remember exactly how out of order processors work. Fourth, to make myself build a working OoOP simulator because the one I built in class never quite worked. And finally (and most importantly), this is my attempt to share some of the really cool stuff that hardware does to achieve the blinding speeds we all take for granted these days.

So, without further ado, let’s get started!

One of the classical assignments for a CS freshman, and a surprisingly recurring project in the programming community at large, is the implementation of a simple interpreter. Simple interpreters fetch an “instruction” by reading some resource, compare the value of that “instruction” to known constants, execute the indicated operation, and repeat the process until terminated or a halt condition is reached. This structure is referred to as a “single cycle” interpreter, because one instruction is handled in a single pass through the execution loop.

Great, but what does such a thing look like in hardware?

The stack machine is perhaps the most obvious model of a processor: operations taken from a linearly indexable sequence of instructions are applied to a stack of arguments. Each operation takes the top N (typically 2) values off of the stack and puts M (typically 1) results back on.

A stack machine will have the obvious arithmetic operations, but it also needs an unconditional push or load instruction which allows the stack to be populated with values; push will also typically allow copying a value from somewhere down the stack to the top. Stack machines also need a swap or exchange instruction which exchanges the top two values, and a pop operation which throws the top of the stack away. These last two operations are required because a stack machine quickly clutters the top of the stack with the right values in the wrong order; they allow the stack to be managed, keeping operands in the order the programmer wants.
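To make that stack discipline concrete, here is a toy stack-machine evaluator in Clojure. It is purely illustrative: the instruction names (`:push`, `:add`, `:swap`, `:pop`) are mine and aren’t part of any ISA discussed here, and the stack is just a list whose head is the top.

```clojure
;; Toy stack machine: the stack is a list whose head is the top.
;; Instruction names here are illustrative, not a real ISA.
(defn stack-step
  "Applies a single [op arg] instruction to a stack."
  [stack [op arg]]
  (case op
    :push (conj stack arg)              ; put a literal on top
    :add  (let [[a b & more] stack]     ; pop two, push their sum
            (conj more (+ a b)))
    :swap (let [[a b & more] stack]     ; exchange the top two values
            (conj more a b))
    :pop  (rest stack)))                ; discard the top of the stack

;; Computing (2 + 3), leaving the result on the stack:
(reduce stack-step '() [[:push 2] [:push 3] [:add]])
;; => (5)
```

Notice how even this tiny program already needs stack-management plumbing: to prepare operands in the right order you quickly end up interleaving `:swap` and `:pop` between the arithmetic ops.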

Unfortunately, stack machines aren’t fast even by the standards of single cycle processors, because a stack machine must access the stack at least twice every cycle: once to read operands and once to write results, on top of fetching the instruction itself. Because a stack must be able to grow, the stack resides at least in part in main memory, meaning that our admittedly slow single cycle processor must suffer the latency of memory three whole times in the best case! Since memory will typically run three orders of magnitude slower than the processor, a stack machine is likely wasting thousands of cycles on every instruction!

Dang, do we need something faster…

The register machine is the model of the processor most widely in use today. The intuition behind it is that a stack machine spends a lot of time managing the stack, which is largely processor time wasted. If operations took “parameters” (special fast storage addresses for values) rather than assuming that the parameters are the first N values of the stack, then we could escape this whole mess of swapping and exchanging to keep the values we want at the top of the stack! These special parameter storage addresses are called “registers”. While main memory may operate at a frequency thousands of times slower than the processor, the registers operate at processor speed. This means no more (or at least way less) waiting for that small set of values we actually want to compute with. When we have to access memory for reads or writes we are still forced to pay that painful, evil memory latency, but at least it’s now infrequent, because we can get enough data into the registers to compute for many cycles without having to read more data in.

Enough talk. Let’s build a register machine! In Clojure, of course. A processor is a hunk of state that we want to snapshot at the start and end of each execution cycle, and we also need instructions to operate over. This implies that we want a top-level function something like this:

Fig. 1 - The Big Loop

;; Define the Processor structure, and sketch some functions to
;; operate thereon 
;;--------------------------------------------------------------------
;; Processor:
;; {; basic state representation
;;  :registers -> {<name>  -> Integer}
;;  :memory    -> {Integer -> (U Instruction Integer)}
;;  :halted    -> Boolean
;;  ; these keys are set by the processor stages
;;  :fetch     -> vector Instruction
;;  :decode    -> map Instruction
;;  :execute   -> Writeback command
;; }

;; Note that this is a Von Neumann architecture machine, with a single
;; shared memory for both instructions and data. This could play hell
;; with our simulator, because we're going to cheat and use a symbolic
;; instruction representation for now. Eventually we'll transition to
;; the Toothpick assembler library and get all the simulators the same
;; standard byte code.

(defn step
  "Sequentially applies each phase of execution to a single processor
  state, returning an updated state representing a single cycle of
  computation's effects on the processor."
  [state]
  (-> state
      (fetch)
      (decode)
      (execute)
      (writeback)))

(defn -main
  "Creates a processor state with a map \"instructions\" from
  4-aligned addresses to instruction representations, and executes
  this state starting from memory address 0."
  [instructions]
  (loop [state {:memory instructions
                :registers {31 0}}]
    (if-not (halted? state)
      (recur (step state))
      state)))

This is awesome! It’s totally instruction-representation agnostic: we just need to define the halted? predicate which indicates when the machine has stopped, and the fetch, decode, execute and writeback operations. Then we’ll have a toy single cycle simulator. However, before we can define any of this we have to settle on a representation for instructions and the internal state of the machine.

My ISA

Instruction Set Architectures (ISAs) are the sets of instructions which a given machine (virtual or otherwise) can run. Because this model of a CPU is bound to a single “operation” every loop step, it would seem reasonable to try and do fairly sophisticated things in a single instruction; the VAX instruction set is famous for its “compute polynomial” single instruction. For reasons I’ll discuss next time, this style of computing fell out of favor quickly, but for now bear with me and assume that small is good.

On this basis of “small is beautiful” I’m going to present a machine with five operations: addition, subtraction, halt, memory read, and memory write. Since we’re working in Clojure, I’ll present this architecture in Clojure notation so that it’ll translate directly into the simulator above and allow me to keep this post short. ish.

My basic instruction representation shall be a tuple of five elements: a symbolic instruction, three registers, and a constant value.

Fig. 2 - The Big ISA Spec

 - The symbolic instruction shall be the human legible name of the
   instruction which the processor will execute.
   
 - The first register shall be the destination to which we will store
   the result. We shall make an exception in the case of the store
   operation: the first register will be the source of the value
   being stored.
   
 - The second and third registers shall be the locations of the
   arguments the processor will use for each operation. These must be
   register specifiers, symbolic or integer.
   
 - This machine shall have 32 registers, numbered 0 to 31.
 
 - Register 31 shall be the address for the _next_
   instruction. Writing to this register changes the address of the
   next instruction to be fetched. I shall refer to this value as
   `:r_PC` for clarity.
   
 - Register 0 shall read as the value 0. Always. We shall make a
   special case of writing to the zero register: the value written
   shall be printed as a character to standard output. This is a hack
   for simulator testing. I shall refer to this register as `:r_ZERO`
   for clarity.
   
 - Register 30 shall read as the value of the fifth instruction part,
   the numeric literal. I shall refer to this value as `:r_IMM` for
   clarity.
   
 - A single instruction shall be treated as though it had a width of
   32 bits or 4 bytes. This means that adding 4 to the PC skips the
   next instruction. Subtracting 4 jumps back to the current
   instruction, because the value of `:r_PC` is the address of the
   _next_ instruction, being the address of the current instruction
   plus 4.
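As a sanity check that these five fields really do fit in the specified 32 bits, here is one hypothetical packing: a 5-bit opcode, three 5-bit register fields (32 registers need exactly 5 bits each), and a 12-bit literal, for 5+5+5+5+12 = 32. To be clear, this layout and these opcode numbers are my own invention for illustration; the real BatBridge byte code comes later via Toothpick, and this sketch assumes the register fields are already plain integers.

```clojure
;; Hypothetical layout: [5b opcode][5b dst][5b srca][5b srcb][12b lit]
;; NOT the real BatBridge encoding; it only demonstrates that the
;; five-part tuple fits in one 32-bit machine word.
(def opcodes {:hlt 0, :add 1, :sub 2, :ld 3, :st 4})

(defn encode
  "Packs a symbolic instruction (with integer register fields) into a
  32-bit word, truncating the literal to 12 bits."
  [[op dst srca srcb lit]]
  (-> (opcodes op)
      (bit-shift-left 5)  (bit-or dst)
      (bit-shift-left 5)  (bit-or srca)
      (bit-shift-left 5)  (bit-or srcb)
      (bit-shift-left 12) (bit-or (bit-and lit 0xFFF))))
```

With this packing, `(encode [:hlt 0 0 0 0])` is simply `0`, and the opcode of any word can be recovered by shifting right 27 bits.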

Okay, so with those definitions we can write some sample opcodes:

Fig 3. - ISA Demo

;; halt - stops the processor and our simulator
[:hlt 0 0 0 0]

;; adds :r_IMM to itself, stores 10 to register 1
[:add 1 :r_IMM :r_IMM 5]

;; jumps back to this instruction, spinning forever
[:sub :r_PC :r_PC :r_IMM 4]

;; loads the address (+ a (* 4 x)) to register dst
;;    DST A   X   LIT
[:ld  1   2   3   0  ]

;; stores the value at SRC to the address (+ a (* 4 x))
;;    SRC A   X   LIT
[:st  1   2   3   0  ]

Still with me? Good. I didn’t manage to scare you off with that wall of specification nonsense. Now let’s bust out this interpreter and have some fun with it! We have four stages to implement for our single cycle design: fetch, decode, execute, and writeback. Let’s take it from the top with fetch. All fetch has to do is read an “instruction” from memory and load it into the :fetch key of the processor so that the decode operation knows what to look at.

Fig. 4 - Fetch

(defn fetch [processor]
  (let [pc (get-in processor [:registers 31])
        icode (get-in processor [:memory pc])]
    (-> processor
        (update-in [:registers 31] #(+ 4 %1))
        (assoc :fetch icode))))
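A quick REPL-style check that fetch behaves (the definition inlined from Fig. 4 so this snippet runs standalone): starting with a PC of 0, one fetch should load the instruction at address 0 and bump the PC to 4.

```clojure
;; fetch, inlined from Fig. 4 so this snippet runs standalone
(defn fetch [processor]
  (let [pc (get-in processor [:registers 31])
        icode (get-in processor [:memory pc])]
    (-> processor
        (update-in [:registers 31] #(+ 4 %1))
        (assoc :fetch icode))))

;; one instruction in memory, PC starting at 0
(def tiny-state {:registers {31 0}
                 :memory    {0 [:hlt 0 0 0 0]}})

(fetch tiny-state)
;; => PC (register 31) is now 4, and :fetch holds [:hlt 0 0 0 0]
```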

Right now our decode doesn’t have to do a whole lot. In a real processor the decode stage does some major heavy lifting to convert a word (for us, a hunk of four bytes) into an opcode and its required operands. For simple machines this can be as trivial as a hardwired breakout, but for “real” machines with variable-length instructions and variable instruction formats it gets hard and expensive in terms of hardware, fast. We’ll keep it simple.

Fig. 5 - Decode

(def register-symbol-map
  (-> (reduce (fn [m v]
                (assoc m v v))
              ;; range over all 32 registers so integer specifiers
              ;; 30 and 31 also resolve
              {} (range 32))
      (assoc :r_PC 31)
      (assoc :r_IMM 30)
      (assoc :r_ZERO 0)))

(defn decode
  "Dummy decode which translates a vector literal to an icode map. Not
  really important now, but it'll be nice later."
  [processor]
  (let [icode (:fetch processor)
        v (partial nth icode)]
    (assoc processor :decode
           {:icode (v 0)
            :dst   (register-symbol-map (v 1))
            :srca  (register-symbol-map (v 2))
            :srcb  (register-symbol-map (v 3))
            :lit   (v 4)})))
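Here’s decode in action (the machinery inlined so this snippet runs standalone; note that I build the symbol map over `(range 32)` so that integer specifiers 30 and 31 resolve too). Decoding a symbolic add should yield plain integer register numbers throughout.

```clojure
(def register-symbol-map
  (-> (reduce (fn [m v] (assoc m v v))
              {} (range 32))          ; all 32 integer register numbers
      (assoc :r_PC 31)
      (assoc :r_IMM 30)
      (assoc :r_ZERO 0)))

(defn decode
  "Translates a vector literal into an icode map."
  [processor]
  (let [icode (:fetch processor)
        v (partial nth icode)]
    (assoc processor :decode
           {:icode (v 0)
            :dst   (register-symbol-map (v 1))
            :srca  (register-symbol-map (v 2))
            :srcb  (register-symbol-map (v 3))
            :lit   (v 4)})))

(:decode (decode {:fetch [:add 1 :r_IMM :r_IMM 5]}))
;; => {:icode :add, :dst 1, :srca 30, :srcb 30, :lit 5}
```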

That wasn’t so bad; now we can fetch and decode an instruction! All that’s left is execute and writeback. Execute is going to be comparatively complicated, but since we’re separating execution from writeback, where many single cycle interpreters do them together, both operations should still be comparatively short. Execute will take the decode value, interpret the two source arguments to get literal values, perform the computation specified, and create a “writeback command” indicating which update writeback needs to perform.

Fig. 6 - Execute

(defn register->val [reg processor]
  (case reg
    (:r_ZERO 0) 0
    (:r_IMM 30) (:lit (:decode processor))
    (:r_PC 31)  ((:registers processor) 31)
    (get-in processor [:registers reg] 0)))

(defn execute [processor]
  (let [icode (:decode processor)
        srca (register->val (:srca icode) processor)
        srcb (register->val (:srcb icode) processor)
        dst  (:dst icode)
        icode (:icode icode)]
    (assoc processor :execute
           (case icode
             (:add :sub :ld)
             [:reg dst
              (({:add (fn [x y _] (+ x y))
                 :sub (fn [x y _] (- x y))
                 :ld  (fn [x y s]
                        (get-in s [:memory (+ x (* 4 y))]))}
                icode)
               srca srcb processor)]
             
             ;; store the *value* in the dst register, not its number
             (:st)  [:mem (+ srca (* 4 srcb)) (register->val dst processor)]
             (:hlt) :halt
             ;; case's default clause: any other icode is an error
             (->> icode
                  (str "Unsupported icode: ")
                  (Exception.)
                  (throw))))))

Okay. That was somewhat horrific, and no doubt I’ll clean it up some for readability. The trick to this execute implementation is that I use a map from icode symbols to functions in order to perform the actual computation. In the future we’ll clean that out, but for now it exists inline and is sorta messy. This brings us to the final stage: writeback!

Fig. 7 - Writeback

(defn writeback [processor]
  (let [directive (:execute processor)]
    (cond ;; special case to stop the show
          (= :halt directive)
            (assoc processor :halted true)
          
          ;; special case for printing
          (and (= :reg (first directive))
               (= 0 (second directive)))
            (do (print (char (nth directive 2)))
                processor)

          :else
            (assoc-in processor [({:mem :memory
                                   :reg :registers}
                                  (first directive))
                                 (second directive)]
                      (nth directive 2)))))
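Writeback only ever sees one of three directive shapes: `:halt`, `[:reg <n> <val>]`, or `[:mem <addr> <val>]`. Here’s a standalone check (writeback inlined from Fig. 7, with cond’s catch-all spelled `:else`) that a register directive lands where it should:

```clojure
;; writeback, inlined from Fig. 7 so this snippet runs standalone
(defn writeback [processor]
  (let [directive (:execute processor)]
    (cond (= :halt directive)               ; special case to stop the show
            (assoc processor :halted true)

          (and (= :reg (first directive))   ; writes to r0 print instead
               (= 0 (second directive)))
            (do (print (char (nth directive 2)))
                processor)

          :else
            (assoc-in processor [({:mem :memory
                                   :reg :registers}
                                  (first directive))
                                 (second directive)]
                      (nth directive 2)))))

;; a register-write directive lands in :registers
(writeback {:registers {} :execute [:reg 2 42]})
;; => {:registers {2 42}, :execute [:reg 2 42]}
```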

And a couple more snippets for stuff I forgot along the way…

Fig. 8 - Random snippets

(defn halted?
  "The predicate used to make the simulator stop."
  [state]
  (:halted state))

(defn seq->instrs
  "Creates a map from the instruction sequence, aligning the
  instructions sequentially at an offset of 4 starting at 0."
  [instrs]
  (zipmap (range 0 (* 4 (count instrs)) 4)
          instrs))

(defn bprn
  "Printing on this machine is a sequence of additions storing to the
  zero register, where each sum is the character code of one character
  of the string. This function computes that sequence of additions for
  any input string."
  [s]
  (mapv (fn [x] [:add :r_ZERO :r_IMM 0 (int x)])
        s))
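Putting the two helpers together (definitions repeated here so the snippet runs standalone): bprn turns each character into a print-via-r0 addition, and seq->instrs lays the results out at 4-byte offsets.

```clojure
(defn seq->instrs
  "Aligns instructions at 4-byte offsets starting from address 0."
  [instrs]
  (zipmap (range 0 (* 4 (count instrs)) 4)
          instrs))

(defn bprn
  "One print-via-r0 addition per character of the input string."
  [s]
  (mapv (fn [x] [:add :r_ZERO :r_IMM 0 (int x)])
        s))

(seq->instrs (bprn "Hi"))
;; => {0 [:add :r_ZERO :r_IMM 0 72], 4 [:add :r_ZERO :r_IMM 0 105]}
```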

Totally wicked! Now let’s define an endless hello world just to prove that we can do arithmetic and print:

Fig. 9 - Demo run!

user> (-main
       (seq->instrs
        (conj (bprn "Hello, World!\n")
              [:add :r_PC :r_ZERO :r_ZERO 0])))

Hello, World!
Hello, World!
Hello, World!
Hello, World!
Hello, World!
Hello, World!
Hello, World!
Hello, World!
Hello, World!
Hello, World!
Hello, World!
Hello, World!
Hello, World!
Hello, World!
Hello, World!
Hello, World!
Hello, World!
Hello, World!
Hello, World!
Hello, World!
Hello, World!
Hello, World!
Hello, World!
Hello, World!
Hello, ^c
user>

HAHA VICTORY IS MINE! ahem. Ours. I meant ours. Well, that was pretty cool… we’ve got a fully working single cycle interpreter for something that reasonably approximates assembly language. It’s a little abstracted (you could even argue that I’m cheating) in that it’s already prepared for a binary instruction format and for pipelining.

For those of you who want to mess with this, the source code is pasted here on refheap, and I’ve also got the code over here on github if you want to watch how it evolves over the next couple of posts.

However, what I’ve presented here is nothing you haven’t seen before, and it’s entirely useless: I have no control structure besides unconditional jumps yet! Otherwise I would have given a demo of computing the 42nd term of the Fibonacci sequence… :-/. Next time we’ll move on to the pipelined architecture and the really interesting ways that pipelined processors complicate this simple picture. We’ll also soup up this architecture with some cool stuff like a more realistic instruction representation (binary bytecode!) and a richer instruction set.

^d