Toothpick: A theory of bytecode machines

In working on my Batbridge simulators and other adventures in computing hardware, one of the problems I have repeatedly encountered is the need to generate bit encodings of symbolic instructions. When working with "well known" architectures with existing tooling support, one can usually find an "off the shelf" assembler either in the GCC collection or elsewhere. However when you're working against a custom instruction set you're on your own for tooling.

So lets invent yet another assembler!

An assembler is simply a program that describes how to construct a binary instruction encoding from a set of symbolic human-writable mnemonics representing data manipulating directives to computing hardware. add, sub[tract], mul[tiply] and jump are good examples of such mnemonics. Each operation which a given "machine" (hardware/software interpreter) will accept is typically specified as a sequence of bits and bit encoded fields (sometimes abbreviated in octal or hexadecimal notation). So Batbridge for instance specifies that the add instruction is the bit string 110000tttttaaaaabbbbbiiiiiiiiiii where the prefix 110000 or 0x30 indicates the addition operation, the bits named t encode a number t∈[0..30] being the register to which the result of the addition will be stored, the bits named a subject to the same constraints as t name the register from which the "left" operand will be drawn and the b bits name the "right" operand same as the t and a operands. i is an arbitrary signed 10 bit integer (sign is the 11th bit).

Sitting down with a specification, you or I could construct a set of functions from values appropriately constrained to bit encoded instructions as laid out in an instruction set specification. However in doing such an assembler implementation, you will discover that you are simply encoding in the programming language of your choice a translation table laid out in full by the instruction set documentation. However realizing that the assembler which we are constructing by hand represents a printer to the bytecode machine's reader, we can think of bytecodes as a language in the same sense that any other set of strings and productions constitutes a language.

A Theory of Bytecode Machines

It happens that we have good formal theories about languages and their structures. If we stop and squint, we can see that the individual opcodes are analogous to terminals or verbs. If we allow ourselves to model opcodes as words in a language, then a program (a production of words) is a sentence for which we can establish a grammar. For instance a ELF formatted program could be considered to be a sequence of blocks of instructions (verbs) defined in terms of words (opcodes) and formatted with an appropriate header/footer (the ELF file structure). As these are all linguistic constructs, a program capable of generating these values must be a direct production of the specification which lays out these productions in the same way that a lex/yacc lexer parser pair is a mechanical production of the lexing and parsing rules which define a computer language.

This leads to the idea that, just as we now rarely write parsers and lexers preferring to derive them automatically from specifications of the languages we wish them to process since a as suggested above a bytecode is simply a language on bit strings rather than ASCII or Unicode characters the consequent that we can use the same class of tools to generate assemblers and disassemblers as we use to generate parsers and printers should be obvious. We just need to construct formal languages describing the instructions our machines accept.

An Assembler Generator

Just as we have theories about "interesting" sets of languages and the parser techniques which may be applied to efficiently recognize and process them, so too rather than being able to automatically assemble arbitrary programs to arbitrary bytecodes we must first recognize that as a bytecode is a language on bit strings and as there exist languages which require a Turing machine to recognize in addition to uncomputable languages consequently one could invent an "interesting" (but I will argue in a minute silly) bytecode for which no assembler or recognizer can be automatically generated.

So what features define "interesting" bytecodes? In hardware terms as I discuss elsewhere memories and lookup tables are just about the most expensive thing you can build in terms of chip area and time. As a result, the overwhelming majority of computer architectures are not what I shall call "micro-programmable", that is users cannot give meaning to new sequences of bits and rather interact with the computer by chaining together a few bit-verbs or opcodes which are built into the chip. There do exist such micro-programmable computers, FPGAs, but they tend to be special purpose prototyping machines and are not in wide usage compared to fixed instruction set machines.

Another characteristic of "interesting" bytecodes is in the name, bytes or words. Many years have gone by since bit addressed machines were last built with commercial intent. Instead commercial machines and off the shelf memories tend to address "blocks" of bits or "lines" being strings a power of two in length which may be divided into substrings smaller powers of two in length as the user may see fit at no additional hardware complexity cost by simply selecting bits through selecting predefined subdivisions of lines known as "words" or using bit masking and shifting to select strings smaller than a single word. Strings larger than a single word must be dealt with by using multiple operations on words.

This theory of a fixed word size, fixed instruction set machine characterizes the overwhelming majority of instruction sets and machines designed and built by Intel, ARM, Symbolics, Thinking Machines, and others over the past decades due mainly to the ease of implementation and efficiency of such designs.

So what does this theory about a "useful" bytecode machine buy us? It implies that we can describe a "useful" machine as a bitstring length denoting word size. Endianness must also be considered. Then given an enumeration of the various opcodes and how to generate them as bitstrings, you can completely describe an "interesting" instruction set as characterized above.

Assemblers are simply programs that deal with translating instructions set mnemonics writable by a programmer or more often than not a compiler to bytecodes which a machine can execute. The details of computing label addresses are entirely for the most part independent of the specific architecture in so much as they are determined by properties of the specific target bytecode described above. This suggests that there must exist a function

(assemble p ← Platform c ← Codes)

Which makes sense if you consider a single target assembler to be the partial application of such an assembler to a platform.

Talk is cheap, show me the code

First things first, we need a way to represent an "interesting" instruction set.

(deftype Architecture
    "A structure representing an \"interesting\" bytecode architecture
    from which a parser or a generator can be computed."
  [word-width         ;; ← Num, how many bits in a word
   pointer-resolution ;; ← Num, how many bits forwards does *(x+1) go
   opcodes            ;; ← Map [Memonic → Opcode], opcode formatting rules
   ])

So what do individual opcodes look like? We're gonna need some bit vector formatting engine to do real code bytecode generation with.

I will define an opcode to be a bitstring of fixed length (this is important, there are instruction sets with variable length "opcodes" however as the different length forms have different operational semantics I shall define them to be different operations and declare the instruction set architects silly people) composed of a sequence of concatenated parameters or fields and constants which may be either signed or unsigned. Fields are named or anonymous. The set of named field names in a single opcode must have no repetitions. We want to be able to write something like the following

;; IFLT 0x20  100000 _____ aaaaa bbbbb iiiiiiiiiii
;; execute the next instruction IFF (< a b)
(opcode :iflt
  (const-field            :icode 6  0x20)
  (enforced-const-field   :_     5  0)
  (parameter-field        :a     5  register?)
  (parameter-field        :b     5  register?)
  (signed-parameter-field :i     11 literal?))

So in this example instruction from Batbridge, we define an instruction as the ordered bit string where the bits [0..5] are named :icode and fixed with the constant value of 0x20. We then have an anonymous field 5 bits long taking the bits [6..10] which we force to have the value of 0 since they are defined to be ignored by the instruction set. We then have a pair of (unsigned) 5-bit registers, which must satisfy some guard predicate ensuring that the value in question fits within the value domain which the instruction set allows for registers. Finally we have a signed field 11 bits long, which again is guarded with a value domain predicate. This notation encodes the bit format string, as well as the value domains for each element of the bit format string in a form that can be easily inspected by either an assembler or a disassembler program.

So what does that expression build out to as a data structure?

{:width 32,
 :params (:_ :a :b :i),
 :fields ({:offset 26,
           :width 6,
           :name :icode,
           :type :const,
           :value 35}
          {:offset 21,
           :width 5,
           :name :_,
           :type :enforced-const,
           :value 0}
          {:offset 16,
           :width 5,
           :name :a,
           :type :unsigned-field,
           :pred #<batbridge$register_QMARK_ toothpick.isa.batbridge$register_QMARK_@4bd4095>}
          {:offset 11,
           :width 5,
           :name :b,
           :type :unsigned-field,
           :pred #<batbridge$register_QMARK_ toothpick.isa.batbridge$register_QMARK_@4bd4095>}
          {:offset 0,
           :width 11,
           :name :i,
           :type :signed-field,
           :pred #<batbridge$literal_QMARK_ toothpick.isa.batbridge$literal_QMARK_@38334886>})}

Given this datastructure, and a sequence (opcode . params), we can trivially compute a map (zipmap (:params (get isa opcode)) params), and then fold right over the fields of the instruction computing each field then shifting and anding it into place in the bit vector to construct the bit encoding of the instruction.

As implemented this is a little fragile because it assumes that the computer running the assembler assembler has a larger or equal integer size than the target, as attempting to construct an int aka bit vector that's too large will yield... interesting issues. Future fixes are to use a real sequence of bits, or to rewrite this logic in terms of multiplication by two and addition.

Deriving a parser for this structure is also (fairly) straightforwards, one simply can simply generate an alternation matcher on bitstring matchers matching the individual instructions. "interesting" ISAs tend to put the opcode specification in the "low" bits of each instruction word so as a we can simply generate a jump table by masking on the least significant bits but that's not implemented yet.

Also this representation falls into jneen's type tagging pitfall, but this is old code and implementation details thereof so I'll forgive myself and fix it later.

We can then define an "architecture" by constructing a composite map as above having a word size and pointer interval specification and a map from keywords naming instructions to many such instruction maps.

So, we can describe an instruction stream as a sequence of instructions, and we can assemble individual instructions by interpreting from their representation above, so if we wanted to encode a sample program

[[:label :top]
 [:op [:add [:param [:register 0]]
            [:param [:const 0]]
            [:param [:const 1]]]]
 [:label 1]
 [:op [:add [:param [:register 1]]
            [:param [:const 0]]
            [:param [:const 2]]]]
 [:label 2]
 [:op [:add [:param [:register 0]]
            [:param [:register 0]]
            [:param [:register 1]]]]
 [:label 3]]

We can simply foldr a label location computing operation since all of our opcodes are defined to be fixed length even if they are abstractly variable length, then in a second pass we assemble each instruction against the label location context and the instruction set, giving us a sequence of bytecodes. Done! ∎

This is essentially my Toothpick project, which seeks to provide exactly this meta-assembler facility and has already proven its worth to me in automating the generation of assemblers for the various toy architectures with which I've worked at school. It's not done yet and as noted above it needs some spit and shoeshine but I think it represents a remarkably simple and powerful idea.

The real power of this approach is that, extended, this can enable the automatic generation of LLVM backends! If an instruction specification were to include its operational semantics written in terms of LLVM instructions one could imagine a trivial derived LLVM backed which sequentially scans emitted LLVM assembly matching out the platform translation of the instruction "at point" and then assembling the resulting instruction stream as above. Since LLVM already provides a massively portable C compiler infrastructure, this means that given only a formal platform specification one could construct a mediocre C compiler only from the platform specification.

To take this idea even further, one could also imagine generating a platform simulator from either symbolic instructions or bit encoded instructions from only the augmented ISA with semantics instructions.

Generating muxing logic in Verilog or VHDL should also be trivial from this structure.

Generate all the things! Correctness by specification & construction! Toy architectures for everyone!

^d