Building a cache hierarchy

Previously, we progressed from a single cycle stack machine to a register machine by attempting to escape the performance penalties brought upon us by attempting to access memory frequently. On this basis we reinvented the register machine as a mechanism which allows a programmer to escape the latency of accessing DRAM whenever possible by allowing for the manipulation of a small "working set" of data in "registers" without reading and writing to memory.

However, before we do anything about this, lets start introducing the latency suck which is memory access to our simulator! That's right kids, I'm about to deliberately make my code run slower :P Don't Try This At Home™ In the single cycle processor I used the symbol batbridge.common/get-memory to read a value from the memory state of the processor. The definition of this symbol was pretty simple:

(defn get-memory
  "Accesses an address in a processor state's memory, returning the
  value there. Defaults to 0, so the processor will halt if it jumps
  into unset instructions."

  [p addr]
  (get-in p [:memory addr] 0))

The trouble with this is that while it works perfectly as a simulation mechanism it introduces exactly no memory access penalty! Doing scientific measurements on my dev box, a single iteration of the single cycle processor simulator takes about 0.306ms to execute a single cycle. According to wikipedia, in 1998 a memory read could take on average 50ns. In 1998 top of the line processors ran at 300MHZ, or a cycle time of 33ns, so lets add a delay of 3ms to our memory access and call it to scale. [edit] Since writing this, an awesome source for approximate latency numbers has appeared!

(defn get-memory
  "Accesses an address in a processor state's memory, returning the
  value there. Defaults to 0, so the processor will halt if it jumps
  into unset instructions."

  [p addr]
  (Thread/sleep 3)
  (get-in p [:memory addr] 0))

And for good measure I'll throw the same 3ms delay on the write-memory operation just to drive the point that memory access is evil home within the bounds of our simulator.

So memory sucks, and now our simulator is paying the price now. Every cycle this processor gets hammered first by accessing memory to read an instruction, and then in the worst case by accessing it again for either a memory read or write! This means that our cycle time is at least about 3.3ms, but could be as bad as 6.3ms in a double access case. What can we do about this?

What if we built a large-ish cache, which stores say a 64 entry subset of main memory? While we introduce some added main memory read and write latency due to having to interact with a cache controller, for the small working set of values in the cache our access latency can be greatly decreased! In fact, our access time decreases in proportion to the fraction of the time that we "hit" in our small cache storage. Say that the cache is 10x faster than main memory thanks to being smaller. This means that if say we had a cache for all the instructions (which we have to fetch every cycle!) that our threshold minimum cycle time is cut from 3.3ms to 0.6ms!

So lets take a real quick first cut at a reasonable cache model. We can only achieve this factor of 10 speedup in memory access when we can read data from the cache rather than from main memory. This means that the performance of the simulator hinges in no small part on the frequency with which we can access the cache rather than main memory, so we must be judicious in what we choose to store in the cache.

Cache control strategies

Caching is its own entire research field in computer architecture. Modern processors typically use a three level cache to make main memory access faster, and feature special caches for frequently accessed structures such as the address range in which instructions are stored. In all this, cache management is absolutely critical as ensuring that the data with which the processor will interact is cached can represent a factor of 10 speedup or more as demonstrated earlier.

One classic cache management algorithm is the "least recently used" pattern. In LRU caches, the cache is filled as values which are not in the cache are accessed. When the cache is full and a new value which is not in the cache is accessed, the value in the cache which has not been used for the longest is declared dead and "evicted" in favor of the new cache value.

Another highly effective and simple strategy is least frequently used, which when the cache is full identifies the value which has been used least and removes that from the cache. This is the algorithm which I will implement as it achieves slightly better results than LRU.

Cache simulation

Now, lets add a simple cache to the simulator. Previously, the cache was a simple map from memory addresses to values. Now, lets define a cache structure:

;; caches are maps:
{:map     (Atom {:store {<address> <value>}
                 :meta    {<address> <meta>}
                 :size    <value>})
 :latency <value>
 :else    (U <cache> <None>)
 }

Note that this cache structure is recursive. That is, it allows us to nest several caches! This means that we can play Russian dolls with caches, hiding smaller, faster caches underneath larger, slower caches. Why is this valuable? Because if we can manage the caches well it means that we need never incur the full latency of accessing main memory. Say we had a three tier cache system. The first tier takes 1ms to access, the second 10ms and the 3rd (main memory) takes 100ms. If each cache has a 90% hit rate, that is we do a good job of managing our caches contents and only rarely have to go outside of a cache, what is out average access latency?

user> (+ (* 0.9 1)
         (* 0.1 (+ (* 0.9 10)
                   (* 0.1 100))))
2.8000000000000003

On average, we only need to spend 2.8ms accessing a value, no matter where it is in our system's memory! Compared to the 10ms and 100ms times of the outer memories that is smoking fast! Now, this depends on having a sufficiently large cache that a 90% hit rate is possible, and on having a cache controller capable of achieving this kind of efficiency.

Okay, so what do we do when we find a value in the cache? We need to inform the cache that the value was used, and that it should be kept in the cache.

(defn ninc [x] (if x (inc x) 1))

(defn cache-hit!
  "Executes the cache metadata side-effects associated with hitting in
  a cache. In this case we just inc the metadata value for that
  particular key. Note that this operation has no meaningful return
  value and is entirely for side-effects upon the cache atom."

  [cache key]
  (-> (:map cache)
      (swap! update-in [:meta key] ninc)))

Not too bad... now we need a mechanism for choosing an element (called a victim) to remove from the cache when the cache gets full.

(defn cache-evict!
  "Executes the cache metadata side-effects associated with missing an
  element in a cache. In this particular implementation we select the
  minimum key/value pair in the metadata map and evict that key from
  both the metadata and the value maps. Note that this operation
  returns _no meaningful value_ and is entirely for side-effects upon
  the cache atom."

  [cache]
  (let [atom (:map cache)
        {:keys [store meta size]} @atom
        victim (->> (map-invert store) ;; reverse the mapping to be val -> key
                    (keys)
                    (apply min)
                    (get store))]
    (when (>= (count store) size)
      (do (swap! atom update-in [:store] dissoc victim)
          (swap! atom update-in [:meta] dissoc victim)))))

Okay, so we have a base case for removing things from the cache, but we need to be able to insert values into the cache first:

(defn cache-install!
  "Installs a key/value pair in a cache, performing the appropriate
  side-effects. This function is entirely for side-effects and returns
  no meaningful value."
  [cache key v]
  (swap! (:map cache) update-in [:store] assoc key v)
  (swap! (:map cache) update-in [:meta] assoc key 1))

Now how do we access a value in a cache? We have to incur the cache latency, see if the cache contains a value and if not search the next cache recursively. When we find the key, we have to call cache-hit!, and otherwise we have to call cache-install! to promote the hit value into this cache. What happens when we miss a value like this? Yep, it gets promoted from the bottom most cache layer all the way to the top cache with a score of 1 at every layer.

(defn cache-get!
  "Recursively fetches a value from nested caches, incurring access
  latency as it scans the nested caches."
  
  [cache key]
  (Thread/sleep (get cache :latency 0)) ;; incur read latency
  (if (contains? (:store @(:map cache)) key)
      ;; cache hit case
      ;;-------------------------------------------------
      (do (cache-hit! cache key)
          (get (:store @(:map cache)) key))

      ;; cache miss case
      ;;-------------------------------------------------
      (let [v (if (:else cache)
                (cache-get! (:else cache) key)
                0)]
        (cache-install! cache key v)
        v)))

Now we just need a way to write values back into the cache. For now, we will assume that we only have to incur the write latency to the first cache layer. Because we are operating in a single core processor, this assumption is valid. However we will see how it changes once we bring multiple processors into play at some future time.

(defn cache-write!
  "Writing to the cache sucks. However for now it doesn't suck too
  hard because we can get away without flushing the cache to main
  memory. This means that a cache write is a simple update in place
  with a cache hit. Note that this function is entirely for
  side-effects and returns no meaningful value."
  
  [cache key v]
  (Thread/sleep (get cache :latency 0))
  (-> (:map cache)
      (swap! update-in [:store] assoc key v))
  (when (:else cache)
    (cache-write! (:else cache) key v)))

Now this cache data structure is anything but simple, because it is a nested map of maps and atoms and so forth. I don't want to build one of these by hand and neither do you, so a quick helper to let us just indicate the specifications of the cache and thus generate the cache data structure is very much in order.

(defn make-cache-hierarchy
  "Builds a cache hierarchy from a pairs [latency, size]. The result
  is a nested set of empty cache maps nested in pair order.

  Ex. a sequence [[1 8] [2 16]] would build a cache of latency 1 and
  size 8 with an else cache of latency 2 and size 16."
  
  [l-size-pairs]
  (->> l-size-pairs
       (map (fn [[l size]]
              {:map (atom {:meta {} :store {} :size size})
               :latency l}))
       (reverse)
       (reduce (fn [prev c]
                 (assoc c :else prev)))))

Wrap up

Thus armed with a cache, all we have to do is wire it in behind the scenes in batbridge.common/get-memory and batbridge.common/write-memory. Now our single cycle processor will correctly bot in terms of cache consistency, cache eviction and access time simulate a simple cache hierarchy! This cache control algorithm could be smarter and we have no doubt slowed down the simulator as a whole, but it's all in the name of realism and learning so go with it :P. Next time we'll take advantage of caching in building an even faster machine: a pipelined computer!

^d

Tags