Building a cache hierarchy
04 Jan 2014

Previously, we progressed from a single cycle stack machine to a register machine in an attempt to escape the performance penalties brought upon us by frequent memory access. On this basis we reinvented the register machine as a mechanism which allows a programmer to escape the latency of accessing DRAM whenever possible, by manipulating a small “working set” of data in “registers” without reading and writing to memory.
However, before we do anything about this, let’s start by introducing the latency suck which is memory access to our simulator! That’s right kids, I’m about to deliberately make my code run slower :P Don’t Try This At Home™. In the single cycle processor I used the symbol `batbridge.common/get-memory` to read a value from the memory state of the processor. The definition of this symbol was pretty simple:
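Something like this, to give a concrete (if approximate) picture; the assumption here is that the processor state keeps its memory as a plain map under a :memory key, which is an illustrative choice rather than the exact original:

```clojure
(defn get-memory
  "Fetches the value stored at `addr` in the processor state's memory map.
   Uninitialized addresses read as 0."
  [processor addr]
  (get-in processor [:memory addr] 0))
```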
The trouble with this is that while it works perfectly well as a simulation mechanism, it introduces exactly no memory access penalty! Doing scientific measurements on my dev box, a single iteration of the single cycle processor simulator takes about 0.306ms. According to Wikipedia, in 1998 a memory read could take on average 50ns. In 1998 top of the line processors ran at 300MHz, or a cycle time of about 3.3ns, so let’s add a delay of 3ms to our memory access and call it to scale. [edit] Since writing this, an awesome source for approximate latency numbers has appeared!
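A sketch of the delayed version, assuming we just sleep for 3ms (Thread/sleep works in milliseconds) before touching the memory map:

```clojure
(defn get-memory
  "As before, but sleeps 3ms first to stand in for DRAM read latency."
  [processor addr]
  (Thread/sleep 3)
  (get-in processor [:memory addr] 0))
```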
And for good measure I’ll throw the same 3ms delay on the `write-memory` operation, just to drive home the point that, within the bounds of our simulator, memory access is evil.
So memory sucks, and our simulator is now paying the price. Every cycle this processor gets hammered first by accessing memory to read an instruction, and then in the worst case by accessing it again for either a memory read or write! This means that our cycle time is at least about 3.3ms, but could be as bad as 6.3ms in the double access case. What can we do about this?
What if we built a large-ish cache, which stores say a 64 entry subset of main memory? While we introduce some added main memory read and write latency due to having to interact with a cache controller, for the small working set of values in the cache our access latency can be greatly decreased! In fact, our access time decreases in proportion to the fraction of the time that we “hit” in our small cache storage. Say that the cache is 10x faster than main memory thanks to being smaller, giving it a 0.3ms access time. This means that if, say, we had a cache for all the instructions (which we have to fetch every cycle!), our threshold minimum cycle time is cut from 3.3ms to about 0.6ms: the 0.3ms cached fetch plus the 0.3ms of actual compute!
So let’s take a real quick first cut at a reasonable cache model. We can only achieve this factor of 10 speedup in memory access when we can read data from the cache rather than from main memory. This means that the performance of the simulator hinges in no small part on the frequency with which we can access the cache rather than main memory, so we must be judicious in what we choose to store in the cache.
Cache control strategies
Caching is its own entire research field in computer architecture. Modern processors typically use a three level cache to make main memory access faster, and feature special caches for frequently accessed structures such as the address range in which instructions are stored. In all this, cache management is absolutely critical as ensuring that the data with which the processor will interact is cached can represent a factor of 10 speedup or more as demonstrated earlier.
One classic cache management algorithm is the “least recently used” pattern. In LRU caches, the cache is filled as values which are not in the cache are accessed. When the cache is full and a new value which is not in the cache is accessed, the value in the cache which has not been used for the longest is declared dead and “evicted” in favor of the new cache value.
Another highly effective and simple strategy is least frequently used, which when the cache is full identifies the value which has been used least and removes that from the cache. This is the algorithm which I will implement as it achieves slightly better results than LRU.
Cache simulation
Now, let’s add a simple cache to the simulator. Previously, our memory was a simple map from addresses to values. Now, let’s define a cache structure:
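Something along these lines, where the exact key names (:latency, :size, :contents and :next) are illustrative choices, not a fixed API:

```clojure
(def example-cache
  ;; A cache level is just a map. :next points at the next, slower level,
  ;; which is what makes the structure recursive.
  {:latency  1          ;; simulated access delay in ms
   :size     64         ;; maximum number of resident entries
   :contents (atom {})  ;; address -> {:value v, :uses n}
   :next     {:latency  100
              :size     nil     ;; nil size: unbounded, i.e. main memory
              :contents (atom {})
              :next     nil}})
```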
Note that this cache structure is recursive. That is, it allows us to nest several caches! This means that we can play Russian dolls with caches, hiding smaller, faster caches underneath larger, slower caches. Why is this valuable? Because if we can manage the caches well, it means that we need never incur the full latency of accessing main memory. Say we had a three tier cache system. The first tier takes 1ms to access, the second 10ms and the 3rd (main memory) takes 100ms. If each cache has a 90% hit rate, that is we do a good job of managing our caches’ contents and only rarely have to go outside a cache, what is our average access latency?
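Working it through, under the simplifying assumption that each level catches 90% of the accesses that reach it and that an access pays only the latency of the level where it finally hits:

0.90 × 1ms + 0.09 × 10ms + 0.01 × 100ms = 0.9ms + 0.9ms + 1.0ms = 2.8ms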
On average, we only need to spend 2.8ms accessing a value, no matter where it is in our system’s memory! Compared to the 10ms and 100ms times of the outer memories that is smoking fast! Now, this depends on having a sufficiently large cache that a 90% hit rate is possible, and on having a cache controller capable of achieving this kind of efficiency.
Okay, so what do we do when we find a value in the cache? We need to inform the cache that the value was used, and that it should be kept in the cache.
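A sketch, assuming the :contents atom maps addresses to {:value v, :uses n} entries as above:

```clojure
(defn cache-hit!
  "Bumps the use count for `addr`, telling the LFU policy that this
   entry has earned its place in the cache."
  [cache addr]
  (swap! (:contents cache) update-in [addr :uses] (fnil inc 0)))
```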
Not too bad… now we need a mechanism for choosing an element (called a victim) to remove from the cache when the cache gets full.
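Under the LFU policy the victim is simply the entry with the lowest use count; a sketch:

```clojure
(defn cache-victim
  "Picks the address of the least frequently used entry: the victim."
  [cache]
  (first (apply min-key (fn [[_addr entry]] (:uses entry))
                @(:contents cache))))

(defn cache-evict!
  "Throws the current victim out of the cache to make room."
  [cache]
  (swap! (:contents cache) dissoc (cache-victim cache)))
```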
Okay, so we have a base case for removing things from the cache, but we need to be able to insert values into the cache first:
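A sketch of installation, again assuming the entry shape above; new entries start with a use count of 1:

```clojure
(defn cache-full?
  "True when this level has a size limit and has reached it."
  [cache]
  (boolean (and (:size cache)
                (>= (count @(:contents cache)) (:size cache)))))

(defn cache-install!
  "Inserts `value` at `addr` with a use count of 1, evicting the least
   frequently used entry first if this level is already full."
  [cache addr value]
  (when (cache-full? cache)
    (cache-evict! cache))
  (swap! (:contents cache) assoc addr {:value value :uses 1}))
```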
Now how do we access a value in a cache? We have to incur the cache latency, see if the cache contains a value, and if not search the next cache recursively. When we find the key, we have to call `cache-hit!`, and otherwise we have to call `cache-install!` to promote the hit value into this cache. What happens when we miss a value like this? Yep, it gets promoted from the bottom most cache layer all the way to the top cache with a score of 1 at every layer.
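Putting the pieces together, a sketch of the recursive read; treating a miss at the bottom level as an uninitialized address that reads as 0 is my assumption:

```clojure
(defn cache-read!
  "Reads `addr` through the hierarchy. Each level visited charges its
   latency; a hit bumps the entry's use count, a miss recurses into the
   next level and then promotes the value into this one with a count of 1."
  [cache addr]
  (Thread/sleep (:latency cache))
  (if-let [entry (get @(:contents cache) addr)]
    (do (cache-hit! cache addr)
        (:value entry))
    (let [value (if-let [next-level (:next cache)]
                  (cache-read! next-level addr)
                  0)] ;; fell off the bottom: uninitialized memory reads as 0
      (cache-install! cache addr value)
      value)))
```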
Now we just need a way to write values back into the cache. For now, we will assume that we only have to incur the write latency to the first cache layer. Because we are simulating a single core processor, this assumption is valid. However, we will see how it changes once we bring multiple processors into play at some future time.
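One way to read that is as a write-through which charges only the first layer’s latency and updates the lower layers for free; that interpretation is mine, so treat this as a sketch rather than the canonical scheme:

```clojure
(defn cache-write!
  "Writes `value` at `addr`. Only the first layer's latency is charged;
   lower layers are updated without further delay so that evicting the
   entry from the top cache can't silently lose the write."
  [cache addr value]
  (Thread/sleep (:latency cache))
  (loop [level cache]
    (when level
      (cache-install! level addr value)
      (recur (:next level)))))
```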
Now this cache data structure is anything but simple, because it is a nested map of maps and atoms and so forth. I don’t want to build one of these by hand and neither do you, so a quick helper to let us just indicate the specifications of the cache and thus generate the cache data structure is very much in order.
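A sketch of such a helper, taking a seq of {:latency, :size} specs ordered fastest-first; the 512-entry second level in the example below is an arbitrary illustrative size, not one from the original design:

```clojure
(defn make-cache-hierarchy
  "Builds a nested cache from a seq of {:latency ms, :size n} specs ordered
   fastest-first. Omitting :size in the last spec gives an unbounded level
   that stands in for main memory."
  [specs]
  (reduce (fn [inner {:keys [latency size]}]
            {:latency  latency
             :size     size
             :contents (atom {})
             :next     inner})
          nil
          (reverse specs)))

;; For example, the three tier system from above: 1ms, 10ms and 100ms levels.
(def memory-hierarchy
  (make-cache-hierarchy
    [{:latency 1   :size 64}
     {:latency 10  :size 512}
     {:latency 100}]))
```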
Wrap up
Thus armed with a cache, all we have to do is wire it in behind the scenes in `batbridge.common/get-memory` and `batbridge.common/write-memory`. Now our single cycle processor will correctly simulate a simple cache hierarchy, both in terms of cache consistency, cache eviction and access time! This cache control algorithm could be smarter, and we have no doubt slowed down the simulator as a whole, but it’s all in the name of realism and learning so go with it :P Next time we’ll take advantage of caching in building an even faster machine: a pipelined computer!
^d