Assist threads

It’s been a while since I’ve written about hardware proper rather than just lisp machine weirdness, but I’m doing research reading in computer architecture this semester so that’s gotta change.

Today’s subject is the caches and assist threads. We are at a crossing point in the history of computer architecture design. Transistors are still getting smaller, but the end of the line is in sight and it has been clear for some time that the next big performance wins will come from parallelism. We can build chips with dozens, hundreds of cores even, the problem is how can we make use of all of those cores?

Two answers to this question which architects have messed with over the years are the “split architecture”, and the modern derivative thereof “assist threads”.

The Windup

As I previously wrote, caches have been a must in computer architectures for a long long time. DRAM has always been slow and compared to CPUs has been slow to improve as well leading to a huge latency mismatch between cores and memory. However analysis of program behavior shows that there is a huge amount of data locality in the way that we tend to write programs. That is we rarely actually work with several gigabytes of data “at the same time” (in a few thousand clock cycles). Instead there are small hot regions of memory such as program segments and data segments which are accessed over and over again.

Intel and other chip manufacturers have long exploited this behavior with two and three level nested cache systems which succeed in reducing memory read times by 100x or more (1), (2).

However caches much like memories are not infinite no matter how harmful we may consider this and in fact derive much of their decreased access time from comparatively small size. (memory read times increase with the layout edge lengths, approximately twice the square root of their size) And this introduces the problem of what do we store in the cache so that we maximize the cache utilization?

Replacement Policies

There is an entire field of cache replacement policies, which attempt to predict which entries in a cache will not be accessed again. An ideal cache with perfect knowledge of the future access pattern will never replace an item which will be used again so soon that its eviction and replacement will cost more time than simply leaving it in the cache. I’ll say more about this another time, but this is something modern hardware has gotten very good at.

Data Prefetching

As previously explained in other works, caches are great when you can read from them, but can’t save you when the data you want isn’t cached. Because missing in the cache can cost hundreds of cycles, it’d be great if we could use some hardware or software mechanism(s) to predict what data will be needed in the cache and dedicate some fraction of the cache to storing speculated values. This is called prefetching (3) turns out to work quite well.

The obvious algorithm, stream prefetching, tries to recognize sequential scans as generated by memcpy, strcmp etc. and prefetch cache lines before you read them. While scanning a region this can eliminate the dreaded latency of reading main memory and in the worst case fetches too much data creating some tiny amount of cache pollution at the end of the scan.

Stride prefetching tries to recognize a similar streaming pattern akin to walking an array of records accessing some single field in each record. That is memory reads where the nth read is some linear function addr(n)=base+n*offset for offset greater than the size of a cache line.

The fundamental limitation with these and other memory access prediction technologies however is that they deal very poorly with indirection. The classical example of this is that in object oriented programs many layers of object indirection traversal may be required before the CPU can actually begin performing computation. Furthermore as objects may in the worst case occur anywhere in memory dependent on the behavior of the memory allocation engine and/or garbage collector as in general there exists no stride or stream which you could predict. There exist techniques (4) among them for predicting such accesses, but they have not seen wide adoption.

Split Architectures

In the ’90s, computer architects proposed what was called a “split architecture”, a design where each CPU features an additional programmable cache control processor. The concept was that for any program P you could write a program Q such that Q “knows” the memory access pattern(s) of P and can provide perfect cache management for P.

This is in theory superior to any hardware based prediction mechanism, because the programmer or the compiler can predict arbitrary memory access patterns no matter how irregular or seemingly unstructured.

Lots of really smart researchers messed with this idea, and nobody managed to get it to really work in part because predicting the memory behavior of a program is a lot like predicting its branching behavior. It can’t really be done without a total understanding of the inputs which the program will receive. Even if it can be done, as with software branch prediction especially on general purpose programs hardware mechanisms will quickly (in a few hundred cycles or less) learn all the behavior which software techniques are able to predict ahead of time and thereafter are able to predict. So split architectures have been largely a failure.

The Pitch

Which brings us at long last to the idea of an assist thread. Writing the cache control program called for in a split architecture is really hard in the best case. I would argue that it’s generally intractable, and certainly something most programmers shouldn’t mess with. But we already have the program we want to run and we have more than one CPU core with nothing better to do. So what if we ran two or more instances of our program, with one actually doing the work and the others speculatively executing loads? We could see what the other instances are loading and decide how to populate the cache for the “real” instance based on what the others load.

This is way easier for both the compiler and the programmer than the split architecture, because we just punt on the whole cache management problem to the hardware cache management mechanisms we’ve developed (which are quite good). And it does a really good job of predicting memory accesses! What other computation than the program which we wish to run will exhibit or predict the same memory access pattern?

Now is this a win or not compared to having some general predictive hardware mechanism? I honestly don’t know. Engineering is about taking constant factors and pushing them around to locally optimize for what we can do here and now. Here and now both from a manufacturing and power consumption standpoint I’m hard pressed to believe that it makes sense to take a pair of beefy out of order processors and throw one of them away essentially to accelerate the other.

However if we imagine some machine with hundreds of inexpensive cores, I think that this technique begins to make a lot of sense. Given the choice between two middling CPUs which could team up to approximate a better CPU and a better CPU which invests a bunch of chip area in predictive hardware structures clearly the two CPUs is a more generally useful machine configuration because it can run both the set of tasks which the better core would be useful for and a potentially interesting set of much smaller tasks on which the large core would be wasted.

This theme of many, many cheap cores and their relative worth to the large cores which we have today is something which will come up over and over again because engineering is in the constants and the constants are making the future of big cores less and less viable.