As expected, the Random policy performs quite badly, while the LRU and LFU policies work well, with LRU usually beating LFU. Interestingly, the SIDE algorithm also does quite well, with performance close to LFU, even though it maintains far less state. The hybrid schemes also do well, usually falling between LRU and LFU in performance, which suggests that an adaptive mix of the two schemes would do well. Adding a victim cache improves performance tremendously, even though the replacement policy used there is FIFO, which does not perform well on its own; this validates the claim that a victim cache can improve performance. Among the other schemes, the hotornot policy stands out. Our simple prefetch scheme is not able to detect many access patterns, so in most cases its performance closely follows that of LRU. We do not present results for diffblock, since we could not get all the benchmarks to run on Alpha in the time available, but we observed that it does not perform as well as LRU. This suggests that a static division of the cache among the heap, stack and library address spaces will not work, and that we need either a dynamic method of dividing the available cache space or a profile-based division tailored to each large program.
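For concreteness, the following sketch shows how a victim would be chosen within a single set under LRU and LFU. It is an illustration only: the field names, the timestamp/counter representation and the tie-breaking rule are assumptions for exposition and are not taken verbatim from our simulator.

\begin{verbatim}
#include <stdio.h>

typedef struct {
    unsigned long tag;        /* line tag                                  */
    unsigned long last_used;  /* timestamp of most recent access (for LRU) */
    unsigned long use_count;  /* accesses since the line was filled (LFU)  */
} cache_line;

/* LRU: evict the line whose most recent access is oldest. */
static size_t lru_victim(const cache_line *set, size_t ways)
{
    size_t v = 0;
    for (size_t i = 1; i < ways; i++)
        if (set[i].last_used < set[v].last_used)
            v = i;
    return v;
}

/* LFU: evict the least frequently used line, breaking ties by recency. */
static size_t lfu_victim(const cache_line *set, size_t ways)
{
    size_t v = 0;
    for (size_t i = 1; i < ways; i++)
        if (set[i].use_count < set[v].use_count ||
            (set[i].use_count == set[v].use_count &&
             set[i].last_used < set[v].last_used))
            v = i;
    return v;
}

int main(void)
{
    /* One 4-way set: way 2 is the oldest, way 3 the least frequently used. */
    cache_line set[4] = {
        { 0xa, 40, 9 }, { 0xb, 35, 7 }, { 0xc, 10, 8 }, { 0xd, 30, 2 }
    };
    printf("LRU evicts way %zu, LFU evicts way %zu\n",
           lru_victim(set, 4), lfu_victim(set, 4));
    return 0;
}
\end{verbatim}

The example set makes the difference visible: LRU evicts the stalest line (way 2) while LFU evicts the least-referenced one (way 3), even though the latter was touched more recently.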
Next, we wanted to find out how our policies perform as the cache parameters change. For this purpose we varied the associativity and the cache size independently. All of these experiments were performed on a single benchmark: the Java compiler javac compiling a spreadsheet application. The results of varying the cache size from 8 KByte to 128 KByte are shown in the figure. This range covers the approximate sizes of caches available today; in fact, today's caches go up to 512 KByte and beyond, but we limited ourselves to 128 KByte because we ran into memory problems at larger cache sizes. All other parameters were the same as in the previous experiment. As expected, the cache miss rate drops with increasing cache size. In this experiment hotornot emerges as a clear winner, and the performance difference is clearly perceptible at small cache sizes. Once the cache size grows beyond the active data set of the program, only cold misses remain and all policies converge to a common miss rate; this also causes the miss rates of the ``bad'' policies to fall more sharply than the others as the cache size is increased. Finally, there is no data point for the victim cache scheme at the 8 KByte size, because the corresponding victim cache is only 16 bytes, which is not enough to hold even a single line.
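To make the convergence point concrete, the sketch below counts cold (compulsory) misses by remembering which blocks have already been touched: the first reference to a block misses no matter which replacement policy is in use, and this count is the floor that all the curves meet once the cache holds the whole active data set. The toy trace, the 32-byte line size and the fixed table size are illustrative assumptions, not part of our simulator.

\begin{verbatim}
#include <stdbool.h>
#include <stdio.h>

#define LINE_SIZE  32      /* bytes per line (assumed for illustration)   */
#define MAX_BLOCKS 4096    /* large enough for this toy trace             */

static unsigned long seen[MAX_BLOCKS];
static size_t n_seen;

/* Return true exactly on the first reference to a block. */
static bool first_touch(unsigned long block)
{
    for (size_t i = 0; i < n_seen; i++)
        if (seen[i] == block)
            return false;
    seen[n_seen++] = block;    /* assumes the trace touches < MAX_BLOCKS blocks */
    return true;
}

int main(void)
{
    unsigned long trace[] = { 0x1000, 0x1004, 0x2000, 0x1008, 0x2000, 0x3000 };
    unsigned long cold = 0;

    for (size_t i = 0; i < sizeof trace / sizeof trace[0]; i++)
        if (first_touch(trace[i] / LINE_SIZE))
            cold++;

    /* Prints: cold misses: 3 of 6 references */
    printf("cold misses: %lu of %zu references\n",
           cold, sizeof trace / sizeof trace[0]);
    return 0;
}
\end{verbatim}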
For the next figure, we took a 16 KByte cache with a 32-byte line size and studied the effect of varying the associativity from one (a direct-mapped cache) to 16-way set-associative, which we believe is the limit of feasibility: larger associativities would increase the hit time too much, and in fact few modern processors have even 8-way set-associative caches. As expected (apart from some pathological cases, one of which is detailed above), the cache miss rate drops with increasing associativity. One exception was the side scheme, for which lower associativity works better. This suggests that for some data access patterns, the simple partitioning scheme of side does not work well when there are a large number of lines in each set. The cache size is only 16 KByte in this experiment; we chose a smaller cache because when we initially tried the experiment with a larger cache, the fall in miss rate with increasing associativity was barely noticeable. Thus a large cache can get away with low associativity, which is captured by the so-called 2:1 cache rule of thumb: a direct-mapped cache of size $N$ has roughly the same miss rate as a 2-way set-associative cache of size $N/2$. In other words, low associativity is not a bad thing if the cache is large, or, put another way, if our programs are not enormous. We also see that the LRU_over_LFU policy performs relatively better as the associativity increases. This is because LRU generally performs better than LFU, and as the associativity increases, the scheme can take more advantage of the LRU ordering between its sub-queues.
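For reference, the snippet below shows how an arbitrary example address decomposes into set index and tag for this 16 KByte, 32-byte-line cache at each associativity in the sweep. The address is made up purely for illustration; the point is that higher associativity means fewer, larger sets, so lines that would conflict in a direct-mapped cache can coexist in the same set.

\begin{verbatim}
#include <stdio.h>

int main(void)
{
    const unsigned long cache_size = 16 * 1024;   /* bytes               */
    const unsigned long line_size  = 32;          /* bytes per line      */
    const unsigned long addr       = 0x4ac3f7e0;  /* arbitrary example   */

    for (unsigned long assoc = 1; assoc <= 16; assoc *= 2) {
        unsigned long sets  = cache_size / (line_size * assoc);
        unsigned long block = addr / line_size;   /* drop the offset bits */
        unsigned long set   = block % sets;
        unsigned long tag   = block / sets;
        printf("%2lu-way: %4lu sets, set index = %3lu, tag = 0x%lx\n",
               assoc, sets, set, tag);
    }
    return 0;
}
\end{verbatim}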
Lastly, as we have stressed before, the increase in hit time that a policy causes should not be neglected. We had no way of estimating this cost in our current setup. However, we felt the issue of policy overheads needed to be addressed, and as a first cut we studied how many dirty blocks each scheme replaces: if a modified, or ``dirty'', block is evicted, the effective miss penalty is likely to be higher because the block has to be written back to memory.
To explore this trade-off, we implemented a simple second chance algorithm to control dirty replacements. We took a 32 KByte, 4-way set-associative cache and measured how many dirty lines were evicted. The results are shown in the table; the dirty miss rate is the percentage of evicted cache lines that were dirty.
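As an illustration of the idea, and not a transcript of our simulator code, the sketch below shows one way a second chance for dirty lines can be realized: a dirty LRU victim is spared exactly once and the least recently used clean line is evicted in its place, falling back to the dirty line when no clean alternative exists. The field names and the exactly-one-spare rule are illustrative choices.

\begin{verbatim}
#include <stdbool.h>
#include <stdio.h>

typedef struct {
    unsigned long tag;
    unsigned long last_used;  /* timestamp of most recent access           */
    bool          dirty;      /* written since it was filled                */
    bool          spared;     /* dirty line already used its second chance  */
} cache_line;

/* Pick the line to evict from a full set of `ways` lines.  The `spared`
 * flag would be cleared whenever a line is refilled (not shown here). */
static size_t second_chance_victim(cache_line *set, size_t ways)
{
    /* Plain LRU victim first. */
    size_t victim = 0;
    for (size_t i = 1; i < ways; i++)
        if (set[i].last_used < set[victim].last_used)
            victim = i;

    if (set[victim].dirty && !set[victim].spared) {
        /* Look for the least recently used clean line instead. */
        size_t clean = ways;
        for (size_t i = 0; i < ways; i++)
            if (!set[i].dirty &&
                (clean == ways || set[i].last_used < set[clean].last_used))
                clean = i;
        if (clean != ways) {
            set[victim].spared = true;  /* second chance granted  */
            return clean;               /* no write-back needed   */
        }
    }
    return victim;  /* clean, already spared, or every line is dirty */
}

int main(void)
{
    /* First call spares dirty way 1 (the LRU victim) and evicts clean
     * way 0; the second call evicts way 1, which has used its chance. */
    cache_line set[4] = {
        { 0xa, 20, false, false }, { 0xb, 10, true,  false },
        { 0xc, 30, true,  false }, { 0xd, 40, false, false }
    };
    printf("evict way %zu\n", second_chance_victim(set, 4));
    printf("evict way %zu\n", second_chance_victim(set, 4));
    return 0;
}
\end{verbatim}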
What we would normally expect is that as we restrict the eviction of dirty lines (as our second chance algorithm does), the number of cache misses suffered by the program would rise. This is the case for most programs. However, for some programs, for example qsort.c and cellauto.java, both the miss rate and the number of dirty evictions fall, which is the best possible outcome. This suggests that the trade-off should be explored further on other benchmarks. Moreover, the gains from the second chance algorithm in the two favorable cases are barely perceptible, which suggests that algorithms that constrain the write-back of dirty lines more aggressively should be considered.