
Experimental Results

In this section, we present the results of our study. First, we ran our cache management schemes on different benchmark programs to see how well they perform in practice; the results are shown in figure [*], with cache misses expressed as percentages. For this experiment, we emulated a 32 KB, 4-way set-associative cache with a 32-byte line size, except for 'epic', for which the cache size was 64 KB. We varied the cache size so that the different benchmarks would have approximately the same number of cache misses. For the sake of clarity, we have split the results into two graphs and used Bézier curves instead of straight lines. The victim cache in all experiments was ${\frac{1}{512}}$ of the main cache size, while the hot set in the hotornot scheme was ${\frac{1}{128}}$ of the cache size. The number of queues maintained for the LRU_over_LFU and LFU_over_LRU schemes was 2 in each case.
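For concreteness, the cache geometry above determines how an address is split into tag, set index and byte offset. The following is an illustrative sketch (not the simulator used in the study) for the 32 KB, 4-way, 32-byte-line configuration:

```python
# Illustrative address decomposition for a 32 KB, 4-way set-associative
# cache with 32-byte lines (the default configuration in this experiment).
CACHE_SIZE = 32 * 1024   # bytes
ASSOC      = 4           # ways per set
LINE_SIZE  = 32          # bytes per line

NUM_SETS    = CACHE_SIZE // (ASSOC * LINE_SIZE)   # 256 sets
OFFSET_BITS = LINE_SIZE.bit_length() - 1          # 5 bits of byte offset
INDEX_BITS  = NUM_SETS.bit_length() - 1           # 8 bits of set index

def split_address(addr):
    """Return (tag, set index, byte offset) for an address."""
    offset = addr & (LINE_SIZE - 1)
    index  = (addr >> OFFSET_BITS) & (NUM_SETS - 1)
    tag    = addr >> (OFFSET_BITS + INDEX_BITS)
    return tag, index, offset
```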

Figure: Cache Misses for different policies. Key to benchmarks: 1 epic, 2 cellauto.c.1000-100, 3 matmult.c.64, 4 qsort.c.50000, 5 queens.c.10, 6 wordfreq.c.rev

As expected, the Random policy performs quite badly, while the LRU and LFU policies work well, with LRU usually better than LFU. Interestingly, the SIDE algorithm also does well, with performance close to LFU despite maintaining far less state. The hybrid schemes do well too, usually falling between LRU and LFU in performance, which suggests that an adaptive mix of the two schemes would do well. Also, adding a victim cache improves performance tremendously, even though the replacement policy used there is FIFO, which otherwise does not perform well; this validates the claim that a victim cache can improve performance. Among the other schemes, the hotornot policy stands out. Our simple prefetch scheme is not able to detect many patterns, so in most cases its performance closely follows that of LRU. We do not present results for diffblock, since lack of time prevented us from running all the benchmarks on Alpha, but we observed that it does not perform as well as LRU. This suggests that a static division of the cache among the heap, stack and library address spaces will not work; we need to look at dynamic methods of dividing the available cache space, or at a profile-based division for each large program.
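The difference between the two baseline policies compared above comes down to which line a full set evicts. A minimal per-set model of LRU and LFU (an illustrative sketch, not the simulator used here) makes the distinction concrete:

```python
# Minimal model of a single cache set under LRU or LFU replacement.
# `trace` is a sequence of line tags; returns the number of misses.
def simulate(trace, ways, policy):
    lines = {}   # tag -> (last_use_time, use_count)
    misses = 0
    for t, tag in enumerate(trace):
        if tag in lines:
            _, cnt = lines[tag]
            lines[tag] = (t, cnt + 1)     # refresh recency, bump frequency
            continue
        misses += 1
        if len(lines) == ways:            # set is full: pick a victim
            if policy == "LRU":           # evict least recently used
                victim = min(lines, key=lambda k: lines[k][0])
            else:                         # LFU: evict least frequently used
                victim = min(lines, key=lambda k: lines[k][1])
            del lines[victim]
        lines[tag] = (t, 1)
    return misses
```

On the short trace `A A B C B` with a 2-way set, LRU keeps the recently touched `B` while LFU protects the frequently touched `A`, so the two policies miss a different number of times.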

Figure: Cache Misses for different cache sizes.

Figure: Cache Misses for different associativity.

Next, we wanted to find out how our policies perform as cache parameters change, so we varied associativity and cache size independently. All of these experiments were performed on a single benchmark: the Java compiler javac compiling a spreadsheet application. The results of varying the cache size from 8 KB to 128 KB are shown in figure [*]; this range represents the approximate sizes of caches available today. In fact, today's caches go up to 512 KB and beyond, but we limited ourselves to 128 KB because we ran into memory problems with larger cache sizes. All other parameters were the same as in the previous experiment. As expected, the cache miss rate drops with increasing cache size. In this experiment, hotornot emerges as a clear winner, and the performance difference is clearly perceptible at small cache sizes. As the cache size grows beyond the active data set of the program, only cold misses occur, and all policies converge to a common value; this also causes the miss rates of the ``bad'' policies to fall more sharply than the others' as the cache size is increased. Finally, there is no data point for the victim cache scheme at 8 KB, because the victim cache there would be only 16 bytes, which is not sufficient to store even a single line.
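The missing 8 KB data point follows directly from the 1/512 victim-cache sizing rule stated earlier; a quick check of the arithmetic (illustrative only):

```python
# The victim cache is 1/512 of the main cache; below a 16 KB main cache
# it becomes smaller than one 32-byte line and cannot hold anything.
LINE_SIZE = 32  # bytes

def victim_cache_bytes(main_cache_kb):
    return main_cache_kb * 1024 // 512

for kb in (8, 16, 32, 64, 128):
    vb = victim_cache_bytes(kb)
    print(f"{kb:>3} KB main cache -> {vb:>3} B victim cache, "
          f"holds a line: {vb >= LINE_SIZE}")
```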

For figure [*], we took a 16 KB cache with a 32-byte line size and studied the effect of varying the associativity from one (a direct-mapped cache) to 16-way set-associative, which we believe is the limit of feasibility: larger associativities would increase hit times too much, and in fact few modern processors have even 8-way set-associative caches. As expected (apart from some pathological cases, one of which is detailed above), the cache miss rate drops with increasing associativity. One exception was the side scheme, for which lower associativity works better; this suggests that for some data access patterns, the simple partitioning scheme of side does not work well when there are a large number of elements in a set. Our cache size is only 16 KB for this experiment because when we initially tried the experiment with a larger cache, the fall in cache miss rate with increasing associativity was barely noticeable. Thus a large cache can get away with lower associativity, which is embodied by the so-called $2:1$ cache rule of thumb: an $N$-sized cache with associativity 1 performs about as well as an $N/2$-sized cache with associativity 2. Thus low associativity is not a bad thing if our cache is large or, to put it another way, if our programs are not absolutely huge. We also see that the LRU_over_LFU policy performs relatively better as the associativity increases. This is because LRU generally performs better than LFU, and as associativity increases, the scheme can take more advantage of the LRU ordering between the sub-queues.
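The benefit of associativity discussed above is easiest to see with conflict misses: two blocks that map to the same set of a direct-mapped cache ping-pong forever, while a 2-way set of the same total capacity holds both. The following small model is an illustrative sketch, not the simulator used in the experiments:

```python
# Count misses for a set-associative cache with LRU replacement.
# Addresses are byte addresses; each set maps tag -> last_use_time.
def misses(trace, num_sets, ways, line_size=32):
    sets = [dict() for _ in range(num_sets)]
    count = 0
    for t, addr in enumerate(trace):
        s = (addr // line_size) % num_sets          # set index
        tag = addr // (line_size * num_sets)        # tag
        if tag in sets[s]:
            sets[s][tag] = t                        # hit: refresh recency
            continue
        count += 1
        if len(sets[s]) == ways:                    # evict LRU line
            lru = min(sets[s], key=sets[s].get)
            del sets[s][lru]
        sets[s][tag] = t
    return count

# Two blocks 4 KB apart, accessed alternately: in a 4 KB direct-mapped
# cache (128 sets of one 32-byte line) they conflict on every access,
# while a 2-way cache of the same size (64 sets) holds both after the
# two compulsory misses.
trace = [0x0000, 0x1000] * 8
```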

Lastly, as we have stressed before, the increase in hit time that a policy causes should not be neglected. We had no way of estimating this cost in our current setup; however, as a first cut at addressing the issue, we studied how many dirty blocks each scheme replaced. If a modified or ``dirty'' block is replaced, the cache miss penalty is likely to be higher, because the block has to be written out to memory.

To explore this trade-off, we implemented a simple second-chance algorithm to limit dirty replacements. We took a 4-way set-associative, 32 KB cache and measured how many dirty lines were evicted. The results are shown in Table [*]; the dirty miss rate is the percentage of evicted cache lines that were dirty.

Table: Cache Misses and Dirty Cache Replacements.

Benchmarks       LRU                                 LRU with $2^{nd}$ chance
                 Dirty Miss Rate  Cache Miss Rate    Dirty Miss Rate  Cache Miss Rate
cellauto.c       0.000000         0.007131           0.000000         0.007131
                 65.038391        1.230620           65.021670        1.230343
epic             43.031039        1.327477           43.009294        1.327847
go.alpha         72.217353        4.755708           72.109307        4.762191
                 77.174208        0.191514           76.342888        0.191739
                 24.949563        1.210390           24.865591        1.211204
                 55.758559        0.643140           55.702783        0.643678
linpack.c        86.575478        4.712255           86.562911        4.712869
                 86.308468        1.179416           86.201878        1.180649
matmult.c        9.936515         1.062921           9.932818         1.063317
                 16.543291        0.675483           16.527272        0.676071
qsort.c          71.599780        1.600386           71.585903        1.600334
                 71.653068        0.217880           71.600355        0.218012
queens.c         0.000000         0.002678           0.000000         0.002678
                 69.076642        0.321980           69.076642        0.321980
wordfreq.c.rev   40.539468        0.066736           39.968342        0.066894
                 44.442785        0.564729           44.137602        0.565691

What we would normally expect is that as we limit the dirty store (as in our second-chance algorithm), the number of cache misses suffered by the program would rise, and this is the case for most programs. But for some programs, for example qsort.c, both numbers fall, which is an ideal outcome. This suggests that the trade-off should be explored further on other benchmarks. Moreover, the gains from the second-chance algorithm in the favorable cases are barely perceptible, which suggests that more aggressive algorithms for constraining the write-back of dirty lines should be considered.
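One way such a dirty-aware second-chance rule can choose its victim is sketched below; this is a minimal illustrative model of the idea (prefer evicting a clean line, give a dirty LRU line one reprieve), not the exact implementation used in these experiments:

```python
# Pick a victim from a full set, preferring clean lines over dirty ones.
# Each line is a dict with 'tag', 'last_use', 'dirty' and 'spared' fields;
# a dirty LRU line is spared once ('second chance') before it can go.
def pick_victim(lines):
    order = sorted(lines, key=lambda l: l["last_use"])  # LRU order first
    for line in order:
        if not line["dirty"] or line["spared"]:
            return line["tag"]        # clean, or already used its chance
        line["spared"] = True         # dirty LRU line: spare it this time
    return order[0]["tag"]            # everything dirty and newly spared
```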

Amit K Manjhi 2001-12-04