One way to do this is to increase the associativity of the cache and to adopt a better cache placement/replacement technique that minimizes the number of cache misses.
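To fix the setting for the discussion that follows, the C sketch below shows the basic structure of a set-associative lookup with a pluggable replacement hook; it is a minimal illustration only, and all sizes, names, and parameters in it are assumptions rather than the configuration used in our experiments.

    /* Minimal sketch of a set-associative cache lookup with a pluggable
     * replacement hook. All names and parameters are illustrative only. */
    #include <stdint.h>
    #include <stdbool.h>

    #define NUM_SETS  1024           /* assumed: 1024 sets           */
    #define ASSOC     8              /* assumed: 8-way associativity */
    #define LINE_BITS 6              /* assumed: 64-byte cache lines */

    typedef struct {
        bool     valid;
        bool     dirty;
        uint64_t tag;
        uint64_t last_use;           /* bookkeeping for the policy */
    } line_t;

    static line_t cache[NUM_SETS][ASSOC];

    /* Replacement policy: given a full set, return the way to evict.
     * Swapping this function is what "a better placement/replacement
     * technique" amounts to in this sketch. */
    typedef int (*replace_fn)(line_t set[ASSOC]);

    static bool access_cache(uint64_t addr, replace_fn choose_victim,
                             uint64_t now)
    {
        uint64_t set_idx = (addr >> LINE_BITS) % NUM_SETS;
        uint64_t tag     = addr >> LINE_BITS;
        line_t  *set     = cache[set_idx];

        for (int w = 0; w < ASSOC; w++) {          /* hit? */
            if (set[w].valid && set[w].tag == tag) {
                set[w].last_use = now;
                return true;
            }
        }
        int victim = choose_victim(set);           /* miss: pick a victim */
        set[victim] = (line_t){ .valid = true, .dirty = false,
                                .tag = tag, .last_use = now };
        return false;
    }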
There is an important trade-off here: by increasing the complexity of the cache placement/replacement decisions, we slow down every memory reference that passes through the cache. For such a scheme to be worthwhile, the benefit, namely the reduction in the number of cache misses multiplied by the time saved by a cache hit relative to a memory access, must exceed the cost incurred to make the scheme feasible. We believe, however, that as the gap between memory and processor performance continues to widen, many schemes that are infeasible today will become feasible in the future [2].
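To make this break-even condition concrete, the short C sketch below compares the benefit and the cost for one assumed set of numbers; the access count, miss rates, and latencies are illustrative placeholders, not measurements from our study.

    /* Back-of-the-envelope break-even check for a more complex policy.
     * All latencies and rates below are assumed for illustration only. */
    #include <stdio.h>

    int main(void)
    {
        double accesses      = 1e9;   /* L2 accesses                    */
        double base_missrate = 0.10;  /* assumed baseline L2 miss rate  */
        double new_missrate  = 0.08;  /* assumed miss rate under policy */
        double miss_penalty  = 200.0; /* cycles saved per avoided miss  */
        double added_latency = 2.0;   /* extra cycles per L2 access     */

        double benefit = accesses * (base_missrate - new_missrate) * miss_penalty;
        double cost    = accesses * added_latency;

        printf("benefit = %.3g cycles, cost = %.3g cycles -> %s\n",
               benefit, cost, benefit > cost ? "worthwhile" : "not worthwhile");
        return 0;
    }

With these assumed numbers the policy breaks even once the miss rate drops by added_latency / miss_penalty, that is, by one percentage point; any smaller improvement would not pay for the slower hits.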
We carried out a study to explore this way of reducing memory latency and present the results in this paper. We looked at only one aspect of the complete solution, namely the number of cache misses that an L2 data cache suffers under different cache management strategies and as its size and associativity are varied. In the process, we introduce some uncommon cache placement/replacement policies that we believe are optimal for certain commonly occurring data access patterns. We do not provide quantitative measurements of whether such policies are feasible, but we have nevertheless tried to avoid schemes with extremely high space or time overheads, and for the less well-known schemes we provide an analytical estimate of their space-time overheads. Finally, once the setup was ready, only a slight extension was needed to explore the trade-off between dirty stores and cache misses, which we also quantify.
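As a rough illustration of the kind of bookkeeping such an extension involves, the sketch below counts misses and dirty writebacks per reference; the structure and names are hypothetical and are not taken from our simulator.

    /* Sketch of the bookkeeping needed to study the trade-off between
     * dirty stores and cache misses; names are illustrative only. */
    #include <stdbool.h>

    struct cache_stats {
        unsigned long misses;            /* lines brought in from memory */
        unsigned long dirty_writebacks;  /* dirty victims written back   */
    };

    /* On a miss a new line is brought in; if the evicted victim is dirty
     * it must first be written back to memory. Counting both lets us set
     * dirty stores against cache misses for each policy. */
    static void record_reference(struct cache_stats *st,
                                 bool hit, bool victim_dirty)
    {
        if (hit)
            return;
        st->misses++;
        if (victim_dirty)
            st->dirty_writebacks++;
    }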
In our study, we focused on the L2 cache since it is the first level of the cache hierarchy that is not constrained by the clock speed. Our other design goal was to use only the information derivable from the address stream that the cache has seen, that is, we restrict ourselves to history-based caching policies; otherwise, profiling-based techniques could be used to further reduce cache misses by adopting a more cache-conscious data placement [3][5]. We chose to study the data cache rather than the instruction cache because the data access patterns of a program are far less structured than its instruction access patterns, and therefore offer correspondingly more room for improving program performance. We seek to minimize only conflict and capacity misses, and do not address cold misses, for which techniques such as prefetching and larger cache lines already do a good job [4].
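To make the history-based restriction concrete, such a policy can be viewed as an interface whose only inputs are the addresses the cache has already observed. The C sketch below is one illustrative rendering of that restriction, with hypothetical names; it is not the interface of our simulator.

    /* Illustrative interface for a history-based replacement policy: its
     * only input is the stream of addresses the cache has observed, with
     * no profile or compiler information. Names are hypothetical. */
    #include <stdint.h>

    typedef struct policy policy_t;

    struct policy {
        /* Called on every reference so the policy can update its history. */
        void (*observe)(policy_t *self, uint64_t addr);
        /* Called on a miss in a full set: choose a way to evict, using only
         * the history accumulated through observe(). */
        int  (*choose_victim)(policy_t *self, uint64_t set_index, int assoc);
        void *state;   /* policy-private history (e.g., recency stamps) */
    };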
The rest of the paper is organized as follows. Section 2 describes our experimental setup. Section 3 presents the different schemes we tested, and Section 4 describes the benchmarks we used. In Section 5, we present our results and try to explain them based on our understanding of the policies. Finally, we state the conclusions we drew from this study.