next up previous
Next: Experimental setup Up: CS-740 Project Report Exploring Previous: CS-740 Project Report Exploring


The performance gap between the processor and memory has been widening for years. While memory access speeds improve by only about 25% annually, processor clock speeds increase by about 60% every year. The problem worsens with every new enhancement to the processor architecture.

Figure: The widening processor-memory performance gap.

This has made memory latency the bottleneck for any program that performs a significant number of memory accesses. An initial solution was to introduce a level of cache (a smaller, faster, but more expensive memory) between the processor and main memory, hiding the memory latency by exploiting the spatial and temporal locality present in real-world programs [1]. As the performance gap continued to widen, however, a single level of cache proved insufficient, and multiple levels of cache were needed to keep memory latency from becoming a serious bottleneck. In spite of this, memory latency remains a problem that will only worsen with time. Moreover, additions to the memory hierarchy are seeing diminishing returns: soon, adding yet another level of cache would increase, rather than decrease, the average memory access time. This makes it imperative to use the available cache space more judiciously and to look at other ways to decrease the average memory access time.
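The effect of adding cache levels can be seen in the standard average-memory-access-time (AMAT) recurrence, where each level hides part of the miss penalty of the level above it. A minimal sketch, with purely hypothetical latencies and miss rates (not measured values from this study):

```python
def amat(hit_time, miss_rate, miss_penalty):
    """Average memory access time: every access pays the hit time;
    misses additionally pay the miss penalty (all in cycles)."""
    return hit_time + miss_rate * miss_penalty

MEM = 100.0  # hypothetical main-memory latency, in cycles

# One level of cache versus two: the L2 term replaces the raw
# memory latency as the L1 miss penalty.
one_level = amat(1, 0.05, MEM)
two_level = amat(1, 0.05, amat(10, 0.20, MEM))
print(round(one_level, 3), round(two_level, 3))  # 6.0 2.5
```

Under these assumed numbers the second level cuts AMAT from 6.0 to 2.5 cycles, but each further level adds its own hit time to the recurrence, which is why the returns diminish.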

One way to do this is to increase the associativity of the cache and adopt a better cache placement/replacement policy that minimizes the number of cache misses.
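As an illustration of how associativity alone can remove conflict misses, the following is a minimal set-associative LRU cache model; the trace, line size, and cache dimensions are hypothetical and not those used in our experiments:

```python
from collections import OrderedDict

class SetAssocCache:
    """Toy set-associative cache with LRU replacement; counts misses."""
    def __init__(self, n_sets, ways, line_size=64):
        self.n_sets, self.ways, self.line = n_sets, ways, line_size
        self.sets = [OrderedDict() for _ in range(n_sets)]
        self.misses = 0

    def access(self, addr):
        block = addr // self.line
        s = self.sets[block % self.n_sets]
        if block in s:
            s.move_to_end(block)       # refresh LRU order on a hit
        else:
            self.misses += 1
            if len(s) >= self.ways:
                s.popitem(last=False)  # evict the least recently used block
            s[block] = True

# Two hot blocks that map to the same set: they ping-pong in a
# direct-mapped cache but coexist once associativity reaches 2.
trace = [0, 4096, 0, 4096, 0, 4096]
dm = SetAssocCache(n_sets=64, ways=1)   # direct-mapped
sa = SetAssocCache(n_sets=32, ways=2)   # same capacity, 2-way
for a in trace:
    dm.access(a)
    sa.access(a)
print(dm.misses, sa.misses)  # 6 2
```

With equal capacity, the 2-way cache turns four conflict misses into hits on this trace.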

There is an important trade-off here. By increasing the complexity of cache placement/replacement decisions, we slow down every memory reference that passes through the cache. The benefit, namely the reduction in the number of cache misses multiplied by the time saved by a cache hit relative to a memory access, must exceed the cost incurred to make any such scheme feasible. However, we believe that as the gap between memory and processor performance continues to grow, many schemes that are infeasible now will become feasible in the future [2].
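The break-even point of this trade-off can be estimated with back-of-the-envelope arithmetic; all of the numbers below are hypothetical, chosen only to illustrate the calculation:

```python
# Cost of a smarter policy: every access pays some extra hit latency.
accesses     = 1_000_000
extra_cycles = 2      # assumed added L2 hit latency from the complex policy
miss_penalty = 100    # assumed cycles saved per miss that now hits in cache

cost = accesses * extra_cycles
# The policy pays off once it removes at least this many misses:
break_even_misses = cost / miss_penalty
print(break_even_misses)  # 20000.0
```

Note that as the processor-memory gap grows, `miss_penalty` grows with it, so the number of misses a complex scheme must eliminate to break even shrinks, which is exactly why infeasible schemes may become feasible over time.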

We carried out a study exploring this way of reducing memory latency and present the results in this paper. We examined only one aspect of the complete solution: how the number of cache misses suffered by an L2 data cache varies across different cache management strategies and across cache sizes and associativities. In the process, we introduce some uncommon cache placement/replacement policies that we believe are optimal for certain commonly occurring data access patterns. We do not provide quantitative measurements of whether such policies are feasible, but we have tried to avoid schemes with extremely high space or time overheads; for the less well-known schemes, we provide analytical estimates of their space-time overheads. Finally, once the setup was ready, a slight extension enabled us to explore the trade-off between dirty stores and cache misses, which we also quantify.

In our study, we focused on the L2 cache since it is the first level of cache that is not constrained by the processor clock speed. Our other design goal was to use only information derivable from the address stream the cache has seen; that is, we focus solely on history-based caching policies. Otherwise, profiling-based techniques could be used to further reduce cache misses by adopting a more cache-conscious data placement [3,5]. We chose to study the data cache instead of the instruction cache because a program's data access patterns are far less structured than its instruction access patterns, and thus offer correspondingly more avenues for improving program performance. We seek to minimize only conflict and capacity misses, not cold misses, for which techniques like prefetching and larger cache lines already do a good job [4].
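The distinction between cold misses and the misses we target can be sketched with the classic 3C decomposition; the helper below is a hypothetical illustration, not part of our simulator:

```python
from collections import OrderedDict

def classify_misses(trace, cache_blocks, line=64):
    """Rough 3C split: cold = first touch of a block; capacity = a repeat
    reference that also misses in a fully-associative LRU cache of the
    same size. Any further misses a real set-associative cache suffers
    beyond these would be conflict misses."""
    seen, lru = set(), OrderedDict()
    cold = capacity = 0
    for addr in trace:
        b = addr // line
        if b not in seen:
            cold += 1
            seen.add(b)
        elif b not in lru:
            capacity += 1
        if b in lru:
            lru.move_to_end(b)            # refresh LRU order
        else:
            if len(lru) >= cache_blocks:
                lru.popitem(last=False)   # evict least recently used
            lru[b] = True
    return cold, capacity

# Three blocks cycled through a 2-block cache: 3 cold misses, then
# every revisit is a capacity miss.
print(classify_misses([0, 64, 128, 0, 64, 128], cache_blocks=2))  # (3, 3)
```

Prefetching and longer lines attack the `cold` term; the placement/replacement policies studied here attack only the other two.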

The rest of the paper is organized as follows. Section 2 describes our experimental setup. Section 3 discusses the different schemes we tested, and Section 4 describes the benchmarks we used. Section 5 presents our results and explains them based on our understanding of the policies. Finally, we present the conclusions we drew from this study.

Amit K Manjhi 2001-12-04