Simple Penalty-Sensitive Cache Replacement Policies
Journal article, 2008

Classic cache replacement policies assume that miss costs are uniform. However, the correlation between miss rate and cache performance is not as straightforward as it used to be. Ultimately, the true performance cost of a miss should be its access penalty, i.e., the actual processing bandwidth lost because of the miss. Unlike the penalty of loads, the penalty of stores is mostly hidden in modern processors. To take advantage of this observation, we propose a simple scheme to replace load misses with store misses. We extend LRU (Least Recently Used) to reduce the aggregate miss penalty instead of the miss count. The new policy is called PS-LRU (Penalty-Sensitive LRU) and is used throughout most of this paper. PS-LRU systematically evicts first a block predicted to be accessed next with a store. This policy minimizes the number of load misses to the detriment of store misses. One key issue in this policy is predicting the next access type to a block, so that higher replacement priority is given to blocks that will be accessed next with a store. We introduce and evaluate various instruction-based prediction schemes broadly inspired by branch predictors. To guide the design, we run extensive trace-driven simulations on eight SPEC95 benchmarks with a wide range of cache configurations and observe that PS-LRU yields positive load miss improvements over classic LRU across most of the benchmarks and cache configurations. In some cases the improvements are very large. Although the total number of load misses is minimized under our simple policy, the number of store misses and the amount of memory traffic both increase. Moreover, store misses are not totally "free". To evaluate this trade-off, we apply DCL and ACL (two previously proposed cost-sensitive LRU policies) to the problem of load/store miss penalty. These algorithms are more competitive than PS-LRU. Both DCL and ACL provide attractive trade-offs in which fewer load misses are saved than in PS-LRU, but the store miss traffic is reduced.
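
The victim-selection rule described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes that a cache set is a non-empty list ordered from least to most recently used, and that each block carries a flag, here called `next_is_store`, supplied by some predictor. The names `Block`, `next_is_store`, `lru_victim`, and `ps_lru_victim` are hypothetical, and the predictor itself is not modeled.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Block:
    tag: int
    # Hypothetical predictor output: True if the next access to this
    # block is predicted to be a store. How the prediction is made is
    # outside the scope of this sketch.
    next_is_store: bool


def lru_victim(blocks: List[Block]) -> int:
    """Classic LRU: evict the least recently used block (index 0).

    Assumes `blocks` is non-empty and ordered from LRU to MRU.
    """
    return 0


def ps_lru_victim(blocks: List[Block]) -> int:
    """Penalty-sensitive choice, as sketched from the abstract.

    Prefer the least recently used block among those predicted to be
    accessed next with a store; if no block is so predicted, fall back
    to classic LRU. Assumes `blocks` is non-empty, ordered LRU to MRU.
    """
    for index, block in enumerate(blocks):
        if block.next_is_store:
            return index
    return lru_victim(blocks)


if __name__ == "__main__":
    cache_set = [
        Block(tag=1, next_is_store=False),  # LRU position
        Block(tag=2, next_is_store=True),
        Block(tag=3, next_is_store=False),
        Block(tag=4, next_is_store=True),   # MRU position
    ]
    print("LRU victim tag:", cache_set[lru_victim(cache_set)].tag)
    print("PS-LRU victim tag:", cache_set[ps_lru_victim(cache_set)].tag)
```

In this example classic LRU would evict the block in the LRU position, while the penalty-sensitive rule skips it in favor of the oldest block predicted to be stored to next, so that a later miss on the evicted block is more likely to be a store miss, whose penalty is mostly hidden.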

Authors

J Jeong

Intel Corporation

Per Stenström

Chalmers, Computer Science and Engineering, Computer Engineering

Michel Dubois

University of Southern California

Journal of Instruction-Level Parallelism

1-24

Subject Categories

Computer and Information Science