Scalable analysis of multicore data reuse and sharing
Paper in proceeding, 2014
The performance and energy efficiency of multicore systems are increasingly dominated by the costs of communication. As hardware parallelism grows, developers require more powerful tools to assess the data sharing and reuse properties of their algorithms. The reuse distance is an effective metric to study the temporal locality of programs and model private and shared caches. But the application of this method is challenging. First, generating memory traces is very expensive in storage and very intrusive on execution, possibly distorting the parallel schedule. And second, the algorithm is computationally very expensive, limiting the length, memory size and parallelism of analyzable programs. This paper introduces a novel coarse-grained reuse distance method, called Kernel Reuse Distance (KRD), which addresses these challenges. KRD enables a quick assessment of data locality by studying the reuse characteristics of the kernels' inputs and outputs. We analyze the performance of the initial prototype implementation and show two use cases comparing different parallel implementations. On a 24-core system, analyzing a trace from a matrix multiplication representing 24 threads, 1.37 terabytes of streamed data and 800 million distinct accesses, the parallel KRD implementation is able to compute the coherence-aware kernel reuse distance histogram for one socket (six cores) in 11.1 seconds. © 2014 ACM.
multithreaded runtime systems
instrumentation
reuse distance
data reuse and sharing