Data Prefetching Techniques Targeting Single and a Network of Processing Nodes
This thesis considers two approaches to the design of high-performance computers. In a single processing node with one processor, performance is degraded when requested data is not found in the cache, because it must be retrieved from slower memory. In a network of processing nodes, performance is further degraded when the requested data is not found even in the node's own memory, because it must be retrieved from the memory of another node. This thesis addresses both types of bottleneck with data prefetching techniques, which speculatively fetch data closer to the processor before the data is actually needed.
This thesis considers previously proposed data prefetching techniques and proposes new ones for processing nodes, both singly and in networks. The techniques were evaluated by implementing them in detailed simulation models of the systems under investigation. Their relative performance was established by executing programs on the models.
First, the effectiveness of caches for a decision-support system (DSS), a data-intensive application, is evaluated using a novel experimental approach that combines analytical modeling and simulation. It was found that, on a system with a single processing node, this application suffers performance-degrading cache misses that cannot be removed with reasonably sized caches. An analysis of previously proposed hardware data prefetching techniques shows that they are usually not effective for emerging applications such as DSS, even though the accesses they target are present in these applications. One type of access that the previous hardware techniques cannot prefetch is certain accesses to list and tree data structures. Previous software techniques are also shown to be inadequate for this task, so a new prefetching technique targeting such data structures is developed and evaluated. It is shown to be capable of prefetching accesses that previously proposed techniques could not. For some applications, this technique eliminated nearly all cache misses.
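To illustrate why accesses to list and tree structures are hard to cover with address-pattern-based hardware prefetching, the following minimal sketch compares an array traversal with a linked-list traversal. It is not the technique developed in the thesis. The stride predictor is an assumed stand-in for earlier hardware techniques, and the function name `stride_coverage`, the 64-byte spacing, and the scattered node addresses are all hypothetical choices made for illustration.

```python
import random


def stride_coverage(addresses):
    """Return the fraction of accesses a simple stride predictor gets right.

    The predictor (an illustrative assumption, not a specific published
    design) guesses that the next address equals the previous address plus
    the difference between the two most recent addresses.
    """
    if len(addresses) < 3:
        return 0.0
    hits = 0
    for i in range(2, len(addresses)):
        stride = addresses[i - 1] - addresses[i - 2]
        if addresses[i - 1] + stride == addresses[i]:
            hits += 1
    return hits / (len(addresses) - 2)


# Array traversal: elements are contiguous, 64 bytes apart.
array_addrs = [0x1000 + 64 * i for i in range(100)]

# Linked-list traversal: nodes sit at scattered addresses (hypothetical
# layout), so consecutive node addresses share no fixed stride.
rng = random.Random(0)
list_addrs = rng.sample(range(0x1000, 0x100000, 64), 100)

array_cov = stride_coverage(array_addrs)
list_cov = stride_coverage(list_addrs)
```

Under these assumptions the predictor covers every access of the array traversal, while it covers almost none of the list traversal, because the address of the next node is known only once the current node has been loaded.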
The performance bottlenecks in a baseline system consisting of a network of processing nodes are also identified. The main bottleneck is the high transmission overhead of the network interfaces, whereas the network switch provides ample transmission capacity. To lessen the impact of this overhead on the execution time of parallel applications, a prefetching technique was developed that prefetches data from the memories of other nodes. The novel feature of the technique is that it records accesses in the system and continually searches this record for repeated access patterns. When it finds such a pattern, it prefetches the data involved. The technique is found to reduce the stall time for the applications used. However, the system still experiences performance degradation due to synchronization messages, so a second prefetching technique, targeting synchronization data, was developed. This technique reduces both the synchronization stall time and the network traffic by up to three quarters for some applications.
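The idea of recording accesses and searching the record for repeated patterns can be sketched as follows. This is a minimal illustration under stated assumptions, not the mechanism evaluated in the thesis: the class name `HistoryPrefetcher`, the matching on the last `context` addresses, the `depth` parameter, and the unbounded history list are all hypothetical.

```python
class HistoryPrefetcher:
    """Toy history-based prefetcher (illustrative assumption only)."""

    def __init__(self, context=2, depth=1):
        self.history = []       # record of observed accesses, oldest first
        self.context = context  # number of recent accesses that must match
        self.depth = depth      # number of following addresses to prefetch

    def access(self, addr):
        """Record an access and return the addresses to prefetch, if any."""
        self.history.append(addr)
        n = len(self.history)
        if n <= self.context:
            return []
        recent = self.history[n - self.context:]
        # Search backwards for the most recent earlier occurrence of the
        # sequence formed by the last `context` accesses.
        for start in range(n - self.context - 1, -1, -1):
            if self.history[start:start + self.context] == recent:
                follow = start + self.context
                # Predict that what followed the pattern before follows again.
                return self.history[follow:follow + self.depth]
        return []


prefetcher = HistoryPrefetcher(context=2, depth=1)
trace = [10, 20, 30, 40, 10, 20]
issued = [prefetcher.access(a) for a in trace]
```

In this trace only the final access completes a previously seen pattern (10 followed by 20), so only that access returns a prefetch candidate, namely address 30. A real design would bound the history and limit the search cost. The linear scan is used here only for clarity.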