Strategies to Reduce Energy and Resources in Chip Multiprocessor Systems
The chip multiprocessor (CMP), an architectural style in which two or more processor cores are manufactured on the same die, has recently emerged. It promises high performance for applications with abundant thread-level parallelism (TLP) and shorter design times owing to its modular design. However, a new architectural paradigm also introduces new design challenges.
This thesis addresses the technical problem of how to design CMP systems that are more efficient in terms of energy and memory utilization. It contributes design strategies in three categories: design principles that reduce energy dissipation and main-memory resources without reducing performance; design recommendations for balancing the exploited instruction-level parallelism (ILP) and TLP in a CMP; and a methodology for reducing simulation time when evaluating future designs.
First, two of the proposed design principles together reduce the energy dissipation in the L1 caches and translation-lookaside buffers by 30%. Second, it is shown that ten times longer access latency can be tolerated for 70% of the main memory, which compression techniques can exploit to reduce the amount of memory by 30%. A novel compression scheme applied to the entire main memory is also proposed and evaluated, and is shown to reduce the required memory resources by 30%. None of these reductions has a significant negative effect on performance.
Further, the trade-off between TLP and ILP is studied for applications with an abundance of TLP under a fixed area constraint. Four design points, ranging from 16 single-issue cores to two eight-issue cores, are evaluated. The design point with four four-issue cores achieves close to optimal performance while still enabling single-threaded applications to run well.
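The design space described above can be sketched in a few lines. This is only an illustration: it assumes that the fixed area constraint can be approximated by a constant aggregate issue width (cores times issue width equal to 16), which fits the three design points named in the text and implies a fourth point of eight two-issue cores. The names `DesignPoint` and `design_points` are hypothetical and are not taken from the thesis.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DesignPoint:
    cores: int
    issue_width: int


def design_points(issue_budget: int = 16, max_width: int = 8) -> list[DesignPoint]:
    """Enumerate power-of-two issue widths whose cores * issue_width equals the budget.

    Assumption (not from the thesis): aggregate issue width stands in for chip area.
    """
    points = []
    width = 1
    while width <= max_width:
        if issue_budget % width == 0:
            points.append(DesignPoint(cores=issue_budget // width, issue_width=width))
        width *= 2
    return points


if __name__ == "__main__":
    for p in design_points():
        print(f"{p.cores:2d} cores x {p.issue_width}-issue")
```

Under this simplification, moving along the list trades TLP (more cores) for ILP (wider cores) at roughly constant cost. A real area model would not be linear in issue width, so the sketch only shows the shape of the trade-off.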
Finally, a sampling technique for single-processor simulation is applied to multiprocessors. For one class of applications, the technique is more efficient on multiprocessors: the number of simulation points required decreases linearly with the number of processors. Another statistical technique is then proposed for both single-processor and multiprocessor systems; it reduces the required number of simulation points by an order of magnitude.
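To make the notion of a "required number of simulation points" concrete, the following sketch uses the standard sample-size estimate from sampling theory, n = (z * cv / e)^2, where cv is the coefficient of variation of the sampled metric, e is the tolerated relative error, and z is the normal quantile for the chosen confidence level. It is not the technique proposed in the thesis. The function names are hypothetical, and the multiprocessor case assumes, purely for illustration, that the metric is averaged over p processors whose behaviour is independent, so that cv shrinks by a factor of sqrt(p).

```python
import math
from statistics import NormalDist


def required_samples(cv: float, rel_error: float, confidence: float = 0.95) -> int:
    """Sample count needed so the mean estimate is within rel_error at the given confidence."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2.0)  # two-sided normal quantile
    return math.ceil((z * cv / rel_error) ** 2)


def required_samples_multiprocessor(cv_single: float, processors: int,
                                    rel_error: float, confidence: float = 0.95) -> int:
    """Illustrative assumption: averaging over independent processors divides cv by sqrt(p)."""
    return required_samples(cv_single / math.sqrt(processors), rel_error, confidence)


if __name__ == "__main__":
    for p in (1, 2, 4, 8):
        n = required_samples_multiprocessor(cv_single=1.0, processors=p, rel_error=0.03)
        print(f"{p} processors: {n} simulation points")
```

Because n grows with the square of cv, dividing cv by sqrt(p) divides n by roughly p. This is one way a linear reduction with processor count could arise under the stated independence assumption.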