Techniques to Reduce Thread-Level Speculation Overhead
Traditional single-core processors are being replaced by chip multiprocessors (CMPs), in which several processor cores are integrated on a single chip. While this is beneficial for multithreaded applications and multiprogrammed workloads, CMPs do not provide performance improvements for single-threaded applications. Thread-level speculation (TLS) has been proposed as a way to improve single-thread performance on such systems. TLS is a technique in which programs are aggressively parallelized at run-time: threads speculate on data and control dependences, but must be squashed and restarted if a dependence violation occurs. Unfortunately, various sources of overhead create a major performance problem for TLS.
This thesis quantifies the impact of overheads on the performance of TLS systems, and suggests remedies in the form of a number of overhead-reduction techniques. These techniques target run-time parallelization that does not require recompilation of sequential binaries. The main source of parallelism investigated in this work is module continuations, i.e., a function or method is run in parallel with the code following the call instruction. Loops are another source.
Run-length prediction, a technique aimed at reducing the number of short threads, is introduced. An accurate predictor that avoids short threads, or dynamically unrolls loops to increase thread lengths, is shown to improve speedup for most of the benchmark applications. Another novel technique is misspeculation prediction, which can remove most of the TLS overhead by reducing the number of misspeculations.
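As a rough illustration of the idea behind run-length prediction, the following minimal sketch assumes a simple last-value predictor indexed by call site. The class name, the `threshold` parameter and its value, and the choice of a last-value scheme are all hypothetical and chosen for illustration; they are not taken from the thesis, whose actual predictor design is not described here.

```python
class RunLengthPredictor:
    """Hypothetical last-value run-length predictor (illustrative only)."""

    def __init__(self, threshold=50):
        # Minimum predicted run length (in instructions) that justifies
        # spawning a speculative thread. The value 50 is an arbitrary
        # placeholder, not a figure from the thesis.
        self.threshold = threshold
        self.last_length = {}  # call-site id -> last observed run length

    def should_spawn(self, site):
        # Sites without history are speculated on optimistically.
        length = self.last_length.get(site)
        return length is None or length >= self.threshold

    def update(self, site, observed_length):
        # Record how long the module called at this site actually ran.
        self.last_length[site] = observed_length


predictor = RunLengthPredictor(threshold=50)
first = predictor.should_spawn("site_a")   # no history yet
predictor.update("site_a", 12)             # the module ran only 12 instructions
second = predictor.should_spawn("site_a")  # now predicted to be short
predictor.update("site_b", 400)
third = predictor.should_spawn("site_b")   # long enough to be worth a thread
```

In this sketch a thread is spawned only when the previous execution from the same site was long enough to amortize the thread-management overhead. A real predictor would likely need hysteresis or confidence counters to cope with run lengths that vary between invocations.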
The interaction between thread-level parallelism and instruction-level parallelism is studied: in many cases, both sources can be exploited for additional performance gains, but in some cases there is a trade-off. Communication overhead and memory-level parallelism are found to play an important role. For some applications, prefetching from threads that are squashed contributes more to speedup than parallel execution. Finally, faster inter-thread communication is found to give simultaneous multithreaded (SMT) processors an advantage as the basis for TLS machines.