Architectures and Compilation Techniques for a Data-Driven Processor Array
Doctoral thesis, 1994
This thesis presents two processor array architectures and a program transformation technique, aimed at the efficient exploitation of fine-grain parallelism. The class of architectures considered is fine-grain processor arrays with a data-driven execution model.
Scheduling of subcomputations and allocation of processing resources are a main concern of the work presented in this thesis. More specifically, the thesis examines the feasibility of using static (compile-time) methods for scheduling and allocation, as well as the trade-off between static and dynamic methods that must be made in the design of a parallel computer.
There are three main parts in the thesis, describing two different processor array architectures and a program transformation technique based on partial evaluation.
The first part describes the Function Processor, a wavefront array architecture that relies extensively on static allocation of processing resources. The main result of the work on the Function Processor is the combination of the architecture with a compilation technique, which together constitute a generally programmable wavefront array that does not suffer from the difficulty of programming normally associated with this type of architecture. However, experimental results show that for normal, irregular computations the utilization, and consequently the cost/performance ratio, of this architecture is poor.
The second part of the thesis investigates partial evaluation as a means of solving these problems by transforming programs to a more regular form, using known properties of the input values to a computation. The most important effect of this is a reduction in control-flow operations, which makes the control flow less dependent on the values of input data and, hence, more regular. The experimental results show that even with comparatively little information about input data, considerable performance improvements can be achieved. Speed-ups in the range of 5 to 30 were obtained using information about the size of input data structures. The source of the speed-up is a combination of a decreased dynamic instruction count and increased parallelism.
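The effect described above can be illustrated with a minimal sketch, not taken from the thesis: specializing a dot product for a statically known vector length unrolls the loop, removing the loop test and index arithmetic so that the residual code contains no data-dependent control flow. The function names `dot_general` and `specialize_dot` are illustrative only.

```python
# Illustrative sketch of partial evaluation on input size (not the
# thesis compiler): when the vector length n is known at specialization
# time, the loop can be unrolled into straight-line code.

def dot_general(xs, ys):
    """General version: control flow depends on the input size."""
    acc = 0
    for i in range(len(xs)):   # dynamic loop test on every iteration
        acc += xs[i] * ys[i]
    return acc

def specialize_dot(n):
    """Hypothetical partial evaluator: n is known statically, so the
    residual program is straight-line code with no loop test."""
    body = " + ".join(f"xs[{i}] * ys[{i}]" for i in range(n))
    src = f"def dot_{n}(xs, ys):\n    return {body or 0}\n"
    env = {}
    exec(src, env)             # generate and load the residual program
    return env[f"dot_{n}"]

dot3 = specialize_dot(3)
print(dot3([1, 2, 3], [4, 5, 6]))        # 1*4 + 2*5 + 3*6 = 32
print(dot_general([1, 2, 3], [4, 5, 6])) # same result, dynamic control flow
```

The residual function computes the same result with a lower dynamic instruction count, and its straight-line form exposes the multiplications as independent, parallel operations, matching the two sources of speed-up identified in the abstract.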
The third part of the thesis presents an execution model for a multithreaded processor array. This execution model is embodied in an abstract machine called S-TAM (Static Threaded Abstract Machine), which offers a flexible trade-off between static and dynamic methods for scheduling and allocation. It is based on multithreading, which supplies the flexibility in scheduling, and on an organization consisting of multiple statically allocated processors with multiple dynamically allocated functional units, which gives flexibility in the allocation of processing resources. There are thus many possible configurations of S-TAM, differing in the number of processors and the number of functional units. Experimental results indicate that, for most computations, a configuration offering a mix of static and dynamic allocation, i.e. a configuration with multiple processors and multiple functional units per processor, offers the best cost/performance ratio.